Stat 417 Lecture Notes

Chapter 1. Introduction and Overview
Chapter 2. Probability
- The Fundamental Principle of Counting
- The Number of Combinations
- Review of Set Theroy
- Sample Space, Events, and Probability
- Probability Rules
- Joint Probability and Contingency Tables
- The Law of Total Probability
- The Bayes’ Rule
Chapter 3. Discrete Random Variables and Their Distributions
- Discrete Distributions and Their Characteristics
- The Covariance and Correlation between Two Random Variables
- Special Discrete Distributions
Chapter 4. Continuous Distributions
- The Probability Density Function
- The Uniform Distribution
- The exponential Distribution
- The Gamma Distribution
- The Normal Distributions
Chapter 5. Computer Simulations and Monte Carlo Methods
- What is R?
- Coding in R
- An Exercise
Basics of R Programming
- Logical Operators, Logical Expressions, and Logical Values
- The if-else Conditional Statement
- The for and while Loops
- R Functions
The Pipe Operator %>%
Basic Data Structures in R
- Vectors
- Data Frames
- Lists
- Matrics
Writing Your Own Functions
- An Exercise
Installing and Loading Packages
DIY
- Question 1
- Question 2
- Question 3
Generating Samples from a Discrete Distribution using R
Generating Random Numbers from a Continuous Distribution: The Inverse Transform Method
Generating Random Numbers from a Continuous Distribution: The Rejection Method
Chapter 6. Stochastic Processes
- Classifications of Processes
- Markov Processes and Markov Chains
Counting Processes
The Binomial Process
The Poisson Process
Simulation of Stochastic Processes
Chapter 7 Queneing Systems

Chapter 1. Introduction and Overview

When tossing a coin twice, we may get a sum of 2, 3, 4, …, or 12 points, so the result is uncertain. What is the most likely sum?

Uncertainty refers to the situation where results or outcomes are not completely determined and depend on a number of factors and pure chance. It exists every moment in everyday life. Probability is a major concept developed for studying uncertainty.

A probability course differs from a statistics course.

Probability: in probability we start with a mechanism (called a probability model P), and we deduce properties of P. For example, imagine tossing a fair coin 10 times. What is the probability of getting 8 heads? Here, we do not actually toss the coin.
Statistics: in statistics we have data, which we regard as having been generated from some unknown probability model P. We want to be able to say some useful things about P. For example, we toss a coin 80 times and observe 43 heads. Is the coin fair? Or, is the probability of heads equal to 0.5? Here, the coin actually has been tossed 80 times.

Chapter 2 gives a formal discussion about probability.

Chapter 3 introduces he use of probability in simpler situations.

Chapter 4 introduces the use of probability in more complicated situations.

Chapter 5 uses computers to generate data based on a probability model.

Chapter 6 talks about the treatment of a sequence of related random observations (called a stochastic process).

Chapter 7 introduces a special process useful in the industry.

Chapter 2. Probability

In the situation where uncertainty arises, we may want to gauge the likelihood of events of interest. In mathematics, such a likelihood is measured by probability. A probability measure is a function, denoted by $P$, that assigns to each event a number between 0 and 1 inclusive depending on how likely the event is to occur.

The Fundamental Principle of Counting

A task can be completed in 2 steps. If there are $m$ ways in step one and $n$ ways in step two, then there are $m\cdot n$ ways of performing the task.

The result can be extended to any task that requires more than 2 steps. This is called the Fundamental Principle of Counting, or called the multiplicative rule of counting.

Example. Arrange 5 people to stand in a line. In how many ways can this be done?

Solution.

The task is to arrange 5 people (called A, B, C, D, and E) in a line. Five steps can be taken.

Step 1: Arrange A. There are 5 ways, since the person can be in any of the 5 positions in the line to be formed.

Step 2: Arrange B. There are 4 ways, since the person can be in any of the remaining 4 positions.

Step 3: Arrange C. There are 3 ways, since the person can be in any of the 3 remaining positions.

Step 4: Arrange D. There are 2 ways, since the person can be in any of the 2 remaining positions.

Step 5: Arrange E. There is only one way, since the person must be in the position that is available for them.

By the Fundamental Principle of Counting, there are $5\cdot 4\cdot 3\cdot 2\cdot 1$ or 24 ways to arrange 5 people in a line.

In general, if there are $n$ distinct objects to be arranged in order, there are $n\cdot (n-1)\cdot\cdots\cdot 3\cdot 2\cdot 1$ ways. This number is denoted $n!$. For example, $3! = 6, 4!=24, 6!=720$, but $0!=1$ (why?).

For your reference, here is a tutorial for creating mathematical notations in R: https://rpruim.github.io/s344s18/from-class/MathinRmd.html

Example. Arrange 5 people to stand in a line. Suppose that Tom and Jerry are two of the 5 people. In how many ways can they stand so that Tom and Jerry are not adjacent?

Solution.

To complete the task, we take 3 steps.

Step 1: Arrange the other three people first. There are 3! or 6 ways.

Step 2: Insert Tom to one of the 4 (why 4?) available places. There are 4 ways.

Step 3: Insert Jerry to one of the remaining (why 3?) available places. There are 3 ways.

By the multiplicative counting principle, there are $6\cdot 4\cdot 3=72$ ways to arrange the 5 people.

Exercise. Dave arrives at an airport which has twelve gates arranged in a straight line with exactly 100 feet between adjacent gates. His departure gate is assigned at random. After waiting at that gate, Dave is told the departure gate has been changed to a different gate, again at random.

How many possibilities are there?
How many possibilities are there such that Dave walks 400 feet or less to the new gate.

The Number of Combinations

Many time, we need to figure out the number of ways of choosing $k$ distinct objects from a set of $n$ distinct ones.

Example. In how many ways can 3 persons be chosen from 10 people?

Solution.

This is still a counting problem. The task is to find the number of ways of choosing 3 from 10 distinct objects. Instead of tackling the problem directly, we consider another task: In how many ways can we arrange 10 people in a line? Although we know the answer is $10!$, we still take 3 steps for this new task.

Step 1: Choose 3 persons. This can be done in $x$ ways. We need to determine $x$.

Step 2: Place the 3 chosen persons at the beginning of the line. This can be done in $3!$ ways.

Step 3: Place the remaining $(10-3)$ persons in the remaining positions. This can be done in $(10-3)!$ ways.

By the Fundamental Principle of Counting, the new task can be performed in $x(3!)(10-3)!$ ways.

However, we must have the equation:

\[x(3!)(10-3)! = 10!\] Solving the equation yields \[x = \frac{10!}{3!\cdot (10-3)!}\] In general, the number of ways of choosing $k$ distinct objects from a set of $n$ distinct ones is given by

\[\frac{n!}{k!\cdot (n-k)!}\] The quantity is often denoted as $\binom{n}{k}$; that is

\[\binom{n}{k}=\frac{n!}{k!\cdot (n-k)!}\] It’s easy to check that $\binom{12}{2}=66$ and $\binom{8}{3}=56$.

Example. A class has 20 students, 12 computer science majors and 8 not.

In how many ways can 5 students be chosen so that exactly 2 are computer science majors?
In how many ways can 5 students be chosen so that at least 2 are computer science majors?

Solution.

We use the FPC. The step 1 is to choose 2 computer science majors. The step 2 is to choose 3 students that are not computer science majors. The first step can be done in $\binom{12}{2}$ ways, and the second step can be done in $\binom{8}{3}$ ways. So, there are $\binom{12}{2}\cdot \binom{8}{3}=66\cdot 56 = 3696$.
This “at least” questions can be handled using the “case work” approach: “At least 2” is the same as 2, 3, 4, or 5 here. So, we need to count the number of ways in each case, and then add them up (called the addition rule of counting).

Case 1: Exactly 2 are computer science majors. We’ve done this and the number of ways is $\binom{12}{2}\cdot \binom{8}{3}$.

Case 2: Exactly 3 are computer science majors. The number of ways is $\binom{12}{3}\cdot \binom{8}{2}$.

Case 3: Exactly 4 are computer science majors. The number of ways is $\binom{12}{4}\cdot \binom{8}{1}$.

Case 4: Exactly 5 are computer science majors. The number of ways is $\binom{12}{5}\cdot \binom{8}{0}$.

Adding up these cases yields

\[\binom{12}{2}\cdot \binom{8}{3}+\binom{12}{3}\cdot \binom{8}{2}+\binom{12}{4}\cdot \binom{8}{1}+\binom{12}{5}\cdot \binom{8}{0}\].

We can also use a complementary method to find the answer:

Find the number of ways of exact one computer science major and the number of ways of no computer science major. Find the total number of ways of choosing any 5 students from a class of 20. Subtract the first two numbers from the third.

The counting methods we introduce will be very useful when we compute probability of an event.

Review of Set Theroy

Before we formally introduce probability, we review the set theory. A set, denoted usually by capital letters such as A, B, and C, is a collection of distinct objects. Examples of sets are A = {1, 2, 3}, B = {H, T}, C = [0, 100], and D = {x|x is a rational number}. A universal set is the set which has elements of all the related sets, without any repetition of elements.

Operations can be defined on sets. Let A and B be sets.

The union of A and B, denoted by $A ~\text{or}~ B$ or $A\cup B$, is the set that contains all elements from A and all elements from B, with common elements included only once.
The intersection of A and B, denoted by $A ~\text{and}~ B$ or $A\cap B$, is the set containing all elements that are both in A and in B.
The difference of A minus B, denoted by $A-B$, is the set containing all elements that are in A but not in B.
The complement of A, denoted by $A^c$ or $A'$, is the set containing all elements that are not in A.

It can be shown that $A-B=A\cap B^c$.

Set operations can be shown using the Venn Diagram:

https://texample.net/tikz/examples/set-operations-illustrated-with-venn-diagrams/

The letter $u$ in each of the above diagram represents the universal set.

Set operations enjoy some nice properties, two of which are:

$(A\cap B)^c=A^c\cup B^c$
$(A\cup B)^c=A^c\cap B^c$

The two properties are called the De Morgan’s Law.

Sample Space, Events, and Probability

Before you toss a coin, you can’t tell which side it will show up. When issuing a credit card to a customer, a company can’t tell whether the customer will default. These are examples of random phenomena where chances drive events to occur and predictions are either impossible (the coin one) or difficult (credit card one). For the second example, we will need to collect other relevant data about the customer in order to help make a prediction (chapter 4 regression matters!).

To gauge chances of random events (tails or default), we use probability. For the coin example, since tails and heads are equally likely and they are the only possible outcomes, the chance (or probability) that the coin lands on tails is said to be 0.5, 1/2, or 50%. The value of the probability must be between 0 and 1, with 0 indicating an impossible event and 1 indicating a sure event. The probability 0.5 here is an example of theoretical probabilities. For the credit card example, the chance (or probability) that a particular customer will default can’t be reasonably defined or determined, so this is an example of personal (or subjective) probabilities.

The sample space ($S$): A sample space is the collection of all possible outcomes when considering a random phenomenon. When tossing a coin, the sample space is the set ${T, H}$, where $T$ represents a tail and $H$ represents a head. When issuing a credit card to a customer, the sample space is the set of ${D, \bar{D}}$, where $D$ represents default and $\bar{D}$ represents no default. When tossing a dice of 6-sides, the sample space is the set of ${1, 2, 3, 4, 5, 6}$.

Events: An event is a subset of the sample space. For the dice example, an event could be the subset ${1, 3, 5}$ meaning that an odd number is obtained, while another event could be ${1, 5}$ meaning that either 1 or 5 is obtained. We use upper-case letters to denote random events, such as $A$, $B$, and $C$. The probability of event $A$ is denoted by $P(A)$.

Probability Rules

There are some rules about probability.

Rule 1: for any event $A$, $P(A)$ must be between 0 and 1; that is $0\le P(A)\le 1$.
Rule 2 (Probability Assignment Rule): $P(S)=1$; that is, the sample space as a special event always occur.
Rule 3 (Complement Rule): For any event $A$, $P(A)+P(A^c)=1$. Here $A^c$ represents the event that $A$ does not occur.
Rule 4 (Multiplication Rule): For any events $A$ and $B$, $P(A~ \text{and}~ B)=P(A)\cdot P(B|A)$. Here $P(B|A)$ is called the conditional probability of event $B$ given that event $A$ has occurred. When $A$ and $B$ are independent (whether one occurs does not affect whether the other occurs), $P(B|A)=P(B)$, and thus, $P(A~ \text{and}~ B)=P(A)\cdot P(B)$. If three events $A$, $B$, and $C$ are independent, then $P(A~ \text{and}~ B~ \text{and}~ C)=P(A)\cdot P(B)\cdot P(C)$.
Rule 5 (Addition Rule): For any events $A$ and $B$, $P(A~ \text{or}~ B)=P(A)+P(B)-P(A~ \text{and}~ B)$. When $A$ and $B$ are disjoint (or mutually exclusive) meaning that the two events can’t occur simultaneously, $P(A~ \text{and}~ B)=0$, and thus, $P(A~ \text{or}~ B)=P(A)+ P(B)$. Note the difference between independence and disjointness.

Example 1.

Assume that Tom, Jerry, and their father Michael default with probability 0.1, 0.2, and 0.3, respectively. Assume independence of default behavior for the three. Find

The probability that both Tom and Jerry default.
The probability that Tom or Jerry default.
The probability that all three default.
The probability that at least one of these three defaults.

Solution.

Let $A$ denote the event that Tom defaults. Let $B$ denote the event that Jerry defaults. Then, $P(A)=0.1$ and $P(B)=0.2$.

By the multiplication rule of probability, $P(A~ \text{and}~ B)=P(A)\cdot P(B)=0.1\cdot 0.2 = 0.02$, or $2\%$.
By the addition rule of probability, $P(A~ \text{or}~ B)=P(A)+P(B)-P(A~ \text{and}~ B)=0.1+0.2-0.02=0.28$.
By the multiplication rule of probability, $P(A~ \text{and}~ B~ \text{and}~ C)=P(A)\cdot P(B)\cdot P(C)=0.1\cdot 0.2\cdot 0.3 = 0.006$, or $0.6\%$.
By the complement rule, \[P(\text{at least one of these three defaults})=1-P(\text{none of these three defaults})\] \[=1-P(A^c~ \text{and}~ B^c~ \text{and}~ C^c)\] \[=1-P(A^c)\cdot P(B^c)\cdot P(C^c)=1-0.9\cdot 0.8\cdot 0.7=0.496\].

Exercise A biased coin lands heads with probability $p$. The coin is tossed $n$ times.

what is the probability that there is at least one head?
what is the probability that there is at most one head?
Given that there was at least one head in the $n$ tosses, what is the probability that there was exactly one head?

Joint Probability and Contingency Tables

A pivot table with rows and columns is called a contingency (or two-way) table. A contingency table must involve two categorical variables. For example, if we cross-classify people according to sex and party of affiliation, then the rows can be Male and Female, and the columns can be Democrat, Republican, and Independence.

Example 1.

There are 10,000 registered voters in a county.

5200 are female.
4600 are democrats, 3900 are republicans, and 1500 are independents.
54% republicans are male, 43% democrats are male, and 47.73% of independents are male.

Construct a contingency table for the data, with rows (Male or Female) and columns (Democrat, Republican, and Independent).
Determine

the percentage of females.
the percentage of males.
the percentage of democrats.
the percentage of republicans.
the percentage of independents.

Randomly choose a registered voter from this county. Determine

the probability that the selected is female.
the probability that the selected is male.
the probability that the selected is a democrat.
the probability that the selected is a republican.
the probability that the selected is an independent.
the probability that the selected is a female republican.

All of these probabilities are examples of the so-called marginal probability, each of which involves only one variable.

Determine

what percent of female voters are democrats.
what percent of democrats are female.

Solution.

The table is

	Democrat	Republican	Independence	Total
Female	2622	1794	784	5200
Male	1978	2106	716	4800
Total	4600	3900	1500	10000

2(a). $5200/10000=0.52=52\%$

2(b). $4800/10000=0.48=48\%$

2(c). $4600/10000=0.46=46\%$

2(d). $3900/10000=0.39=39\%$

2(e). $1500/10000=0.15=15\%$

3(f). $52\%$

3(g). $48\%$

3(h). $46\%$

3(i). $39\%$

3(j). $15\%$

3(k). 1794 registered voters are female republican, and this is 17.94% out of all registered voters, so the probability that a randomly selected registered voter is female republican is 17.94%. This is an example of the so-called joint probability, because it involves both categorical variables.

4(l). $2622/5200=0.5042$ or 50.42%. This is an example of conditional probability, since it is restricted to a subgroup (which determines the denominator).

4(m). $2622/4600=0.57$ or 57%. This is another conditional probability with a different denominator!

Joint probability can be computed using a probability tree.

Example. A bag contains 9 marbles, 4 red and 5 blue. The marbles are identical except color. Randomly select 2 marbles one by one from the bag without replacement. What is the probability that both marbles are red?

Solution.

The probability tree can be constructed as

https://mathsmadeeasy.co.uk/gcse-maths-revision/probability-tree-diagrams/

The tree diagram shows that $P(\text{Red in the first selection})=\frac{4}{9}$ and $P(\text{Red in the second selection|the first selection is red})=\frac{3}{8}$, so the probability that both selection are red is the product of the marginal probability and conditional probability, or $\frac{4}{9}\cdot \frac{3}{8}$, or $\frac{1}{6}$.

A nice video regarding probability tree is: https://www.youtube.com/watch?v=ql2qLe4UYK0. Another one is here: https://www.youtube.com/watch?v=dRdCUUgrwVw

In general, a joint probability equals the product of a marginal probability and a conditional probability.

The Law of Total Probability

Let a sample space $S$ (as a universal set) be decomposed into k disjoint subsets (events) denoted $A_1, A_2, \cdots, A_k$. The super set ${A_1, A_2, \cdots, A_k}$ is called a decomposition of the sample space.

Let $B$ be an event. Then, by a Venn Diagram showing the $B$ set and the $k$ subsets, we have

\[B=(B\cap A_1)\cup (B\cap A_2)\cup \cdots \cup (B\cap A_k)\] Since the subsets in parentheses are all disjoint, we have

\[P(B)=P(B\cap A_1) + P(B\cap A_2) + \cdots + P(B\cap A_k)\] Since each of the joint probabilities $P(B\cap A_i), i = 1, 2, \cdots, k$ can be written using the conditional probability formula, we have

\[P(B)=P(A_1)P(B|A_1) + P(A_2)P(B|A_2) + \cdots + P(A_k)P(B|A_k)\] This is the legendary Law of Total Probability.

Example 1. $20\%$ of a school’s computers are manufactured by company A, $30\%$ by company B, and remaining $50\%$ by company C. Suppose that $2\%$ of A computers are defective, $3\%$ of B computers are defective, and $2.5\%$ of C computers are defective.

A computer is randomly selected from the school, what is the probability that the computer is defective?
What percent of the school’s computers are defective?

Solution.

Each of the school’s computers has a unique ID, so all computers are distinct. Let $S$ be the set of all computers in the school. $S$ is a sample space. Let $A_1$ be the event that a randomly selected computer is from company A. Let $A_2$ be the event that a randomly selected computer is from company B. Let $A_3$ be the event that a randomly selected computer is from company C. We already know that $P(A_1) = 0.20, P(A_2)=0.30, P(A_3)=0.50$, and the super set {A_1, A_2, A_3} is a decomposition of the sample space. Let $B$ be the event that a randomly selected computer is defective. We also know that $P(B|A_1)=0.02, P(B|A_2)=0.03, P(B|A_3)=0.025$. By the Law of Total Probability, we have \[P(B)=P(A_1)P(B|A_1) + P(A_2)P(B|A_2) + P(A_3)P(B|A_3)=(0.20)(0.02)+(0.30)(0.03)+(0.50)(0.025)=0.004+0.009+0.0125=0.0255.\]
25.5% of the school’s computers are defective.

The Bayes’ Rule

Let $B$ be an event. The Law of Total Probability states that

\[P(B)=P(A_1)P(B|A_1) + P(A_2)P(B|A_2) + \cdots + P(A_k)P(B|A_k)\] How can we calculate these conditional probabilities $P(A_1|B), P(A_2|B), \cdots, \text{and}~P(A_k|B)$?

By the definition of conditional probability, we have $P(A_1|B)=\frac{P(A_1\cap B)}{P(B)}$. Then, apply the conditional probability formula again to the numerator and applying the Law of Total Probability to the denominator yield

\[P(A_1|B)=\frac{P(A_1)P(B|A_1)}{P(A_1)P(B|A_1) + P(A_2)P(B|A_2) + \cdots + P(A_k)P(B|A_k)}\] which is the legendary Bayes’ Law. Note that the numerator is one of the terms in the denominator!

Example 2. $20\%$ of a school’s computers are manufactured by company A, $30\%$ by company B, and remaining $50\%$ by company C. Suppose that $2\%$ of A computers are defective, $3\%$ of B computers are defective, and $2.5\%$ of C computers are defective. A computer is randomly selected from the school.

If the chosen computer is defective, what is the probability that the computer is from company A?
If the chosen computer is defective, what is the probability that the computer is from company B?
If the chosen computer is defective, what is the probability that the computer is from company C?

Solution.

Each of the school’s computers has a unique ID, so all computers are distinct. Let $S$ be the set of all computers in the school. $S$ is a sample space. Let $A_1$ be the event that a randomly selected computer is from company A. Let $A_2$ be the event that a randomly selected computer is from company B. Let $A_3$ be the event that a randomly selected computer is from company C. We already know that $P(A_1) = 0.20, P(A_2)=0.30, P(A_3)=0.50$, and the super set {A_1, A_2, A_3} is a decomposition of the sample space. Let $B$ be the event that a randomly selected computer is defective. We also know that $P(B|A_1)=0.02, P(B|A_2)=0.03, P(B|A_3)=0.025$. By the Law of Total Probability, we have \[P(B)=P(A_1)P(B|A_1) + P(A_2)P(B|A_2) + P(A_3)P(B|A_3)=(0.20)(0.02)+(0.30)(0.03)+(0.50)(0.025)=0.004+0.009+0.0125=0.0255.\]

$P(A_1|B)=\frac{0.004}{0.0255}=0.1569$
$P(A_2|B)=\frac{0.009}{0.0255}=0.3529$
$P(A_3|B)=\frac{0.0125}{0.0255}=0.4902$

DIY: Construct a probability tree for the problem.

Chapter 3. Discrete Random Variables and Their Distributions

This chapter deals with random variables and their distributions. A random variable is a variable whose value depends on the outcome of a random event. We use capital letters to denote random variables, such as $X$, $Y$, and $Z$.

Each of the following situations defines a random variable:

Toss a fair dice of 6 sides. Let $X$ denote the point turning up. $X$ is a random variable taking values 1, 2, …, 6, each with probability 1/6.
Let $Y$ denote the number of books a customer purchases in a book store. $Y$ is a random variable whose value can be 0, 1, 2, …. The probability of each value can not be determined in this situation.
The closing price, denoted by $Z$, of a particular stock (such as FB) two trading days later is also a random variable whose value can be anything positive.
Randomly choose a person from a country. Let $H$ denote the height of this person. $H$ is a random variable taking a positive value.

Here, random variables $X$ and $Y$ take values that are discrete. We call such random variables discrete random variables. The random variables $Z$ and $H$ can take any value in an interval, so we call them continuous random variables.

For any random variable $X$, the function $F_X(x)=P(X\le x)$ is called the cumulative distribution function ($CDF$). The function $F$ takes values between 0 and 1, and is non-decreasing.

If any two random variables $X$ and $Y$ are considered, the function $F_{X,Y}(x,y)=P(X\le x, Y\le y)$ is called the joint cumulative distribution function.

It’s easy to see that $F_X(x)=\displaystyle \lim_{y \to \infty} F_{X,Y}(x, y)$ and $F_Y(y)=\displaystyle \lim_{x \to \infty} F_{X,Y} (x, y)$.

Discrete Distributions and Their Characteristics

For each discrete random variable, we can list its possible values and the corresponding probabilities and organize them in a table. Such a table gives the so-called distribution (or probability model) of the random variable. We can also use a function to describe the distribution:

\[p_{X} (x)=P(X=x)\] which is called the probability mass function of the random variable $X$. For all possible values of $x$, the sum of $p_X (x)$ must equal 1.

When two discrete random variables are jointly considered, the joint distribution (or the joint probability mass function) is defined by \[p_{X,Y} (x,y)=P(X=x, Y=y).\] For all pairs of $x$ and $y$, the sum of $p_{X,Y} (x,y)$ must equal 1.

For two discrete random variables, the joint cumulative distribution function is defined by

\[F_{X,Y} (x,y)=P(X\le x, Y\le y).\]

A two-way table can be used to represent the joint distribution.

Example 1. What is the value of $c$, if $p(x) = cx, x = 1, 2, 3, 4$, and 5 is the probability mass function of a random variable?

Solution.

Since $p(1)=c, p(2)=2c, p(3)=3c, p(4)=4c, p(5)=5c$, and the sum of these must equal 1, we have $15c=1$, or $c=\frac{1}{15}$.

Example 2. Does the following table represent a joint distribution?

https://calcworkshop.com/joint-probability-distribution/joint-discrete-random-variables/

Solution.

Yes, since all the 9 decimal numbers in the table are between 0 and 1 and sum to 1.

If the joint distribution of two discrete random variables are given, we can find the distribution of each of the two individual random variables as follows:

$p_X(x) = \sum_y p(x,y)$ and $p_Y(y) = \sum_x p(x,y).$

These two distributions are called the marginal distributions. To get the marginal distribution of $X$, we do

\[p_X(1)=p(1,1)+p(1,2)+p(1,3)=0.32+0.03+0.01=0.36\] \[p_X(2)=p(2,1)+p(2,2)+p(2,3)=0.06+0.24+0.02=0.32\] \[p_X(3)=p(3,1)+p(3,2)+p(3,3)=0.02+0.03+0.27=0.32\] So, the (marginal) distribution of $X$ is

x	1	2	3
$p_X(x)$	0.36	0.32	0.32

Similarly, the (marginal) distribution of $Y$ can be obtained and it is

x	1	2	3
$p_Y(y)$	0.40	0.30	0.30

Example 3. Give the distribution (or probability model) for the random variable in each of the following situations.

Toss a fair dice of 6 sides. Let $X$ denote the point turning up.
Flip a fair coin twice. Let $Y$ denote the number of heads obtained.

Solution.

The distribution or probability model of $X$ is

x	1	2	3	4	5	6
$p_X(x)$	$\frac{1}{6}$	$\frac{1}{6}$	$\frac{1}{6}$	$\frac{1}{6}$	$\frac{1}{6}$	$\frac{1}{6}$

The distribution or probability model of $Y$ is

y	0	1	2
$p_Y(y)$	$\frac{1}{4}$	$\frac{1}{2}$	$\frac{1}{4}$

The reason: There are only 3 possibilities: two heads, two tails, and one head and one tail. X=0 means that both flips result in tails. The probability that each flip is a tail is 1/2. Since flips are independent, getting two tails has probability of (1/2)(1/2) or 1/4. X=2 means that both flips are heads. Since the coin is fair, the probability is also 1/4. Finally, the probability of getting one head and one tail is $1-1/4-1/2=1/4$.

We are often interested in answering the question: what value would we expect a random variable to be? Such a value is called the mean (or expected value) of the random variable.

For a discrete random variable, the expected value is the weighted average of the possible values taken by the random variable with weights being the associated probabilities. If the random variable is $X$, its expected value is denoted by $E(X)$, or $\mu_X$.

If the distribution looks like:

$x$	$x_1$	$x_2$	$x_3$	$\cdots$	$x_n$
$p(x)$	$p_1$	$p_2$	$p_3$	$\cdots$	$p_n$

Then, the mean of this distribution (of the random variable X) is

\[\mu_X=x_1\cdot p_1+x_2\cdot p_2+x_3\cdot p_3+\cdots+x_n\cdot p_n\]

Example 4. Calculate the expected value for each of the following random variable whose distribution (or probability model) is given.

$X$ has the following probability model:

x	1	2	3	4	5	6
$p(x)$	$\frac{1}{6}$	$\frac{1}{6}$	$\frac{1}{6}$	$\frac{1}{6}$	$\frac{1}{6}$	$\frac{1}{6}$

$Y$ has the following probability model:

y	0	1	2
$p(y)$	$\frac{1}{4}$	$\frac{1}{2}$	$\frac{1}{4}$

For the fixed prizes in the PowerBall lottery, the relevant numbers are

event	prize	probability
match 5	$200,000	1 / 3563608.83
match 4 + powerball	$10,000	1 / 584431.85
match 4	$100	1 / 14254.44
match 3 + powerball	$100	1 / 11927.18
match 3	$7	1 / 290.91
match 2 + powerball	$7	1 / 745.45
match 1 + powerball	$4	1 / 126.88
powerball	$3	1 / 68.96

Let $P$ denote the prize.

Solution.

$E(X)=(1)(\frac{1}{6})+(2)(\frac{1}{6})+(3)(\frac{1}{6})+(4)(\frac{1}{6})+(5)(\frac{1}{6})+(6)(\frac{1}{6})=3.5$
$E(Y)=(0)(\frac{1}{4})+(1)(\frac{1}{2})+(2)(\frac{1}{4})=1$
$E(P)=(200000)(1 / 3563608.83)+(10000)(1 / 584431.85) + (100)(1 / 14254.44) + \\ (100)(1 / 11927.18) + (7)(1 / 290.91) + (7)(1 / 745.45) + (4)(1 / 126.88) + (3)(1 / 68.96)\\ =\$0.20$

A the price of a lottery ticket is $1.00, so the profit is $0.20 - 1 = -0.80$. That is , on average, a buyer loses about 80 cents for each ticket. It excludes any jackpot winnings, which vary with the size of the jackpot.

When people say that stocks are riskier than bonds, they mean the price of a stock as a random variable rises and falls frequently. The quantity that measures the variation or spread of a random variable is called its variance, denoted $Var(X)$ or $\sigma^2$. The square root of the variance, denoted $SD(X)$ or $\sigma$, is called the standard deviation. Neither can be negative.

To calculate the variance of a random variable (say $X$), follow the steps:

Calculate the mean (expected value) of the random variable, denoted by $E(X)$ or $\mu_X$.
Subtract the mean from each possible value of the random variable, and then square these differences (or deviations).
Multiply the squared differences by the associated probabilities.
Add the above products.

If the distribution looks like:

$x$	$x_1$	$x_2$	$x_3$	$\cdots$	$x_n$
$p(x)$	$p_1$	$p_2$	$p_3$	$\cdots$	$p_n$

Then, the variance of this distribution (of the random variable X) is

\[Var(X)=\sigma^2=(x_1-\mu)^2\cdot p_1+(x_2-\mu)^2\cdot p_2+(x_3-\mu)^2\cdot p_3+\cdots+(x_n-\mu)^2\cdot p_n\]

Example 5.

Find the mean, variance, and standard deviation of the random variable $X$ whose distribution (or probability model) is

x	2.5	4.0	6.8
Probability	0.2	0.3	0.5

Solution.

The mean $\mu=(2.5)(0.2)+(4.0)(0.3)+(6.8)(0.5)=0.5+1.2+3.4=5.1$.

The variance $\sigma^2=(2.5-5.1)^2\cdot (0.2)+(4.0-5.1)^2\cdot (0.3)+(6.8-5.1)^2\cdot (0.5)=3.16$.

The standard deviation $\sigma=\sqrt{3.16}=1.78$.

Example 6 Randomly choose a number from the set {2, 5, 9, 12, 15, 20, 22, 28, 31}. Denote the chosen number by $X$. Write down the distribution of the random variable X, and find

the mean of X ($\mu$),
the variance of X ($\sigma^2$),
the standard deviation of X ($\sigma$), and
the coefficient of variation ($CV$) of X; that is, the ratio of the standard deviation to the mean.

Note: Each of the above quantity computed is also said to be a quantity of the original set (or population).

Solution.

The distribution of $X$ is discrete, and $X$ takes each of the distinct values listed in the set with the same probability of 1/8.

$E(X) = \mu = (2)\frac{1}{8}+(5)\frac{1}{8}+(9)\frac{1}{8}+(12)\frac{1}{8}+(15)\frac{1}{8}+(20)\frac{1}{8}+(22)\frac{1}{8}+(28)\frac{1}{8}+(31)\frac{1}{8}=16$
$Var(X) = V(X)=\sigma^2 = E(X^2)-[E(X)]^2=(2^2)\frac{1}{8}+(5^2)\frac{1}{8}+(9^2)\frac{1}{8}+(12^2)\frac{1}{8}+(15^2)\frac{1}{8}+(20^2)\frac{1}{8}+(22^2)\frac{1}{8}+(28^2)\frac{1}{8}+(31^2)\frac{1}{8}-16^2=87.9375$
$sd(X)=\sqrt{Var(X)}=\sigma=\sqrt{87.9375}=9.3775$
$VC(X) = \frac{\sigma}{\mu}=0.5861$

Some properties of the expected value and the variance of any random variable:

$E(X+c)=E(X)+c$, where $c$ is a constant.
$E(a\cdot X)=a\cdot E(X)$, where $a$ is a constant.
$E(X+Y)=E(X)+E(Y)$.
$Var(X)=E(X^2)-[E(X)]^2$
$Var(X+c) = Var(X)$, where $c$ is a constant.
$Var(a\cdot X)=a^2\cdot Var(X)$, where $a$ is a constant.
If random variables $X$ and $Y$ are independent, then $Var(X+Y)=Var(X)+Var(Y)$.

If $X$ is a discrete random variable having the distribution $P(X=x_i)=p_i, i = 1, 2, \cdots, n$, then, for any function $h$ of $X$, $h(X)$ is a new random variable, and $E(h(X))=\sum_{i=1}^{n} h(x_i)p_i$.

For two discrete random variables having the joint probability mass function $P(X=x_i, Y=y_j)=p_{ij}, ~i = 1, 2, \cdots, n, ~j = 1, 2, \cdots, m$, then, for any function $h$ of $X$, $E(h(X, Y))=\sum_{i=1}^{n}\sum_{j=1}^{m} h(x_i, y_j)p_{ij}$.

Example 7.

If $E(X)=6$, $E(Y)=10$, $Var(X)=3$ and $Var(Y)=4$, and the two random variables $X$ and $Y$ are independent, find

$E(X-20)$,
$E(4X)$,
$E(3X-2)$,
$E(2X+5Y)$,
$Var(Y-1000)$,
$Var(-2X)$, and
$Var(4X-2Y)$,
$SD(4X-2Y)$,
$E(4X-2Y+3)$,
$Var(4X-2Y+3)$.

Solution.

$E(X-20)=E(X)-20=6-20=-14$,
$E(4X)=4\cdot E(X)=4\cdot 6=24$,
$E(3X-2)=E(3X)-E(2)=3E(X)-2=3(6)-2=18-2=16$,
$E(2X+5Y)=2E(X)+5E(Y)=2(6)+5(10)=12+50=62$,
$Var(Y-1000)=Var(Y)=4$,
$Var(-2X)=(-2)^2\cdot Var(X)=4(3)=12$, and
$Var(4X-2Y)=4^2Var(X)+(-2)^2Var(Y)=16(3)+4(4)=48+16=64$.
$SD(4X-2Y)=\sqrt{Var(4X-2Y)}=\sqrt{64}=8$,
$E(4X-2Y+3)=4E(X)-2E(Y)+3=4(6)-2(10)+3=7$,
$Var(4X-2Y+3)=Var(4X-2Y)=64$, using the result of (g).

Example 8.

We are considering investing $1000 into one or possibly two different investment funds. Historically, each has delivered 5% a year in profit with a standard deviation (or volatility) of 3%. So, a $1000 investment would produce $50 with a standard deviation of $30.

What are the relative advantages and disadvantages of putting $1000 into one, or splitting the $1000 and putting $500 into each?

Solution.

To analyze the different investment options, we need to compare the means and standard deviations of the returns.

Let $X$ be the amount gained by putting $1000 into one. Let $W$ be the amount gained by putting $1000 into each.

Let $W_1$ and $W_2$ be the amounts gained in each fund. Then, $W=W_1+W_2$.

Now,

\[E(X)=1000(0.05)=\$50\] \[SD(X)=(1000)(0.03)=\$30\]

\[E(W_1)=(500)(0.05)=\$25\] \[SD(W_1)=(500)(0.03)=\$15\]

\[E(W_2)=(500)(0.05)=\$25\] \[SD(W_2)=(500)(0.03)=\$15\]

\[E(W)=E(W_1+W_2)=E(W_1)+E(W_2)=25+25=\$50\] \[Var(W)=Var(W_1+W_2)=Var(W_1)+Var(W_2)=15^2+15^2=450\] \[SD(W)=\sqrt{450}= \$ 21.21\] In summary, both investments have a return of $50, but diversifying results in smaller standard deviation, or lower risk.

Exercise 1.

For a discrete random variable $X$ with probability mass function $p(x)$, $-log_{2} \{p(X)\}$ is a new random variable. The mean $E[-log_{2} \{p(X)\}]$ is called the entropy of (the distribution of) the random variable $X$.

Calculate the entropy of each of the following distributions:

$p(1)=p(0)=0.5$
$p(1)=1$
$p(1)=0.8, p(0)=0.2$
$p(1)=0.1, p(2) = 0.3, p(3)=0.6$

Exercise 2. Randomly select 5 numbers with replacement from the set {1, 2, 5}. Let $X$ denote the maximum.

What is the cumulative distribution function of $X$?
What is its probability mass function?

If the set is unknown to us and it is of interest to estimate the maximum of the set, then the maximum of the 5 selected numbers is said to be a point estimate of the unknown maximum of the set. Assuming the set is as given above, find

$E(X)-5$, which is called the bias of $X$, and
$[E(X-5)^2]$, which is called the mean squared error of $X$.

Exercise 3.

Suppose $X$ is a discrete random variable taking, with equal probability, each of the first $N$ natural numbers as a value. Find $E(X)$ and $Var(X)$.

The Covariance and Correlation between Two Random Variables

The covariance and correlation are commonly used to characterize the degree that two random variables are linearly related. They are defined here.

The covariance between two random variables are defined by \[Cov(X, Y)=E[(X-\mu_X)\cdot (Y-\mu_Y)].\] Which is equivalent to \[Cov(X, Y)=E[XY]-\mu_X \cdot \mu_Y.\]

Proof:

By definition,

\[Cov(X, Y)=E[(X-\mu_X)\cdot (Y-\mu_Y)].\] Breaking the parentheses yields

\[Cov(X, Y)=E[XY-X\cdot \mu_Y-\mu_X\cdot Y+ \mu_Y\cdot \mu_Y],\] which can be written as

\[Cov(X, Y)=E[XY]-E[X\cdot \mu_Y]-E[\mu_X\cdot Y]+ E[\mu_Y\cdot \mu_Y].\] Further simplification gives

\[Cov(X, Y)=E[XY]-E[X]\cdot \mu_Y-\mu_X\cdot E[Y]+ \mu_Y\cdot \mu_Y,\] which is just

\[Cov(X, Y)=E[XY]-\mu_X \cdot \mu_Y.\]

The correlation between two random variables are defined by \[\rho_{X,Y}\equiv Corr(X, Y)=\frac{Cov(X, Y)}{\sigma_X\cdot \sigma_Y}.\] It can be shown that the value of $\rho_{X,Y}$ is always between $-1$ and 1, inclusive, but the value of $Cov(X, Y)$ can be any number from $-\infty$ to $\infty$.

When calculating the covariance or correlation between two random variables, the joint distribution of the two random variables are usually given or otherwise obtained.

Example 1. Given the joint distribution:

https://www.statisticshowto.com/joint-probability-distribution/

Find

$E(X)$ and $E(Y)$,
$Var(X)$ and $Var(Y)$,
$E(XY)$,
$Cov(X, Y)$, and
$\rho_{X,Y}$.

Solution.

We need to find the marginal distributions first. To find the marginal distribution of $X$. Note that

$p(1) = P(X=1)=P(X=1, Y=1)+P(X=1, Y=2)+P(X=1, Y=3)=0+\frac{1}{6}+\frac{1}{6}=\frac{1}{3}$, and similarly, $p(2)=P(X=2)=\frac{1}{3}$, and $p(3)=P(X=3)=\frac{1}{3}$, so the marginal distribution of $X$ is $p(x)=\frac{1}{3}$, for $x=1,2,$ and 3.

By the same token, the marginal distribution of $Y$ is $p(y)=\frac{1}{3}$, for $y=1,2,$ and 3.

The mean of $X$ is $E(X)=(1)\frac{1}{3}+(2)\frac{1}{3}+(3)\frac{1}{3}=2$. The mean of $Y$ is $E(Y)=2$.
$Var(X)=E(X^2)-[E(X)]^2=(1^2)\frac{1}{3}+(2^2)\frac{1}{3}+(3^2)\frac{1}{3}-2^2=\frac{2}{3}$. $Var(Y)=\frac{2}{3}$.
\[\begin{align*} E(XY) & = (1)(1)(0)+(1)(2)(1/6)+(1)(3)(1/6) +\\ &~~~~~(2)(1)(1/6)+(2)(2)(0)+(2)(3)(1/6) + \\ &~~~~~(3)(1)(1/6)+(3)(2)(1/6)+(3)(3)(0)\\ &=11/3 \end{align*}\]
$Cov(X,Y)=E(XY)-E(X)\cdot E(Y)=11/3-(2)(2)=-1/3$
$\rho_{X,Y}=\frac{Cov(X,Y)}{\sigma_X \cdot \sigma_Y}=\frac{-1/3}{\sqrt{2/3}\cdot \sqrt{2/3}}=-0.5$

Special Discrete Distributions

Consider a discrete random variable $X$, which denotes the number of heads obtained when flipping a fair coin $n$ times. Obviously, the possible values of $X$ are 0, 1, 2, 3, …, and $n$. The probability that $X=x$ can be found by using the following formula:

\[p(x)=P(X=x) = {{n}\choose{x}} \cdot p^x \cdot (1-p)^{n-x}\] where $p$ is the probability of flipping a head, ${{n}\choose{x}}=\frac{n!}{x!\cdot (n-x)!}$, and $n!$ is the product of the first $n$ natural numbers (For example, $3!=3\cdot 2\cdot 1=6$ and $5!=5\cdot 4\cdot 3\cdot 2\cdot 1=120$). For a balanced coin, $p=0.5$.

The above probability formula is called the Binomial probability model, and $X$ is said to have a Binomial distribution. When $n=1$, the Binomial distribution is called the Bernoulli distribution.

It can be shown that the mean and the variance of the random variable $X$ are $n\cdot p$ and $n\cdot p\cdot (1-p)$ respectively.

Watch a video for calculating the Binomial probability: https://www.youtube.com/watch?v=uqDSRAPXXjI

Example 1.

Flip a fair coin 10 times. Let $X$ denote the number of heads. Find

The probability that $X=4$, that is $p(4)$,
The probability that $X<2$,
The mean of $X$,
The variance of $X$, and
the standard deviation of $X$.

*Solution.$

$P(X=4) = {{10}\choose{4}} \cdot 0.5^4 \cdot (1-0.5)^{10-4}=\frac{10!}{4!\cdot 6!}\cdot 0.0625\cdot 0.015625=0.2051$.
$P(X<2) = P(X = 0 ~\text{or}~ 1) =P(X=0)+P(X=1)=p(0)+p(1)={{10}\choose{0}} \cdot 0.5^0 \cdot (1-0.5)^{10-0} + {{10}\choose{1}} \cdot 0.5^1 \cdot (1-0.5)^{10-1}=\frac{10!}{0!\cdot 10!}\cdot 1\cdot 0.0009765625 + \frac{10!}{1!\cdot 9!}\cdot 0.5\cdot 0.001953125=0.0009765625+0.009765625=0.0101$.
The mean $E(X)=n\cdot p=(10)(0.5)=5$.
The variance $Var(X)=n\cdot p \cdot (1-p)=(10)(0.5)(1-0.5)=2.5$.
The standard deviation $SD(X)=\sqrt{2.5}=1.58$.

Let $X$ denote the number of flips of a coin until a head is shown up. What are the possible values of $X$? What is the distribution of $X$?

It is easy to see that the probability mass function (pmf) of $X$ is

\[p(x)=(1-p)^{x-1}p, x = 1, 2, 3, \cdots\] This distribution is called the geometric distribution, and is denoted by $X\sim Geometric(p)$. The mean of the distribution is $\frac{1}{p}$ and the variance is $\frac{1-p}{p^2}$.

Example 2. If $X\sim Geometric(p=0.4)$, find $P(X>3)$, $P(X>4)$, and $P(X>4|X>1)$.

Solution.

$p(1)=p=0.4$ $p(2)=(1-p)p=(1-0.4)(0.4)=0.24$ $p(3)=(1-p)^2 p=(1-0.4)^2(0.4)=0.144$ $p(4)=(1-p)^3 p=(1-0.4)^3(0.4)=0.0864$ \[\begin{align*} P(X>3) & =1-P(X\le 3)\\ & = 1-P(X=1)-P(X=2)-P(X=3)\\ & = 1-p(1)-p(2)-p(3)\\ & = 1-0.4-0.24-0.144\\ & = 0.216 \end{align*}\] \[\begin{align*} P(X>4) & =1-P(X\le 4)\\ & = 1-P(X=1)-P(X=2)-P(X=3)-P(X=4)\\ & = 1-p(1)-p(2)-p(3)-p(4)\\ & = 1-0.4-0.24-0.144-0.0864\\ & = 0.1296 \end{align*}\] $P(X>4|X>1)=\frac{P((X>4)\cap (X>1))}{P(X>1)}=\frac{P(X>4)}{P(X>1)}$. We know that $P(X>4)=0.1296$, so we need to find $P(X>1)$.

\[\begin{align*} P(X>1) & =1-P(X\le 1)\\ & = 1-P(X=1)\\ & = 1-p(1)\\ & = 1-0.4\\ & = 0.6 \end{align*}\]

Now, $P(X>4|X>1)=\frac{0.1296}{0.6}=0.216$, which equals $P(X>3)$. In general, if a random variable $X$ has a Geometric distribution, then $P(X>m|X>n)=P(X>m-n)$, where $m>n$. This is the so-called memoryless property of the Geometric distribution.

When the probability ($p$) involved in the binomial distribution is small and the number of trials ($n$) is large, the probability mass function can be approximated by a new distribution called the Poisson distribution. The probability mass function of this distribution has the form:

\[p(x)=\frac{\lambda^x}{x!}\cdot e^{-\lambda}\] where the parameter $\lambda>0$ and $x=0, 1, 2, \cdots$. The Poisson distribution is denoted by $X~\sim Poisson(\lambda)$.

It can be shown that the mean of the Poisson distribution equals $\lambda$ and the variance is $\lambda$ as well.

In practice, the Poisson distribution is often used to model the number of occurrences of a certain rare event, such as earthquake.

Example 3. Births in a hospital occur randomly at an average rate of 5.4 births per month.

What is the probability of observing 6 births in a given month at the hospital?
What is the probability of observing 60 births in a year at the hospital?

Solution.

Let $X$ denote the number of births in a month. Then, $X$ has a $Poisson(\lambda=5.4)$ distribution with $\lambda = 5.4$ per month. So, \[p(6)=P(X=6)=\frac{5.4^6}{6!}\cdot e^{-5.4}=0.1555\] The R code: dpois(6, 5.4)
Let $X$ denote the number of births in a year. Then, $X$ has a $Poisson(\lambda)$ distribution with $\lambda = (5.4)\cdot(12)=64.8$ per year. So, \[p(60)=P(X=60)=\frac{64.8^60}{60!}\cdot e^{-64.8}=0.0429\] The R code: dpois(60, 64.8)

The Geometric distribution can be extended to a more general distribution called the Negative binomial distribution. Flip a coin until k heads are seen. Let $X$ denote the total number of trials. The probability mass function of $X$ is

\[p(x)=P(X=x)=\binom{x-1}{k-1}p^{k-1}(1-p)^{x-k}p\] where $x=k, k+1, k+2, \cdots$. The mean is $\frac{k}{p}$ and the variance is $k\frac{1-p}{p^2}$. When $k=1$, the distribution becomes the Geometric distribution.

Example 4. An oil company conducts a geological study that indicates that an exploratory oil well should have a 20% chance of striking oil. What is the probability that the second strike comes on the fifth well drilled?

Solution.

Here $p=0.2, k = 2, x = 5$. Plugging these numbers in the probability mass function of the Negative binomial distribution yields

$p(5)=\binom{5-1}{2-1}0.2^{2-1}(1-0.2)^{5-2}(0.2)=0.08192$

To get this probability, the R code is: dnbinom(x-k, k, p)

Chapter 4. Continuous Distributions

Continuous random variables take values that can be represented by an interval or the union of a few intervals.

The Probability Density Function

The cumulative distribution function $F(x) = P(X\le x)$ of a continuous random variable $X$ is continuous. If $F$ is differentiable, the derivative is called the probability density function, denoted by $f(x)$; that is, $f(x)=F'(x)$.

The properties of $F(x)$ and $f(x)$ of a continuous random variable are listed below:

For any $x$, $0\le F(x)\le 1$,
$F(x)$ is continuous and non-decreasing,
$f(x)\ge 0$ for any $x$,
$\int_{-\infty}^{\infty}f(x)dx=1$,
$P(X=c)=0$ for any constant $c$, and
$P(a<X<b)=\int_{a}^{b}f(x)dx$.

Example 1. Given the probability density function of a random variable $X$,

\[\begin{equation} f(x) = \begin{cases} cx & \text{if } 1 < x < 2 \\ 0 & \text{elsewhere. } \end{cases} \end{equation}\]

determine

the constant $c$, and
the probability that $P(1.5< X\le 1.8)$.

Solution.

Since $\int_{-\infty}^{\infty}f(x)dx=1$, we have \[\int_{1}^{2}cxdx=1\] or,

\[c\cdot \frac{x^2}{2}|_1^2=1\] or,

\[c\cdot \frac{3}{2}=1\] or, \[c=\frac{2}{3}\] (b) $P(1.5<X\le 1.8)= \int_{1.5}^{1.8}\frac{2}{3}xdx= \frac{x^2}{3}|_{1.5}^{1.8}=0.33$

The mean of a continuous random variable $X$ is defined by \[\mu=E(X)=\int_{-\infty}^{\infty}xf(x)dx,\] and the variance is defined by \[Var(X)=\int_{-\infty}^{\infty}(x-\mu)^2f(x)dx.\] It can be shown that \[Var(X)=\int_{-\infty}^{\infty}x^2f(x)dx-\mu^2.\]

If $h(x)$ is a continuous function and $X$ is a continuous random variable with probability density function $f(x)$, then $h(X)$ is also a continuous random variable, with mean that can be calculated by \[E(h(X))=\int_{-\infty}^{\infty}h(x)f(x)dx.\] Especially, $E(X^2)=\int_{-\infty}^{\infty}x^2f(x)dx$, and thus we can write \[Var(X)=E(X^2)-E^2(X)\] $E(X^2)$ is called the second moment of random variable $X$, while the mean $E(X)$ is called the first moment.

Example 2. Let $X$ be a random variable having the following probability density function:

\[\begin{equation} f(x) = \begin{cases} \frac{x}{2} & \text{if } 0 < x < 2 \\ 0 & \text{elsewhere. } \end{cases} \end{equation}\]

Find the mean, second moment, and the variance.

Solution.

The mean $E(X)=\int_{-\infty}^{\infty}xf(x)dx=\int_{0}^{2}x\frac{x}{2}dx=\frac{x^3}{6}|_0^2=\frac{4}{3}.$

The second moment $E(X^2)=\int_{-\infty}^{\infty}x^2f(x)dx=\int_{0}^{2}x^2\frac{x}{2}dx=\frac{x^4}{8}|_0^2=2.$

The variance $Var(X)=E(X^2)-E^2(X)=2-(\frac{4}{3})^2=\frac{2}{9}.$

Exercise 1. Let $X$ be a random variable having the following probability density function:

\[f(x)=\frac{1}{2b}e^{-\frac{|x-\mu|}{b}}, ~~ -\infty<x<\infty\] where $b>0$. Determine

The mean, and
The variance.

Exercise 2. The following functions are called kernel functions. Verify that each of them is symmetric and is a probability density function. Find the mean, second moment, and the variance of each distribution.

Rectangular kernel:

\[\begin{equation} f(x) = \begin{cases} \frac{1}{2} & \text{if } |x|<1 \\ 0 & \text{elsewhere. } \end{cases} \end{equation}\]

Triangular kernel:

\[\begin{equation} f(x) = \begin{cases} 1-|x| & \text{if } |x|<1 \\ 0 & \text{elsewhere. } \end{cases} \end{equation}\]

Bi-weight kernel:

\[\begin{equation} f(x) = \begin{cases} \frac{15}{16}(1-x^2)^2 & \text{if } |x|<1 \\ 0 & \text{elsewhere. } \end{cases} \end{equation}\]

Gaussian kernel:

\[\begin{equation} f(x) = \frac{1}{\sqrt{2\pi}}e^{-x^2/2}, -\infty<x<\infty \end{equation}\]

Bartlett-Epanechnikov kernel:

\[\begin{equation} f(x) = \begin{cases} \frac{3}{4\sqrt{5}}(1-\frac{x^2}{5}) & \text{if } |x|<\sqrt{5} \\ 0 & \text{elsewhere. } \end{cases} \end{equation}\]

Laplace kernel:

\[f(x)=\frac{1}{2}e^{-|x|}, ~~ -\infty<x<\infty\]

The Uniform Distribution

The uniform distribution is used to describe the random variable, denoted by $U$, that takes values in an interval $(a, b)$ uniformly. The probability density function of $U$ is given as follows:

\[\begin{equation} f(x) = \begin{cases} \frac{1}{b-a} & \text{if } a < x < b \\ 0 & \text{elsewhere } \end{cases} \end{equation}\]

It can be shown that the mean of the uniform distribution is $\frac{a+b}{2}$ and variance $\frac{(b-a)^2}{12}$.

If $U$ has a uniform distribution on the interval $(a, b)$, then it is denoted by $U\sim Uniform(a,b)$.

Example 1. If $U\sim Uniform(1,5)$, find

The mean of $U$,
The standard deviation of $U$, and
The probability that $U$ falls in the interval $(1,2)$ or $(4,4.5)$.

Solution.

The mean of $U$: $E(U)=\frac{1+5}{2}=3$.
The standard deviation of $U$: $\frac{(5-1)^2}{12}=\frac{4}{3}$.
The probability that $U$ falls in the interval $(1,2)$ or $(4,4.5)$: $\int_{1}^{2}\frac{1}{4}dx+\int_{4}^{4.5}\frac{1}{4}dx=\frac{1}{4}+\frac{0.5}{4}=\frac{3}{8}$

The exponential Distribution

If $X$ has the probability density function of

\[\begin{equation} f(x) = \begin{cases} \lambda e^{-\lambda x} & \text{if } x>0 \\ 0 & \text{O.W. } \end{cases} \end{equation}\]

then, $X$ is said to have an exponential distribution with parameter $\lambda$, denoted by $X\sim Exponential(\lambda)$.

It turns out that the mean of the exponential distribution is $\frac{1}{\lambda}$ and the standard deviation is also $\frac{1}{\lambda}$.

Example 1. If $X\sim Exponential(\lambda = 3)$, find

The mean of $X$,
The standard deviation of $X$, and
The probability that $X$ is between 2 and 4.

Solution.

The mean equals $\frac{1}{3}$.
The standard deviation also equals $\frac{1}{3}$.
The desired probability equals $\int_{2}^{4}3e^{-3x}dx=-e^{-3x}|_{2}^{4}=0.00247$.

There is a connection between the exponential distribution and the Poisson distribution. Specifically, if the number of arrivals of customers at a store (or phone calls at an exchange center, occurrence of earthquakes) in a unit time interval is independent of the number of arrivals in any other non-overlapping unit interval, and each of these numbers follows a Poisson distribution with the same parameter $\lambda$, the time between any two consecutive arrivals (called the inter-arrival time) is a random variable having an Exponential distribution with mean $\frac{1}{\lambda}$.

The Gamma Distribution

We introduce the gamma function first. For each value of $\alpha$ that is not a negative integer or 0, the following integral

\[\int_0^\infty x^{\alpha-1}e^{-x}dx\]

is well defined, and thus is a function of $\alpha$. The function is denoted by $\Gamma(\alpha)$ and is called the Gamma function.

When $\alpha = n$ is a positive integer, it can be shown that $\Gamma(n+1)=n!$. Thus, the gamma function is an extension of the factorial to positive real numbers. It can also be shown that $\Gamma(\alpha+1)=\alpha\cdot \Gamma(\alpha)$ and $\Gamma(\frac{1}{2})=\sqrt{\pi}$.

A random variable $X$ has a Gamma distribution if its probability density function has the following form:

\[\begin{equation} f(x) = \begin{cases} \frac{\lambda^\alpha}{\gamma (\alpha)} x^{\alpha -1}e^{-\lambda x} & \text{if } x>0 \\ 0 & \text{O.W. } \end{cases} \end{equation}\]

This is denoted by $X\sim Gamma(\alpha, \lambda)$. The mean is $\frac{\alpha}{\lambda}$ and the variance is $\frac{\alpha}{\lambda^2}$.

There is a connection between the $Gamma$ distribution and the $Exponential$ distribution. Specifically, if $X_1$ and $X_2$ are independent having an Exponential distribution with mean $\frac{1}{\lambda}$, then, the sum $X_1 + X_2$ has a $Gamma(\alpha = 2, \lambda)$.

The Normal Distributions

We learned that, for a random variable that takes isolated (discrete) values, we can use a table to list those values along with their probabilities. Many times, however, we will need to work with random variables that take values on an interval. For such continuous random variables, we can not list values one by one. We will have to seek ways to handle this difficulty. Fortunately, we can use a curve to represent the distribution (or model) of a continuous random variable.

The curve should satisfy the following conditions:

It must not be below the x-axis, and
the total area under the curve (AUC) must equal 1.0 or 100%.

The function representing such a curve is called a probability density function, or simply a density function. Usually, we use $f(x)$ to denote a density function.

The Normal distribution: If the function takes the form

\[f(x)=\frac{1}{\sigma\cdot \sqrt{2 \pi }}e^{-\frac{(x-\mu)^2}{2\sigma^2}}\] where $\mu$ and $\sigma$ are constants, we say that the underlying distribution is normal.

The constants $\mu$ and $\sigma$ are called parameters, they happen to represent the mean and standard deviation of the distribution.

When $\mu=0$ and $\sigma=1$, the distribution is said to be a standard normal distribution.

Watch a video for normal distributions: https://www.youtube.com/watch?v=iYiOVISWXS4

Properties of the normal distribution with mean $\mu$ and standard deviation $\sigma$:

The curve is symmetric and bell shaped.
The area under the whole curve equals 1.0 or 100%.
The area, between $\mu-\sigma$ and $\mu+\sigma$, below the normal density curve, and above the x-axis, equals 68% approximately.
The area, between $\mu-2\sigma$ and $\mu+2\sigma$, below the normal density curve, and above the x-axis, equals 95% approximately.
The area, between $\mu-3\sigma$ and $\mu+3\sigma$, below the normal density curve, and above the x-axis, equals 99.7% approximately.

The last 3 properties correspond to the so-called 68-95-99.7 Rule. It means that, when a random value is taken from a normal distribution, with chance about 68%, the value falls within one standard deviation of the mean, with chance about 95%, the value falls within two standard deviations of the mean, with chance about 99.7%, the value falls within three standard deviation of the mean.

How can we calculate the probability that a normal random variable falls within an arbitrary interval? For example, how can we calculate the following:

$P(X<a)$,
$P(X>b)$, and
$P(a<X<b)$

A traditionally very useful result is the following:

If random variable X has a normal distribution with mean $\mu$ and standard deviation $\sigma$, then, the new random variable Z, defined as $Z=\frac{X-\mu}{\sigma}$ and is called the $z-$score, has a standard normal distribution.

The usefulness of this result is that any probability calculation regarding the original random variable X can be done through a calculation on $Z$. So, a probability table built on $Z$ can be useful. Such a table is called a standard normal table. Google it! But, we will not emphasize the use of this kind of table. Instead, we use software!

Example. Let $X$ have a normal distribution with mean 10 and standard deviation 3. Find

$P(X<5)$,
$P(X<12)$,
$P(X>7)$,
$P(X>13)$,
$P(4<X<8)$, and
$P(5<X<14)$.

Solution.

Sketch a normal curve and mark the 68-95-99.7 cutoffs. Now, mark the cutoff “5”. The desired probability equals the area under the curve, above the x-axis, and to the left of 5. To calculate this area In Excel, click a cell and type “=NORM.DIST(5,10,3,1)” and then hit “Enter”, where “5” is the cutoff, “10” is the mean, “3” is the standard deviation, and “1” indicates cumulative probability (from negative infinity to the cutoff, or the left tail area under the curve). The answer is 0.0478.

For R fans, in the console, type “pnorm(5, 10, 3)” and hit “enter”. You get the same answer! Here “p” indicates "cumulative probability.

In the same way, the answer is 0.7475.
Draw a picture again. This time, we find the upper tail area beyond the cutoff 7. We need find left tail area first, then subtract it from 1.0, since the total area under the curve is always 1.0. Your answer should be $1 - 0.1587$, or 0.8413.
The answer should be 0.1587.
Draw a graph again. This time, we need to find an area between two cutoffs. To do that, we first find the left area beyond 4 and then find the left area beyond 8. Subtracting the two areas will give you the answer. The answer should be $0.25249-0.02275$, or 0.2297. I used 5 decimal places in order to get a more accurate result.
You should get 0.8610.

How can we go backwards? That is, given that $X$ has a normal distribution with mean $\mu$ and standard deviation $\sigma$, how can we find a value $x$, such that $P(X<x)$ equals a given number, say $p$. The value $x$ here is called the $100p$-th percentile of a distribution.

Example. Given that the random variable $X$ has a normal distribution with mean 20 and standard deviation 6. Find the

25th percentile,
50th percentile,
75th percentile,
90th percentile, and
95th percentile.

Solution.

In Excel, type “=NORM.INV(0.25, 20, 6)” in a cell and hit “enter”. Your answer should be 15.95 in two deciaml places.

For R fans, in the console, type “qnorm(5, 20, 6)” and hit “enter”. You get the same asnswer! Here “q” indicates “quantile” which is more general than “percentile”.

The answer should be 20.
The answer should be 24.05.
The answer should be 27.69.
The answer should be 29.87.

Example. For those who would like to pursue a graduate degree in business, they would need to take the Graduate Management Admission Test (GMAT). The score range for GMAT is 200 to 800. Based on some data, the distribution of GMAT scores is approximately normally distributed with mean 500 and standard deviation 100.

What percent of scores are

below 300?
above 600?

What should you score in order to place in the

top $10\%$?
top $5\%$?

Solution.

Let $X$ denote the GMAT score.

This is to find the cumulative probability $P(X<300)$. In Excel, type “=NORM.DIST(300, 500, 100, 1)” in a cell and hit “enter”. Your answer should be 0.0228. That is, about 2.28% of GMAT takers have scores below 300.

R fans: In the console, type “pnorm(300, 500, 100)” and hit“enter”.

This is to find the cumulative probability $P(X<600)$, and then subtract from 1.0. Your answer should be $1-0.8413$, or 0.1587.
This is to find $x$ such that $P(X<x)=0.90$ (why 0.90 and not 0.1?). That is, you need to find the 90th percentile. In Excel, type “=NORM.inv(0.90, 500, 100)” and hit enter. Your answer should be 628.16.

R fans: in the console, type “qnorm(0.90, 500, 100)” and hit “enter”.

Your answer should be 664.49.

A final property regarding the normal distribution:

If $X$ has a normal distribution with mean $\mu_1$ and variance $\sigma_1^2$, $Y$ has a normal distribution with mean $\mu_2$ and variance $\sigma_2^2$, and the two random variables are independent, then the distribution of the random variable $X+Y$ has a normal distribution as well, with mean $\mu_1+\mu_2$ and variance $\sigma_1^2+ \sigma_2^2$. This is the famous additivity property of the normal distribution.

Example. If $X$ is normal with mean 10 and variance 4, and independently of $X$, $Y$ has a normal distribution with mean 5 and variance 2. What is the distribution of the sum $X+Y$?

Solution.

The sum $X+Y$ has a normal distribution with mean 15 and variance 6. The standard deviation is $\sqrt{6}$.

The end of the chapter!

Chapter 5. Computer Simulations and Monte Carlo Methods

Computer simulation is the regeneration of a process by writing and running computer code. It is widely used to study the dynamic behavior of an object or a system in response to changes in various conditions.

Monte Carlo methods or Monte Carlo simulations are techniques used to estimate quantities that can be otherwise difficult to calculate. Examples of such quantities include

the integration of a function over an interval,
the probability that no two have the same birthday among a group people, and
the distribution of a system when in equilibrium.

We will use the R programming language in this chapter.

What is R?

R is a programming language, free and open source. It is widely used and is among the top 5 choices for data analysis. You can download R and RStudio on your own computer: https://www.youtube.com/watch?v=_2sewGCA0y4. You can also use them in the cloud: https://www.youtube.com/watch?v=uK1Va_UWQFc&t=244s.

Here is an example of using R for visualizing data: https://rpubs.com/nicolemjacobo/Project4.

If you only want to practice the basics of R, just use the current web page!

Coding in R

Coding in R is fairly easy. You can use R as a calculator. Your code must be in a code chunk, a pair of “```”. A code chunk looks something like this:

```{r demo}
x = 3
y = 7
z = x + y

```

Here, within the pair of braces, the letter “r” indicates that the coding language is r. Following the r letter, there might be a space followed by a string that names the code chunk. The name should not contain any white space. If a code chunk has options, such as “exercise=TRUE” and “echo = TRUE”, they will modify the functions of the chunk.

For more details, refer to https://rmarkdown.rstudio.com/lesson-3.html.

I have used a package “learn” that can create a code chunk so that you can write simple code in it. Press “Run Code” to run/execute your code.

3+2

## [1] 5

3-2

## [1] 1

3*2

## [1] 6

3/2

## [1] 1.5

Your code can go with comments, indicated by a “#” sign, to explain what the code is doing. A comment does not change the execution of your code.

3^2 # Find the value of 3 to the power of 2.

## [1] 9

9^0.5 # Find the value of 9 to the power of 0.5 (or square root of 9)

## [1] 3

(20-16)/2*5+3 # Use parentheses when order is important

## [1] 13

You can store values or other things using variables which are called objects. The following code creates 4 objects.

x = 3
y = 5

z = x+y # Store the value of the sum of x and y into the z object.

w = x*y

x   # Print the x

## [1] 3

## [1] 5

## [1] 8

## [1] 15

An Exercise

Here’s a simple exercise with an empty code chunk provided for entering the answer.

Write the R code required to evaluate the square root of the value of $(5+14)^2-5(21-15)$.

Basics of R Programming

We will introduce logical operations, conditional statements, loops, and functions.

Logical Operators, Logical Expressions, and Logical Values

In some situations, we may need to check if a statement is true. The following are examples of statements in R:

x == y (x and y are equal)
x < y (x is smaller than y)
x > y (x is larger than y)
x <= y (x is no greater than y)
x >= y ( x is no smaller than y)
x != y ( x and y are not equal)

More complicated statements can be formed:

(x == y) & (z > 3) (x are equal to y AND z is larger than 3)
(x == y) | (z > 3) (x are equal to y OR z is larger than 3)

R assigns the value “TRUE” to a statement that is true, and the value “FALSE” to a statement that is false. The values “TRUE” and “FALSE” are constants in R. Numerically, “TRUE” is the same as 1 and “FALSE” is the same as 0, so the code “TRUE + 100” would gives 101.

x= 3
y = 3.0

z = (x == y)

cat("The value of z is ", z, "\n")

## The value of z is  TRUE

A = "xyz"
B = "XYZ"
C = (A == B)

cat("The value of C is ", C, "\n")

## The value of C is  FALSE

The if-else Conditional Statement

When 2 different operations are done under 2 different conditions, we can use the if-else statement. Here is an example.

score = 80
if (score < 60){
  cat("Study hard!\n")  # The symbol "\n" means a new line
} else {
  cat("You passed the test.")
}

## You passed the test.

If there are more than 2 conditions, we can use one or more “else if” in between “if” and “else”. Here is an example.

score = 80
if (score < 60){
  cat("Study hard!\n")
} else if (score < 90){
  cat("You did well for the test.")
} else {
  cat("You did excellently for the test.")
}

## You did well for the test.

The for and while Loops

When we know exactly how many times a repeated action will be taken, we use the for loop. The following example will print each element of a vector.

x = c(23, 45, 56, 87, 103)
for (i in x){
  print(i)
}

## [1] 23
## [1] 45
## [1] 56
## [1] 87
## [1] 103

When we do not know how many times a loop will run, we use the while loop. Here is an example:

What is the smallest integer $n$ such that $1 + 2 + 3 + \cdots + n > 100$? The following code will find such $n$.

n = 1
S = 0

while (S <= 100) {
  
  S = S + n

  cat("n = ", n, "S = ", S, "\n")
  
  n = n + 1
}

## n =  1 S =  1 
## n =  2 S =  3 
## n =  3 S =  6 
## n =  4 S =  10 
## n =  5 S =  15 
## n =  6 S =  21 
## n =  7 S =  28 
## n =  8 S =  36 
## n =  9 S =  45 
## n =  10 S =  55 
## n =  11 S =  66 
## n =  12 S =  78 
## n =  13 S =  91 
## n =  14 S =  105

The smallest integer $n$ is 14.

An exercise:

Write code to find all integer solutions to the equation

\[2^x (4-x) = 2x+4\]

R Functions

In mathematics, an equation such as $y = 2x + 1$ defines a function, where $x$ is called the independent variable (or input) and $y$ is called the dependent variable (or output). A function can be named. Let’s name the previous function $f$. The function can be written as $f(x) = 2x + 1$. A function can be treated as a machine. The previous function means: when feeding the input $x$ to the machine $f$, the output is $2x + 1$.

Microsoft Excel has many functions that can be called to do work for us. A function is a self-contained module of code that accomplishes a specific task. For example, the function “sqrt()” allows us to find the square root value of a number. Within the pair of parentheses is a list of parameters or inputs.

Repeatedly doing the same kind of work is boring to human beings but not to machines. R has many built-in functions ready for use. If data are fed to a function, the function will process them and return a result. Once a function is written and stored, it can be used over and over again.

Examples:

sqrt(9) # Call the square-root function to find the square root of 9

## [1] 3

log(100) # Call the base-e logarithmic function to find log of 100

## [1] 4.60517

exp(2) # Call the exponential function to calculate e to the power of 2

## [1] 7.389056

abs(2*4-125/5) # Call the absolute value function

## [1] 17

Here are more examples.

Example 1. Use R to calculate numerical summaries of data.

x = c(96, 85, 90, 82, 78) # Data: Scores of a mini class
mean(x) # Calculate the mean of the data in the x vector/array

## [1] 86.2

median(x) # Calculate the median

## [1] 85

sd(x) # Calculate the standard deviation

## [1] 7.014271

      # A video: https://www.youtube.com/watch?v=arzaMpDxYSQ

Example 2. Plot data.

Data are collected from 10 students. The times they spend on study, the scores they get, and their gender information are available.

time = c(2, 5, 0, 2, 8, 10, 6, 4, 15, 6)
score = c(78, 85, 54, 72, 88, 92, 80, 75, 98, 68)
gender = c("M", "F", "F", "M", "F", "F", "M", "M", "M", "M")

plot(score ~ time) # Scatter plot of score versus time

table(gender) # Count each gender. The result is called a frequency distribution.

## gender
## F M 
## 4 6

barplot(table(gender)) # Plot the frequency distribution using a bar graph.
title("I am a title",    # Add a title to the most recent plot
      col.main = "red",  # Choose color
      cex.main = 2,      # Choose size
      font.main = 3      # Choose font
     )

Example 3. Add a line to a plot.

A horizontal, vertical or slant line can be added to an existing plot. The following shows how.

Opinion = c("A", "D", "D", "D", "A", "D", "A", "A", "A", "D", "D", "D", "A", "D")
barplot(table(Opinion)/length(Opinion), ylim = c(0,1))
abline(h = 1:9/10, lty = "dashed", col = "red")  # Add horizontal reference lines

x = c(1, 3, 6, 10, 18)
y = c(3, 5, 8, 15, 21)

plot(y~x)
abline(h = mean(y), v = mean(x)) # Add a horizontal line at the mean of y

                                 # and a vertical line at the mean of x

Example 4. Add two lines with a legend to a plot.

The following code generates 100 data values from an exponential distribution with mean 2. A histogram is made, along with the mean and median of the simulated data.

x <- rexp(100, rate = .5)

hist(x, main = "Mean and Median of a Skewed Distribution")

abline(v = mean(x),   col = 2, lty = 2, lwd = 2) # Add a vertical line at the simulated mean

abline(v = median(x), col = 3, lty = 3, lwd = 2) # Add a vertical line at the simulated median

legend(4.1, 30, c("Simulated Mean", "Simulated Median"), 
       col = 2:3,  
       text.col = 2:3, 
       lty = 2:3, 
       lwd = 2, 
       bty = "n"  # No box around the legend
      ) # Add a legend at location (4.1, 30)

An exercise:

Plot the following functions

\[y=1\]

\[y=log(n)\]

\[y=n\]

\[y=nlog(n)\]

\[y=n^2\]

\[y=n^3\]

\[y=2^n\]

\[y=n!\]

where $n$ is a positive integer, on the same graph.

The Pipe Operator %>%

In mathematics, we may need to evaluate the value of a function named $f$ at a given $x$ value. The obtained value may then be used as an input for another function named $g$. Mathematically, we would do: $g(f(x))$. In R, this can be implemented by the code

\[\text{g(f(x))}\] The R package “tidyverse” allows an alternative:

\[\text{x %>% f() %>% g()}\] which is much more natural than the previous expression. The above expression means to process the input $x$ first by machine $f$ then the output is further processed by the machine $g$. The symbol %>% is called a pipe operator.

100 %>% sqrt() %>% log() # Take the square root of 100 and then take the log of the result.

## [1] 2.302585

Basic Data Structures in R

You can make an array of values called a vector or 1-dimensional array. A vector contains values of the same type (Numeric or character). To make a vector in R, use the function “c”. You can also make a rectangular data table called data frame that contains one or more columns. To make a data frame, use the function “data.frame()”.

Vectors

Here are some examples:

# Two numeric vectors or arrays
v1 = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15)
v2 = 1:15 # Quicker

v1 # Print the first array

##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15

v2 # Print the second array

##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15

# Extract a value of a vector
v1[3] # Extract the third value of vector v1

## [1] 3

# Create a vector of all zeroes
v = numeric(20) # A vector of 20 zeroes

# Find the length of a vector
length(v1) # length of v1

## [1] 15

# Vectors with missing values
x = c(3, 46, NA, 12, 0, -3, NA, 34) # Two missing values each represented by NA 

# Handling missing values
mean(x) # NA   (I guess "NA" = "Not a Value" or "Not Available")

## [1] NA

mean(x, na.rm = TRUE) # Missing values removed

## [1] 15.33333

sd(x) # NA

## [1] NA

sd(x, na.rm = TRUE) # Missing values removed

## [1] 20.11633

Data Frames

Data frames are rectangular tables with column names and possibly row names. Here is an example.

# A character vector or array
Name = c("Jessica", "Tom", "David", "Alice", "Cathy", "Bob")

# Three numeric vectors
Time = c(3, 4, 4, 1, 5, 7)
Score = c(82, 85, 80, 78, 93, 89)
HighSchoolGPA = c(3.2, 3.4, 3.0, 2.8, 3.8, 3.6)

# A data frame with 3 columns. 
# If a column name has more than one word, try to remove any space.
Student = data.frame(Name, Time, Score, HighSchoolGPA) # Store the data in Student
Student # Print the data frame

##      Name Time Score HighSchoolGPA
## 1 Jessica    3    82           3.2
## 2     Tom    4    85           3.4
## 3   David    4    80           3.0
## 4   Alice    1    78           2.8
## 5   Cathy    5    93           3.8
## 6     Bob    7    89           3.6

# To show the structure of the data
str(Student) # The function str() can show the structure of an R object.

## 'data.frame':    6 obs. of  4 variables:
##  $ Name         : chr  "Jessica" "Tom" "David" "Alice" ...
##  $ Time         : num  3 4 4 1 5 7
##  $ Score        : num  82 85 80 78 93 89
##  $ HighSchoolGPA: num  3.2 3.4 3 2.8 3.8 3.6

# To extract part of the data
Student[1:3, ] # Extract the first 3 rows

##      Name Time Score HighSchoolGPA
## 1 Jessica    3    82           3.2
## 2     Tom    4    85           3.4
## 3   David    4    80           3.0

Student[, 1:2] # Extract the first 2 columns

##      Name Time
## 1 Jessica    3
## 2     Tom    4
## 3   David    4
## 4   Alice    1
## 5   Cathy    5
## 6     Bob    7

head(Student, n = 4) # The function head() extracts the first 4 rows of the Student data frame

##      Name Time Score HighSchoolGPA
## 1 Jessica    3    82           3.2
## 2     Tom    4    85           3.4
## 3   David    4    80           3.0
## 4   Alice    1    78           2.8

tail(Student, n = 3) # Extract the last 3 rows of the Student data frame

##    Name Time Score HighSchoolGPA
## 4 Alice    1    78           2.8
## 5 Cathy    5    93           3.8
## 6   Bob    7    89           3.6

# To access a column in a data frame, use the $ sign.
Student$Name # Print the Name column as a vector.

## [1] "Jessica" "Tom"     "David"   "Alice"   "Cathy"   "Bob"

Student$Score # Print the Score column as a vector. Alternatively, Student[, 3] or Student[, "Score"] will do.

## [1] 82 85 80 78 93 89

# To remove a column, say column 2, do:
Student[, -2]

##      Name Score HighSchoolGPA
## 1 Jessica    82           3.2
## 2     Tom    85           3.4
## 3   David    80           3.0
## 4   Alice    78           2.8
## 5   Cathy    93           3.8
## 6     Bob    89           3.6

# To get the names of all columns of a data frame, use the names() function.
names(Student)

## [1] "Name"          "Time"          "Score"         "HighSchoolGPA"

In addition to vectors and data frames, other data structures include matrices, lists, and factors. A matrix is a rectangular array with rows and columns. To form a matrix, use the matrix() function. A list can hold different objects together. To form a list, use the R function list(). To form a factor, use the function factor().

Lists

A list is an object that can contain various other objects as elements.

Here is an example of lists.

L = list(a = 1:10, b = 1000, c = c("Tom Biden", "Alice Trump")) # The object L is a list of 3 objects
L$a # This will access the element "a" in the list

##  [1]  1  2  3  4  5  6  7  8  9 10

L$c # Access the c-element

## [1] "Tom Biden"   "Alice Trump"

Matrics

A matrix is a rectangular 2-dimensional array with elements that are either all numeric values or all characters. Here is an example of matrices.

x = c(2, 5, 9, 0, 2, 1, 1, 4) # The object x is a numeric vector
M = matrix(x, 2, 4) # A matrix of 2 rows and 4 columns
M

##      [,1] [,2] [,3] [,4]
## [1,]    2    9    2    1
## [2,]    5    0    1    4

M[2, ] # Print the 2nd row as a vector

## [1] 5 0 1 4

M[, 2] # Print the 2nd column as a vector

## [1] 9 0

Writing Your Own Functions

You can write your own functions in R. Every function you write must start with a name. Examples of possible function names are: square, calculate_interest, and greeting. The syntax is

function_name = function(a, b, c, …){

code line 1

code line 2

…

}

where

You have a choice for the function name (which starts with a letter, dot, or underscore) and the list of parameters a, b, c, …, but
you must use the reserved word “function”.
You write your code within the pair of braces.

Here is an example.

# Create the function
f = function(x){
  if (any(x<=0)) {
    return("Input a positive number")
  } else{
    return((1+1/x)^x)
 
  }
}

# Call the function
f(1000000)

## [1] 2.71828

# Plot the function
curve(f, xlim = c(1, 100), ylim = c(2, 3))
# Add a horizontal line at y = e (an irrational number approx. 2.7182828...)
abline(h = exp(1), lty = "dashed", col = "red")

Another example:

Body mass index (BMI) is a measure of body fat based on height and weight that applies to adult men and women. The formula for calculating BMI is

\[BMI = \frac{weight}{height^2}\] where weight is a person’s weight in kilogram and height is their height in meters.

Calc.BMI = function(weight, height){
   bmi = weight/(height^2)
   return(bmi)
  
}

# Call the function
Calc.BMI(80, 1.82)

## [1] 24.15167

A third example:

Write a function that calculates

\[\frac{4}{\pi}\Sigma_{i=1}^{n}\frac{sin((2i-1)x)}{2i-1}\]

for any given $x$ and positive integer $n$.

f = function(x, n){
  if(length(n) != 1){
    return("The number n must be a positive integer!")
  }
  
  s = 0
  for (i in 1:n){
    s = s + sin((2*i-1)*x)/(2*i-1)
  }
   
  return(4/pi*s)
}

# Call the function
f(0.4, 20)

## [1] 1.039534

# Plot the function
curve(f(x, n = 20), from = 0, to = 10, col = "blue", ylab = "S(x)",
  main = "Fourier's Series of the Square Wave")

A fourth example:

The $n!$ can be approximted by the famous Stirling’s formula. \[n!=(\frac{n}{e})^n \sqrt{2\pi n}\]

In R, the built-in function “factorial” can calculate the factorial of any non-negative numbers. For example, the code “factorial(4)” will produce the value of 24.

# Make a function for the approximation.
f = function(n){
  (n/exp(1))^n*sqrt(2*pi*n)
}

curve(factorial, 0, 15, xlab = "n", ylab = "Factorial(n)",
      main = "Stirling's Approximation to Factorials")
curve(f, add = TRUE, col = "red", lty = "dashed")

# The above two curves are very close to each other.

A 5th example:

Write a function that does bubble sorting an array. A reference of bubble sort is https://www.youtube.com/watch?v=xli_FI7CuzA.

bubbleSort = function(numbers, decreasing = FALSE) {
  length = length(numbers)
  
  # The following two loops will sort data in increasing order
  for (loop in 1:length) {            # Handle elements one by one
    for (position in 1:(length-1)) {  # This inside loop will place the targeted element 
                                      # to the right position
      if (numbers[position] > numbers[position + 1]) {
        numbers[position:(position+1)] = rev(numbers[position:(position+1)])  # Swap
      }
    }
  }
  
  if (!decreasing) {
    return(numbers)
  } else {
    rev(numbers)
  }
}


testArray = c(420, 4, 69, 11, 62, -5, 45, 97, 103, 67, -346, 35);

bubbleSort(testArray)

##  [1] -346   -5    4   11   35   45   62   67   69   97  103  420

A 6th example:

## Find Greatest Common Divisor of two numbers (say a & b) using Euclidean algorithm
## Set x = min(a,b) and y = max(a,b). Set x = min(x, y-x) and y = y-x. Repeat until x = y. The GCD(a,b) = x.
gcd=function(v=c(45,75)){
  x=min(v)
  y=max(v)
  if (x==y) {
    return(x)
  } else{
    x=min(x,y-x)
    y = y-x
    return(gcd(c(x, y)))
  }
}

gcd()

## [1] 15

gcd(c(63,54))

## [1] 9

gcd(c(5723, 1261))

## [1] 97

gcd(c(59446125, 81124893))

## [1] 8973

An Exercise

Now write a function that calculates the future value of a savings account. The formula is

\[F = P(1+\frac{r}{n})^{nt}\]

where

$F$ is the future value of the investment,
$P$ is the principle (the amount invested at the begining of the investment),
$r$ is the nominal annual interest rate,
$n$ is the compounding frequency (the number of times interest is compunded in a year), and
$t$ is the length of time of the investment.

# P = principal
# r = interest rate, 
# n = number of compounding frequencies in a year, 
# t = length of investment (usually in years) 
F <- function(P, r, n, t) {
  
}

# Call the function to compute the future value at the end of 10 years for an investment of
# an amount of $5,000 deposited into a savings account with an annual interest rate of 5%, compounded quarterly.

Installing and Loading Packages

R has many freely available packages contributed by users. These packages add extra tools to the R community. To use a package, we must install it and then load it for use. Suppose RStudio has been launched. To install a package, issue the code

install.packages(“XXX”)

on the console, where XXX is the package name.

Installing a package occurs only once. Another way to install a package is to follow this instruction: Click the “Packages” tab at the top of the lower-right window, click the “Install” tab, type the package name, and click the “Install” button. It usually takes a few seconds. The progress of the installing is shown on the console.

To load a package, issue the code

library(XXX)

within the code chunk where the package is needed. If the package is needed in multiple code chunks, you can put the code within the “setup” chunk (the very beginning chunk).

DIY

Question 1

Complete the following code so that it finds the sum of the first 100 positive integers.

S = 0
for (i in 1:100){
  
  
}

S

## [1] 0

Question 2

Given the sequence $F_n$, such that $F_{n + 2} = F_{n + 1} + F_n$ with $F_1 = F_2 = 1, write R code to find the $n$th term for a given $n$.

Question 3

Given the sequence $F_n$, such that $a_{n} = 3*a_{n + 1} - a_{n - 3}$ with $a_{1} = 3, a_{2} = 3$, and $a_{3} = 9$, write R code to find the first $n$ terms mod 17 for a given $n$. Here, “n mod p” represents the remainder of $n$, when divided by $p$. For example, 27 mod 5 equals 2 and 50 mod 17 equals 16. In R, to find 27 mod 5, use the code “27 %% 5”.

  # Define the function
  f = function(n){
    a = c(3, 3, 9)
    for (k in 4:n){
      a[k] = (3*a[k-1] - a[k-3]) %% 17
    }
    
    return(a)
  }

  # Call the function
  f(50) # Does the result suggest any pattern?

##  [1]  3  3  9  7  1 11  9  9 16  5  6  2  1 14  6  0  3  3  9  7  1 11  9  9 16
## [26]  5  6  2  1 14  6  0  3  3  9  7  1 11  9  9 16  5  6  2  1 14  6  0  3  3

A follow-up question: What is the remainder when the 2023rd term is divided by 17?

Generating Samples from a Discrete Distribution using R

How can we simulate the drawing of a sample from a given discrete distribution? Look at an example:

How can we generate a random sample of size 10000 from a distribution that takes value

0 with probability 0.2,
2 with probability 0.1,
6 with probability 0.4,
8 with probability 0.1, and
10 with probability 0.2?

The R code:

n = 10000 # Sample size
v = numeric(n) # Create a template vector v to hold the 10000 values to be genearted

for (i in 1:n){
  u = runif(1)  # Generate a uniformly distributed random number between 0 and 1
  if (u < 0.2){
    x = 0
  } else if (u < 0.3){
    x = 2
  } else if (u < 0.7){
    x = 6
  } else if (u < 0.8){
    x = 8
  } else {
    x = 10
  }
  
  v[i] = x # Store the ith value to the ith place of the vector template
  
}

head(v)

## [1]  2  0 10  0 10  6

table(v)/n  # Create a relative frequency table

## v
##      0      2      6      8     10 
## 0.2033 0.0936 0.3955 0.1014 0.2062

barplot(table(v)/n)  # Create a bar plot

R has a built-in function “sample()” that can be used for sampling from a discrete distribution. Here is how we can redo the previous sampling of 10000 values:

n = 10000
y = sample(x = c(0, 2, 6, 8, 10), 
           size = n, 
           replace = TRUE, 
           prob = c(0.2, 0.1, 0.4, 0.1, 0.2)
          )  # x = population, size = sample size, replace = sampling w/ or w/o replacement, prob = probabilities

# Table the simulation result
table(y)/n

## y
##      0      2      6      8     10 
## 0.2017 0.1011 0.4030 0.0992 0.1950

# Plot the result with a bar graph
barplot(table(y)/n)

Another example: Select a sample of 10 individuals from a population of 1000000 individuals.

y = sample(x = 1000000, 
           size = 10, 
           replace = FALSE
          ) # Sampling Without replacement. Each of the 1000000 has the same chance of being selected.
          
y

##  [1] 907801  78216 262822 617092 264641 302963 650895 214017  96053 195564

How can we simulate the drawing of a sample from a binomial distribution? Look at an example:

# Flip a fair coin 100 times. Repeat the process 25 times.
y = rbinom(n = 25,
           size = 100,
           prob = 0.5
          ) 

y

##  [1] 53 45 44 48 63 57 46 48 49 50 51 47 45 46 47 44 47 49 42 53 55 51 42 52 44

# Table the simulation results
table(y)

## y
## 42 44 45 46 47 48 49 50 51 52 53 55 57 63 
##  2  3  2  2  3  2  2  1  2  1  2  1  1  1

# Plot the result as a bar graph
barplot(table(y))

The number of earthquakes in a year in an area can be modeled by a Poisson distribution with mean 2.3. The following is a simulation of the number of earthquakes in the future 500 years.

y = rpois(n = 500,
          lambda = 2.3
         ) 

# Table the simulation results
table(y)

## y
##   0   1   2   3   4   5   6   7   8 
##  38 143 126  97  53  26  11   4   2

table(y)/500

## y
##     0     1     2     3     4     5     6     7     8 
## 0.076 0.286 0.252 0.194 0.106 0.052 0.022 0.008 0.004

# The probability mass function
dpois(0:10, lambda = 2.3)

##  [1] 0.1002588437 0.2305953406 0.2651846416 0.2033082253 0.1169022295
##  [6] 0.0537750256 0.0206137598 0.0067730925 0.0019472641 0.0004976342
## [11] 0.0001144559

# Plot the result as a bar graph
barplot(table(y), xlab = "Number of Earthquakes", ylab = "Frequency")

Intelligence quotient (IQ) can be modeled by a normal distribution with mean 100 and standard deviation 15. Here is a simulation of drawing 50 individuals from such a population.

# Drawing a sample of size 50 from N(mean = 100, sd = 15)
y = rnorm(n = 50,
          mean = 100,
          sd = 15
         )

# Create a frequency histogram for the data      
hist(y, 
     xlab = "IQ", 
     main = "Histogram of IQ Scores"
    )

# Create a probability histogram for the data      
hist(y, 
     xlab = "IQ", 
     main = "Histogram of IQ Scores",
     probability = TRUE,
     ylim = c(0, 1/(sqrt(2*pi)*15))
    )

# Superimpose the population pdf on the density histogram
curve(dnorm(x, 100, 15),    # x is a placeholder
      lty = "dashed",
      col = "red",
      add = TRUE
     )

The life time of an electronic component can be modeled by an exponential distribution with mean 1000 hours. Here is a simulation of drawing 50 components from such a population.

# Drawing a sample of size 50 from Exponential(mean = 1000)
y = rexp(n = 50,
         rate = 1/1000
        )
        
# Create a probability histogram for the data   
hist(y, 
     xlab = "Life Time", 
     main = "Histogram of Life Times of Electronic Components",
     probability = TRUE,
     ylim = c(0, 1/1000)
    )

# Superimpose the population pdf on the histogram
curve(dexp(x, rate = 1/1000),   # x is a placeholder
      lty = "dashed",
      col = "red",
      add = TRUE
     )

The following is an example of solving an interview problem using simulation:

Suppose we play a game in which I give you a die to roll and offer to pay you in dollars the number that you roll (if you roll a 3, I’ll pay you $3). I also stipulate that if you’re dissatisfied with the number that you roll on your first try, I’ll let you roll the die again with the same offer. I will allow a maximum of 3 “do-overs”, though you can elect to stop after any roll.

How much would you pay me to play this game?

N = 10000    # To play the game N times
earning = NULL
for (i in 1:N){
  k = sample(1:6, 1)  # Roll the die
  j = 2
  if (k <= 3 & j <= 3) {
    k = sample(1:6, 1)  # Roll the die again
    j = j + 1
  }
  earning[i] = k
}

table(earning)

## earning
##    1    2    3    4    5    6 
##  814  809  760 2558 2516 2543

mean(earning) # This is the answer

## [1] 4.2782

Generating Random Numbers from a Continuous Distribution: The Inverse Transform Method

The support of a random variable is the set in which the random variable can take a value. For a continuous random variable $X$ with a support $I$, if its cumulative distribution function $F(x)$ is strictly increasing in $I$, the inverse of $F(x)$ can be found by the following steps:

Set $y=F(x)$ for $x\in I$,
Solve for $x$ in terms of $y$,
Switch $x$ and $y$, and
Replace $y$ by $F^{-1}(x)$.

Example 1. The exponential distribution with mean $\frac{1}{\lambda}$ has the probability density function $f(x)=\lambda e^{-\lambda x}, ~~x>0$. The cumulative distribution function is $F(x)=1-e^{-\lambda x}, ~~x>0$. Find the inverse of $F(x)$.

Solution.

First of all, the cumulative distribution function $F(x)=1-e^{-\lambda x}, ~~x>0$ is strictly increasing.

Set $y=F(x)$: $y = 1-e^{-\lambda x}$
Solve for $x$ in terms of $y$: $1-y=e^{-\lambda x}$, or $ln(1-y)=-\lambda x)$, or $\frac{ln(1-y)}{-\lambda}=x$
Switch $x$ and $y$: $y=\frac{ln(1-x)}{-\lambda}$.
Replace $y$ by $F^{-1}(x)$: $F^{-1}(x) = \frac{ln(1-x)}{-\lambda}$.

Let $F(x)$ be a strictly increasing cumulative distribution function. Let $U$ be a random variable having a uniform distribution in $[0, 1]$. Then, $F^{-1}(U)$ is a random variable with $F(x)$ as its cumulative distribution function.

Here is a proof:

For $x\in I$, \[P(F^{-1}(U)\le x)=P(U\le F(x))=F(x).\] That is, the random variable $F^{-1}(U)$ has the desired cumulative distribution function.

The above result says that to draw/generate a random number from a strictly increasing cumulative distribution $F(x)$, we only need to draw a number from the uniform distribution in $[0, 1]$ (called the standard uniform distribution) and then compute $F^{-1}(U)$. This is the so-called inverse transform method for generating random numbers.

The following code generates a 100 random numbers (called a sample) from the exponential distribution with mean 2.3.

# Generate 100 numbers from the uniform distribution in the interval [0, 1]
U = runif(n = 100, min = 0, max = 1)
  
# Generate 100 numbers from the exponential distribution with mean 2.3
lambda = 1/2.3
X=log(1-U)/(-lambda)
# Create a probability histogram for the generated data vector X
hist(X, probability = TRUE, xlab = "x", ylim = c(0, lambda))
curve(dexp(x, rate = lambda), add = TRUE, lty = "dashed", col = "red")

When the cumulative distribution function $F(x)$ with support $I$ is not strictly increasing in its support, the inverse function of $F(x)$ can not be defined as described above. An extended inverse is given by \[F^{-1}(y)=min\{x\in I|F(x)=y\}.\]

Can I use the inverse transform method for generating random numbers from a discrete distribution with support $I$? Yes, but the method needs to be modified. Here are the steps:

Step 1: Generate a random number U from the standard uniform distribution.

Step 2: Compute $min\{x\in I | F(x)>U\}$.

A proof:

Assume that $I={x_1, x_2, ...}$, a finite or infinite set, with $x_1<x_2<x_3<\cdots$.

For $i=1$, \[P(min\{x\in I | F(x)>U\}=x_i)=P(F(x_1)>U)=x_1\], For $i>1$, \[P(min\{x\in I | F(x)>U\}=x_i)=P(F(x_i)>U \text{and} F(X_{i-1})\le U)=P(F(X_{i-1})\le U<F(x_i))\],

\[\begin{align*} P(min\{x\in I | F(x)>U\}=x_i) & = P(F(x_i)>U \text{and} F(X_{i-1})\le U) \\ & = P(F(X_{i-1})\le U<F(x_i)) \\ & = F(x_i) - F(X_{i-1}) \\ & = (x_1 + x_2 + \cdots + x_i) - (x_1 + x_2 + \cdots + x_{i-1}) \\ & = x_i \end{align*}\]

Example 2.$ Generate 1000 random numbers from the Poisson distribution with mean 3.4. The cumulative distribution function is \[F(x)=\sum _{j=0}^{\left[x\right]}\frac{\lambda^j}{j!}e^{-\lambda}\]

where $[x]$ is the largest integer no greater than $x$.

# Generate 1000 numbers from the uniform distribution in the interval [0, 1]
n = 1000
U = runif(n = n, min = 0, max = 1)
  
# Generate 100 numbers from the Poisson distribution with mean 3.4
lambda = 3.4
X = numeric(n)

for (i in 1:n){

  S = 0 # Initialization
  j = 0 # The smallest possible value of the Poisson distribution
  
  while (S<=U[i]){
    S = S + lambda^j/factorial(j)*exp(-lambda)
    j = j + 1
  }
  X[i] = j - 1
}

# The relative frequency distribution of the generated data
table(X)/n

## X
##     0     1     2     3     4     5     6     7     8     9    10    11 
## 0.026 0.113 0.198 0.218 0.178 0.134 0.070 0.037 0.020 0.003 0.002 0.001

# The thoeretical probabilities
dpois(0:10, lambda = lambda)  # Match very well

##  [1] 0.033373270 0.113469118 0.192897500 0.218617167 0.185824592 0.126360723
##  [7] 0.071604409 0.034779285 0.014781196 0.005584007 0.001898563

Generating Random Numbers from a Continuous Distribution: The Rejection Method

When the inverse of a cumulative distribution function of a continuous random variable is difficult to find, the inverse transform method fail to work. The rejection method we will introduce uses a a new distribution (called the candidate) whose probability density function $g(x)$is easy to handle.

Here is how the method works:

Choose a probability density function $g(x)$ with the same support as the target distribution whose proability is $f(x)$, such that $\frac{f(x)}{g(x)}\le c$, for all $x$ in the common support. Here $c>1$ is a constant, the closer to one the more efficient the method.
To generate a random number from the target distribution, repeat

\[\begin{align*} x & \sim g(x) \\ \alpha & = \frac{f(x)}{g(x)}\frac{1}{c} \\ u & \sim Uniform(0,1) \end{align*}\]

until $u\le \alpha$.

It can be shown that the the method has an acceptance rate of $\frac{1}{c}$.

The following is the code for generating $N=10000$ values from a pdf $f(x) = 0.5x, 0<x<2$. The proposal density is $g(x) = 0.5, 0<x<2$, the uniform distribution on (0, 2).

f = function(x) {

   return( 0.5*x )

}      # The function f is the pdf of the target distribution whose support is the interval (0, 2).

N=10000 # Size of the sample to generate from a distribution with pdf f(x).

y = numeric(N) # Initial states of the N values stored in the y object
i = 0 # The object i will hold the number of candidate values before N valid values are generated. These candidates are generated from a proposal distribution.
n = 0 # The object n controls the iterations
while(n<=N){
  i = i + 1
  x = runif(1, 0, 2) # The proposal distribution is the uniform(0,2) which has the same support as the desired distribution.
  u = runif(1)
 
  c = 2 # The maximum value of the ratio of f(x) and g(x), where g(x) is the pdf of the proposal distribution and is 0.5 on the interval (0,2).
   
  alpha = f(x)/0.5*1/c   # Read my lecture notes
  
  if (u <= alpha){
    n = n + 1  # The number of accepted values increases by 1
    y[n] = x   # Accept x and store it in the vector y
  }
}
cat("The acceptance rate: ", N/i, ".")

## The acceptance rate:  0.4988029 .

hist(y, probability = TRUE, xlab = "x", main = "Histogram of the Generated Sample")
curve(f, add = TRUE, lty = "dashed", col = "red")

When the target distribution has a finite support $[a, b]$ and $f(x)\le c$, the uniform distribution in $[a,b]$ can be used as the candidate distribution. That is, $g(x)=\frac{1}{b-a}, ~~a\le x\le b$. Then, the rejection method becomes:

Generate $X$ from the uniform distribution on $[a, b]$ and $Y$ from the uniform distribution on $[0, c]$.
If $Y \le f(X)$, accept $X$. Otherwise, repeat the procedure.

Chapter 6. Stochastic Processes

A stochastic process is a function of two variables $X(t, \omega)$, where $t\in T$ represents time and $\omega\in \Omega$ represents an outcome. Let’s say you flip the same coin every day at 1:00 pm. If you get a head, you do 100 sit-ups; if you get a tail, you do 30 sit-ups. Let $X(t, \omega)$ represent the number of sit-ups you do in day $t$ depending on the outcome of flipping a coin. Then, for a fixed day $t$, $X$ is a function of $\omega$ only and thus is a random variable taking two values 100 and 30 (called states) each with chance 0.5. For a fixed outcome $\omega$, $X$ is a function of day $t$ and is not a random variable, representing a sample path or a trajectory or a realization of the process $X(t, \omega)$.

Classifications of Processes

Processes can be classified according to the type of time and the type of the state.

time	state	process	example
continuous	continuous	continuous-time and continuous-state	CPU usage, air temperature
continuous	discrete	continuous-time and discrete-state	the number of bacteria in a container
discrete	continuous	discrete-time and continuous-state	daily sales
discrete	discrete	discrete-time and discrete-state	daily spam emails received

In general, the outcome $\omega$ in $X(t, \omega)$ can be complex and usually is of little interest, so it is often omitted, and we write a process simply as $X(t)$ or $X_t$.

In this course, we only focus on discrete-time and discrete-state processes. For such a process, the time domain $T={0, 1, 2, 3, \cdots, }$. Instead of $t$, we use $n$ to denote time, and the process is denoted as $X(n)$ or $X_n$.

Markov Processes and Markov Chains

When the distribution of any future of a process depends only on the most recent behavior of the process, the process is said to be Markovian. A discrete-time and discrete-state Markov process is called a Markov chain.

We call $P(X_{n+h}=j|X_n = i)$ the $h$-step transition probability and denote it by $p_{ij}^{(h)}(n)$. When such probability does not depend on time $n$, we write $p_{ij}^{(h)}$. When $h=1$, we write $p_{ij}$, called the 1-step transition probability or just transition probability.

The initial distribution of a Markov chain is usually known. it is denoted by $P_0 (x) = P(X(0) = x)$.

There are three useful questions regarding a Markov chain:

How can we compute $p_{ij}^{(h)}$?
What is the distribution of the process $h$-step later? This distribution is denoted by $P_h (x) = P(X_h = x)$. The mean of this distribution can be used to forecast the state of the process $h$-step later. The limit of this distribution is called the steady-state distribution of the Markov chain and it is denoted by $\pi$.
What is the limit of the transition probability $p_{ij}^{(h)}$ as $h\rightarrow \infty$? This can be used when finding the long-term forecast for the process.

The 1-step transition probabilities can be placed in a matrix, called the transition probability matrix, denoted by $P$. The $h$-step transition matrix, denoted by $P^{(h)}$, can be found by multiplying the $P$ matrix $h$ times; that is $P^{(h)}=P^h$. The distribution of the process $h$-step later can be found by multiplying the initial distribution by the $h$-step transition matrix.

To find the steady-state distribution, we can solve the matrix equation $\pi P=\pi$. This equation suggest that $\pi$ is the eigen-vector of the transpose of the $P$ matrix corresponding to the eigen-value 1. Recall that a vector $v$ is said to be an eigen-vector of a square matrix $A$, if there exists a scalar $\lambda$ such that $Av=\lambda v$.

All of computations can be done using R.

Example 1. The price of a stock changes day to day. Each day is either up, flat, or down. A “up” day is followed by another “up” day with probability 0.4, a “flat” day with probability 0.13, and a “down” day with probability 0.47. A “flat” day is followed by another “flat” day with probability 0.1, a “up” day with probability 0.45, and a “down” day with probability 0.45. A “down” day is followed by another “down” day with probability 0.38, a “flat” day with probability 0.16, and a “up” day with probability 0.46.

Today, the stock price is “down”.

Use notation to denote the process clearly. Is it a Markov chain?
Construct a transition diagram. A transition diagram can show the state transitions along with the corresponding probabilities.
What is the transition matrix?
What is the initial distribution of the process?
Find the 2-step transition matrix.
Find the 3-step transition matrix.
Find the distribution of the process 3 days later.
What is the steady state distribution? This distribution is the long-term forecast of the stock price.
How would you plot the sample path of the process? That is, how can you plot the probability of each state versus day?

Solution.

Let $X(n)$ represent the state (“up”, “flat”, “down”) of the stock price in day $n$. It is a Markov chain, since the future state of the process depends only on the immediate previous state.
The transition matrix is given by \[P = \begin{bmatrix} 0.40 & 0.13 & 0.47 \\ 0.45 & 0.10 & 0.45 \\ 0.46 & 0.16 & 0.38 \\ \end{bmatrix}\]
The initial distribution of the chain is \[P_0 = (0, 0, 1)\] where 0, 0, and 1 are the probabilities in the order of “up”, “flat”, and “down”.
The 2-step transition matrix is given by \[P^{(2)} = P^2 = \begin{bmatrix} 0.4347 & 0.1402 & 0.4251 \\ 0.4320 & 0.1405 & 0.4275 \\ 0.4308 & 0.1366 & 0.4326 \\ \end{bmatrix}\]

Here is how matrix multiplication is done in R:

P = matrix(c(0.40, 0.45, 0.46, 0.13, 0.10, 0.16, 0.47, 0.45, 0.38), 3, 3)  
P%*%P   # P square

##        [,1]   [,2]   [,3]
## [1,] 0.4347 0.1402 0.4251
## [2,] 0.4320 0.1405 0.4275
## [3,] 0.4308 0.1366 0.4326

The 3-step transition matrix is given by \[P^{(3)} = P^3 = \begin{bmatrix} 0.432516 & 0.138547 & 0.428937 \\ 0.432675 & 0.138610 & 0.428715 \\ 0.432786 & 0.138880 & 0.428334 \\ \end{bmatrix}\]

Check it in R:

P = matrix(c(0.40, 0.45, 0.46, 0.13, 0.10, 0.16, 0.47, 0.45, 0.38), 3, 3)  
P%*%P%*%P   # P cube

##          [,1]     [,2]     [,3]
## [1,] 0.432516 0.138547 0.428937
## [2,] 0.432675 0.138610 0.428715
## [3,] 0.432786 0.138880 0.428334

The distribution of the process 3 days later is given by \[P_3 = P_0 P^3 = (0, 0, 1)\begin{bmatrix} 0.432516 & 0.138547 & 0.428937 \\ 0.432675 & 0.138610 & 0.428715 \\ 0.432786 & 0.138880 & 0.428334 \\ \end{bmatrix}=(0.432786, 0.13888, 0.428334)\]

Check it in R:

P = matrix(c(0.40, 0.45, 0.46, 0.13, 0.10, 0.16, 0.47, 0.45, 0.38), 3, 3) 
pi0 = c(0, 0, 1)
pi0%*%P%*%P%*%P

##          [,1]    [,2]     [,3]
## [1,] 0.432786 0.13888 0.428334

The steady-state distribution is calculated as follows:

We want to find a row vector $\pi=(a, b, c)$, such that $a+b+c=1$ and \[\pi\cdot \begin{bmatrix} 0.40 & 0.13 & 0.47 \\ 0.45 & 0.10 & 0.45 \\ 0.46 & 0.16 & 0.38 \end{bmatrix}=\pi.\]

The matrix equation can be written as \[(0.40a+0.45b+0.46c, 0.13a+0.10b+0.16c, 0.47a+0.45b+0.38c) = (a, b, c)\]

or, \[0.40a+0.45b+0.46c = a ~~\text{and} ~~0.13a + 0.10b + 0.16c = b, 0.47a+0.45b+0.38c=c\] which can be simplified to \[0.45b + 0.46c= 0.60a ~~0.13a + 0.16c = 0.90b ~~0.47a+0.45b=0.62c\]

Eliminating $b$ from the first two equations yields $1.07a = 1.08c$ or $c = 0.99074a$.

Replacing $c$ in the third equation yields \[b = 0.320575a\]

Note that $a+b+c=1$, so $a+0.320575a+0.99074a=1$ or \[a = 0.432654\] which implies that $b=0.138698$ and $c=0.428648$.

To summarize, the steady-state distribution of the process is in the long run, about 43.27% of the time the stock price will go up, about 13.87% of the time the stock price will stay flat, and about 42.86% of the time the stock price will go down.

You can use R to do the calculation:

   P = matrix(c(0.40, 0.45, 0.46, 0.13, 0.10, 0.16, 0.47, 0.45, 0.38), 3, 3)  
   a = eigen(t(P))[[2]][,1]
   a/sum(a)

## [1] 0.4326538 0.1386985 0.4286477

The steady-state distribution may not be unique. When is it unique? The answer is when the Markov chain is regular. A Markov chain is said to be regular, if some power of the transition matrix $P$ contains entries that are all positive. That is, there exists an $h$, such that $p_{ij}^{(h)}>0$ for all $i$ and $j$, which means that any two different states of a Markov chain are likely to communication in finite steps.

Example 2. Does the Markov chain (having states 0, 1, 2, and 3) with the following transition matrix have a unique steady-state distribution?

\[P = \begin{bmatrix} 0 & 0.1 & 0.4 & 0.5 \\ 0.4 & 0.6 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ \end{bmatrix}\]

Solution.

No, since the steady-state distribution has the form $(0,0,0, d, e)$, where $d$ and $e$ are any non-negative numbers such that $d+e=1$.

In addition, the fifth element on the diagonal of the transition matrix is 1, indicating that 3 is an absorbing state (meaning that once the chain is in that state, the chain remains there). Some chains may have a few absorbing states.

Example 3. Does the Markov chain with the following transition matrix have a unique steady-state distribution?

\[P = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}\]

Solution.

Yes, since the steady-state distribution is $(0.5, 0.5)$. But, the limit of $P^{(h)}$ deos not exist, because $P^{(h)}$ takes $P$ or the 2 by 2 identity matrix as values for any $h=1, 2, 3, \cdots$. Since both of the two matrices have zero entries, the chain is irregular. In addition, the chain is periodic with period 2, since $X(n) = X(n+2)$ for all $n$ with probability 1. A periodic Markov chain cannot be regular.

Counting Processes

A stochastic process $X(t)$ is counting if $X(t)$ represents the number of items counted by the time $t$. The items can be black swans, mistakes made, text messages received, customers serviced, and so on. Each sample path of a counting process is always non-decreasing. All counting processes are discrete-state, taking values from the set ${0, 1, 2, 3, ...}$.

Example. The following code chunk shows the times at which an email is received. An email may have or may not have an attachment. There are two counting processes, one for counting emails and the other for counting attachments.

t = c(0, 8, 22, 30, 32, 35, 40, 41, 50, 52, 57)
email = c(0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
attachment = c(0, 1, 0, 0, 0, 5, 0, 0, 2, 0, 0)

cp1 = cumsum(email) # The values of the first counting process
cp2 = cumsum(attachment) # The values of the second counting process

plot(cp1~t, type = "s", xlab = "t (min)", ylab = "X(t)", col = "blue")

plot(cp2~t, type = "s", xlab = "t (min)", ylab = "A(t)", col = "red")

The graph shows that exactly one email is received at each of the times 8, 22, 30, 32, 35, 40, 41, 50, 52, and 57, all in minutes. One attached is with the second email, 5 attachments are with the 6th email, and 2 attachments are with the 8th email.

We will introduce two counting processes: the discrete-time discrete-state Binomial process and the continuous-time discrete-state Poisson process.

The Binomial Process

Recall that any experiment with two outcomes (generically called success/failure) is called a Bernoulli trial.

Let $X(n)$ be the total number of successes in the first $n$ independent Bernoulli trials, where $n = 0, 1, 2, ...$. $X(n)$ is called a Binomial process, which is discrete-time and discrete-state. In addition, the Binomial process is a markov chain.

We already know that $X(n)$ at any fixed time $n$ has a Binomial distribution with parameters $n$ and $p$, where $p$ is the probability of observing a success.

Next, we introduce the modeling of the arrivals of events in a time interval of length $t$ with the Binomial process.

First, we split the time interval into many tiny intervals (called frames) of the same length $\Delta$. We assume that the frames are so small that at most one event can be observed in a frame. The number of frames $n = \frac{t}{\Delta}$. Thus, a Bernoulli trial occurs in each frame. Since $X(n)$ represents the total number of successes (or arrivals), it has a Binomial distribution with paramter $n$ and $p$ (the probability of success). The mean of $X(n)$ is $n\cdot p = \frac{t}{\Delta}p$. This is mean number of successes in a time interval of length $t$. When this mean is divied by $t$, we get the rate of arrivals $\frac{p}{\Delta}$, which is the average number of successes per one unit of time. Denote this arrival rate by $\lambda$ and we have $\lambda = \frac{p}{\Delta}$.

Let $T$ denote the time between two consecutive successes (called the inter-arrival time). $T$ is a random variable and can be expressed as $T = Y\Delta$, where $Y$ represents the number of frames between the two consecutive successes. We know that $Y$ has a Geometric distribution with parameter $p$ and its mean is $\frac{1}{p}$ (Chapter 3). The mean of $T$

\[\mu = E(T)=E(Y\Delta)=E(Y)\cdot \Delta=\frac{\Delta}{p}=\frac{1}{\lambda}\]

The variance of $T$ can be calculated as follows:

\[\sigma^2 = Var(T)=Var(Y\Delta)=Var(Y)\cdot \Delta^2=\Delta^2\frac{1-p}{p^2}=\frac{1-p}{\lambda^2}\]

Example 1.

Jobs are sent to a mainframe computer at a rate of 2 jobs per minute. Arrivals are modeled by a Binomial counting process.

What is a frame size that makes the probability of a new job during each frame equal 0.1?
Using the chosen frmaes, compute the probability of more than 3 jobs received during one minute.
Computer the probability of more than 30 jobs during 10 minutes.
What is the mean inter-arrival time, and what is the variance?
Compute the probability that the next job does not arrive during the next 30 seconds.

Solution.

We are given: $\lambda = 2$/min.

We know that $p=0.1$, so $\Delta=\frac{p}{\lambda}=\frac{0.1}{2}=0.05$ minute or 3 seconds.
We know that $t = 1$ min or 60 seconds, so we have $n = \frac{t}{\Delta}=\frac{60}{3}=20$ frames. The number of jobs during this one minute period, denoted by $X(20)$, has a Binomial distribution with parameters $n=20$ and $p=0.1$. We want to find the probability

\[P(X(20)>3)=1-P(X(20)\le 3)=1-0.8670 = 0.1330,\] where the code “pbinom(3, 20, 0.1)” was used to obtain the number 0.8670.

We know that there are $n=600/3=200$ frames. We want to find the probability

\[P(X(200)>30)=1-P(X(200)\le 300)=1-0.9905 = 0.0095,\] where the code “pbinom(30, 200, 0.1)” was used to obtain the number 0.9905.

The mean inter-arrival time is $E(T)=\frac{1}{\lambda}=\frac{1}{2}=0.5$ minute or 30 seconds. The variance of the inter-arrival time is $Var(T)=\frac{1-p}{\lambda^2}= \frac{1-0.1}{2^2}=0.225$ square minute or 810 square seconds.
We wnat to find the probability

\[P(T>0.5~\text{min})=P(Y\Delta>0.5)=P(Y(0.05)>0.5)=P(Y>10)=1-P(Y\le 10)=1-0.6513=0.3487.\] where the code “pgeom(9, 0.1)” or the formula $(1-p)^{10}$ was used to obtain the number 0.6513. Note that for the geometric distribution, R defines the random variable to be the number of failures before the success; that is, R calls the distribution of Y-1 geometric distributed.

The Binomial counting process $X(n)$ is a Markov chain with states 0, 1, 2, 3, …. The transition matrix $P$ has $\infty$ rows and $\infty$ columns, with diagonal elements being all equal to $1-p$ and upper-off diagnal being all equal to $p$. The chain is irregular, since $X(n)$ is non-decreasing and thus $p_{10}^{(h)}=0$ for all $h$. In addition, the chain has no steady-state distribution (solving the equation $\pi P = \pi$ yields no solution).

The Poisson Process

In a binomial process with arrival rate $\lambda$, the time interval $[0, t]$ is divided into many small intervals (called frames) of the same length $\Delta$. The number of events up to time $t$ is denoted by $X(t)$ can be modeled by the binomial distribution with $n = \frac{t}{\Delta}$ trails and success probability $p = \Delta\cdot \lambda$. It can be shown that, when $\Delta$ approaches zero, the binomial distribution can be approximated by the Poisson distribution with rate $\lambda \cdot t$. A counting process $X(t)$ is said to be Poisson, if it has a Poisson distribution with rate $\lambda \cdot t$ in the time interval $[0, t]$ for at any time $t$. Here, $\lambda$ is called the arrival rate of the Poisson process.

The inter-arrival time between two consecutive arrivals of events can be shown to have an Exponential distribution.

Proof:

Let $X(t)$ have a Poisson distribution with rate $\lambda \cdot t$ in the time interval $[0, t]$ at any time $t$. Let T be the inter-arrival time between two consecutive arrivals of events. Then,

\[ \begin{aligned} P(T\le t) &= 1-P(T>t) \\ &= 1-P(\text{No arrival of any new event in a period of time}~ t) \\ &= 1-P(X(t)=0) \\ &= 1-\frac{(\lambda t)^0}{0!}e^{-\lambda t}\\ &= 1-e^{-\lambda t} \end{aligned} \]

which indicates that $T$ has an Exponential distribution with mean $\frac{1}{\lambda t}$.

Example 1. Messages arrive at an electronic message center at random times, with an average of 9 messages per hour. Assume that the arrivals can be modeled by a Poisson counting process.

What is the probability of receiving exactly 6 messages during the next two hours?
What is the probability of receiving at least 6 messages during the next two hours?

Solution.

$\lambda = 9$/hour, $t=2$, so \[P(X(2)=6)=\frac{(9 \cdot 2)^6}{6!}e^{-9 \cdot 2}=0.0911\] You can also use the R code: dpois(6,9) to get the same result.
$\lambda = 9$/hour, $t=2$, so \[ \begin{aligned} P(X(2)\ge 6)&=1-P(X(2)\le 5)\\ &=1-(\frac{(9 \cdot 2)^0}{0!}e^{-9 \cdot 2}+ \frac{(9 \cdot 2)^1}{1!}e^{-9 \cdot 2}+ \frac{(9 \cdot 2)^2}{2!}e^{-9 \cdot 2}+ \frac{(9 \cdot 2)^3}{3!}e^{-9 \cdot 2}+ \frac{(9 \cdot 2)^4}{4!}e^{-9 \cdot 2}+ \frac{(9 \cdot 2)^5}{5!}e^{-9 \cdot 2})\\ &=0.9997 \end{aligned} \] You can also use the R code: 1-ppois(5,9*2) to get the same result.

An Application of the Poisson process in bitcoin: https://www.ems-ph.org/journals/show_pdf.php?issn=1027-488X&vol=3&iss=115&rank=8

Simulation of Stochastic Processes

R code for simulating the Binomial process:

Sim.BinomialProcess = function(n=20, p=0.5){ ## Define a function
  U=runif(n) ## Generate n Uniform(0,1) random variates
  Bernoulli = (U<p) ## If any of the above variates is less than p, convert it to 1 (success)
  bp = cumsum(Bernoulli) ## The cumsum function in R gives the cumulative sums                                       
                         ## For example, cumsum(1:5) would give 1, 3, 6, 10, 15
                         ## So, the function counts numbers of successes
  plot(1:n, bp, xlab= "", xaxt= "n", ylab = "X(n)", 
    main = "Simulation of a Binomial Counting Process") ## If xaxt = “n”, there is no tick marks on the x-axis 
  axis(1, at=1:n, labels= ifelse(Bernoulli, "S", "F")) ## Here 1 indicates the x-axis.           
                                                       ##Place  labels on the x-axis: if an element of Bernoulli is non-zero,
                                                       ## put an “S”
  axis(2, at = 0:max(bp))  ## On the y-axis (indicated by 2),                                       
                           ## put integers from 0 to max of bp
}

Sim.BinomialProcess(n=30, p=0.3) ## Use the defined function

R code for simulation of a Brownian Motion (a continuous-time and continuous-state process, textbook page 159):

n= 1000
normvec = c(0, rnorm(n)) ## The process starts at  0, with independent normal increments
x= cumsum(normvec)
plot(x, xlab = "t", ylab = "X(t)", main = "Simulation of a Brownian Motion Process")

R code for simulation of Poisson processes:

tLimit = 1000
lambda = 0.04 # Arrival rate
a=0
T= NULL ## to hold arrival times

while (a <= tLimit){    
   Y = -1/lambda*log(runif(1)) ## Generate an inter-arrival time from Exponential(0.04) distribution    
   a = a + Y ## Find the arrival time    
   T = append(T, a) ## The T values are like those S values shown
}

X = NULL
V = seq(0, tLimit, by=0.05)

for (t in V){
   X = append(X, sum(T<=t))
}

plot(V, X, type = "s", xlab= "t", ylab = "X(t)",
   main = "Simulation of a Poisson Process")  ## Trajectory of the process

Chapter 7 Queneing Systems

A queueing system is a facility consisting of a queue of jobs (tasks) waiting to be processed (performed) and one or several servers designed to process the jobs (perform the tasks).

Pictures that show q queneing system:

http://staff.um.edu.mt/jskl1/simweb/intro.htm

http://www.me.utexas.edu/~jensen

Examples of queueing systems are:

A medical office serving patients
A fast food drive-through lane
An automated teller machine (ATM) in a bank
A customer service with one or more representatives answering calls from customers
A printer processing jobs

The Kendall notation (1953) for classifying queueing systems is $A/B/s/q/c/p$, where:

$A$ is the arrival pattern (distribution of the time between two consecutive arrivals),
$B$ is the service pattern (distribution of the service time),
$s$ is the number of servers,
$q$ is the queuing discipline (by default, FIFO for First In First Out),
$c$ is the system capacity (infinity if omitted),
$p$ is the population size (number of possible customers). Omitted for open systems.

Examples are $M/M/1$ and $G/G/1$, where the first $M$ indicates the arrival process of jobs is Poisson and thus has exponential inter-arrival times (Markovian or memoryless), the second $M$ indicates that the service time is also exponential, and $G$ is a general distribution.

Jobs arrive at a queueing system at random times. A counting process, denoted by $A(t)$, records the number of arrivals by time $t$. In a stationary queueing system whose distribution characteristics do not change over time, arrivals occur at a constant rate $\lambda=(E[A(t)])/t$, which is the expected number of arrivals per unit time. The expected time between arrivals then is $\mu=1/\lambda$.

Jobs are typically processed according to the order of their arrivals on a “first come, first served” basis. Suppose a new job arrives. If any server is available, it will take the new job. If more than one serer is available, the job may be randomized to any of the servers. If no server is available, the new job will join the queue, wait, and get routed to the next available server. We will focus on a single server. Let $\mu_S$ be the mean service time. The number of jobs processed during one unit of time (the service rate) is $\lambda_S=1/\mu_S$.

Performance Measures of a Queueing System

$\lambda_A=$ Arrival rate
$\lambda_S=$ Service rate
$\mu_A=\frac{1}{\lambda_A}=$ Mean arrival time
$\mu_S=\frac{1}{\lambda_S}=$ Mean service time
$r=\frac{\lambda_A}{\lambda_S}=$ Arrival-service ratio (or utilization)

Random Variables That Record a Queueing System

$X_s (t)=$ number of jobs receiving service at time $t$
$X_w (t)=$ number of jobs waiting in a queue for service at time $t$
$X(t)=X_s (t)+X_w (t)=$ number of jobs in the system at time $t$
$S_k =$ service time of the kth job
$W_k =$ waiting time of the kth job
$R_k=S_k+W_k =$ response time of the $k$th job (Total time spending in the system from arrival until departure)

The Little’s Law

MIT professor John Little (1961) discovered that, for any stationary queueing system, \[E(X)=\lambda_A E(R)\] which is called the Little’s Law. The law also applies to the system’s components-The queue and the server: \[E(X_w )=\lambda_A E(W)\] \[E(X_s )=\lambda_A E(S)=\lambda_A \mu_S=r\] The second equation gives a new interpretation of the utilization: utilization is just the mean number of jobs being processed (or, for a single-server case, the probability that the server is busy).

Bernoulli Single-Server Queueing Process with Unlimited Capacity

Consider the queueing system having the following characteristics:

One server
Unlimited capacity
Arrivals occur according to a Binomial process with arrival probability $p_A$ in each frame
The probability of a service completion (and a departure) during each frame is $p_S$, provided that there is at least one job in the system at the beginning of the frame
Service times and inter-arrival times are independent

Then, we have the following results:

The number of frames between successive arrivals has a Geometric$(p_A)$ distribution.
The number of frames each service takes has a Geometric$(p_S)$ distribution.
$p_A=λ_A ∆, p_S=λ_S ∆$
The process for the number of jobs in the system at time $n$, denoted by $X(n)$ or $X_n$, is an irregular Markov chain with a transition diagram:

and a $\infty×\infty$ transition probability matrix having non-zero elements

\[ \begin{align*} p(0→0)&=1-p_A\\ p(0→1)&=p_A\\ p(i→i-1)&=p_S (1-p_A)\\ p(i→i+1)&=p_A (1-p_S)\\ p(i→i)&=(1-p_S )(1-p_A )+p_A p_S \end{align*} \]

for $i=1,2,\cdots$.

The Markov chain $X(n)$ has a steady-state distribution, provided that $\lambda_A<\lambda_S$ (no overload).

Example 1. Suppose that jobs are sent to a printer at a rate of 20 per hour, and that it takes an average of 40 seconds to print each job. Currently, the printer is printing a job, and there is another job stored in a queue. Assume a Bernoulli single-server queueing process with 20-second frames.

Computer $p_A$ and $p_S$.
Determine the probability transition matrix for the process of the number of jobs $X(n)$ in the system in the $n$th frame, $n=0,1,2, \cdots$.
What is the probability distribution of the number of jobs in the 6th frame?
What is the probability that the printer will be idle in the 6th frame?
Find the expected number of jobs being processed in the 6th frame.
Find the expected total number of jobs in the system in the 6th frame.
Find the expected number of waiting jobs in the 6th frame.

Solution.

We are given: $\lambda_A=20$ jobs/hour = (1/3) job/min, $\lambda_S=1/\mu_S =1/40$ job/sec = 1.5 jobs/min, and $\Delta = 1/3$ min. So, $p_A=\lambda_A \Delta=(1/3)(1/3)=1/9$ and $p_S=\lambda_S \Delta=(1.5)(1/3)=1/2$.

To determine the transition probability matrix $P$, we calculate

\[\begin{align*} p(0→0)&=1-p_A=8/9=0.889\\ p(0→1)&=p_A=1/9\\ p(i→i-1)&=p_S (1-p_A)=1/2(1-1/9)=4/9=0.444\\ p(i→i+1)&=p_A (1-p_S )=1/9 (1-1/2)=1/18=0.056\\ p(i→i)&=(1-1/2)(1-1/9)+(1/9)(1/2)=1/2=0.5. \end{align*} \]

for $i=1,2,\cdots$ So, the matrix $P$ is 3-diagonal and looks like

\[P = \begin{bmatrix} 0.889 & 0.111 & 0 & 0 & 0 & 0 & \cdots \\ 0.444 & 0.5 & 0.056 & 0 & 0 & 0 & \cdots \\ 0 & 0.444 & 0.5 & 0.056 & 0 & 0 & \cdots \\ 0 & 0 & 0.444 & 0.5 & 0.056 & 0 & \cdots \\ \vdots &\vdots & \vdots & \vdots & \vdots & \vdots & \vdots \end{bmatrix}\]

The condition “Currently, the printer is printing a job, and there is another job stored in a queue.” Suggests that the initial distribution of $X(n)$ is $p_0=(0,0,1,0,0,\cdots)$. We want to find $p_6=p_0 P^6$. Since there are only two jobs in the system currently, there can be at most 8 jobs in the system after 6 frames. Thus, it is sufficient to consider the first 9 rows and 9 columns of the matrix P. Using R software, we first find $P^6$, then we find $p_6$. The answer is \[p_6=p_0 P^6=(0.644,0.25,0.08,0.022,0.004,0,0,0,0).\]

There are 6 frames in 2 minutes. That the printer is idle is equivalent to “the system has no jobs.” So, the probability that the printer will be idle in 2 minutes is 0.644.

Since the printer processes at most one job at a time, the number of jobs being processed in the 6th frame, $X_s (6)$, takes on values 0 or 1, and has a Bernoulli distribution with $P(X_s (6)=1)=P($Printer is busy$)=1-0.644=0.356$. So, the expectation of $X_s (6)$ is \[E(X_s (6))=(0)(0.644)+(1)(0.356)=0.356.\]

There are 6 frames in 2 minutes. The distribution of $X(6)$ is $(0,1,2,3,4,5,6,7,8,\cdots)$ with probabilities $p_6=(0.644,0.25,0.08,0.022,0.004,0,0,0,0,\cdots)$, so the expected number of jobs,$E(X(6))$, in the 6th frame, is the sum of cross products $(0)(0.644)+(1)(0.25)+\cdots+(0)(0)=0.494.$

Since $X(6)=X_s (6)+X_w (6)$, the expected number of waiting jobs in the 6th frame is \[E(X_w (6))=E(X(6))-(X_s (6))=0.494-0.356=0.138.\]

For Bernoulli single-server queueing processes, a $k$-step transition from state 0 to state $(k+1)$ is impossible, since it requires at least $(k+1)$ arrivals while at most one arrival is possible in each transition. Thus, Bernoulli single-server processes are all irregular Markov chain.

Any system whose service rate exceeds the arrival rate (that is, no overload) does have a steady-state distribution.

Bernoulli Single-Server Queueing Process with Limited Capacity $C$

Many queueing systems may have limited resources for storing jobs. Suppose a maximum number of $C$ jobs can be stored in a queueing system. The number $C$ is called capacity. Such a system is pictured below.

For states $0, 1, 2, \cdots, C,$ the transition probabilities are the same, except $p_{C,C}=1-p_{C,C-1}=1-(1-p_A)p_S$, since no transition from $C$ to $C+1$ is possible. The transition matrix for the case $C=2$ is

\[P = \begin{bmatrix} 1-p_A & p_A & 0 \\ (1-p_A)p_S & (1-p_A)(1-p_S) + p_Ap_S& p_A(1-p_S) \\ 0 & (1-p_A)p_S & 1-(1-p_A)p_S \end{bmatrix}\]

Example 2. Suppose a telephone has two lines. A customer service representative (the server) can talk to one customer with another on hold. This is a queueing system with capacity $C=2$. Suppose the representative gets an average of 10 calls per hour, and the average phone conversation lasts 4 minutes. Modeling the number of customers $X(n)$ at time n by a Bernoulli single-server queueing process with limited capacity and 1-munute frames, compute the transition matrix.

Solution.

We have $\Delta = 1$ min, $\Delta_A =10$/hour = $1/6$ /min, and $\lambda_S = 1/4$ /min. So, $p_A =\Delta\lambda_A=1/6$ and $p_S=\Delta\lambda_S=1/4$. Plugging these probabilities into the above 3 by 3 matrix gets

\[P = \begin{bmatrix} 5/6 & 1/6 & 0 \\ 5/24 & 2/3& 1/8 \\ 0 & 5/24 & 19/24 \end{bmatrix}\]

$M/M/1$ Queueing System

When the frame delta gets smaller and smaller, the Bernoulli single-server system approaches the $M/M/1$ system which is a continuous-time discrete state system with the following steady-state distribution:

\[π_x=P(X=x)=r^x (1-r),\] for $x=0,1,2, \cdots$ \[P(X≤x)=1-r^(x+1), \] for $x=0,1,2,\cdots$

where $X$ is the number of jobs in the system.

With the help of Little’s Law, we obtain the following performance characteristics:

\[E(R)=μ_S/(1-r)\]

\[E(W)=(μ_S r)/(1-r)\]

\[E(X)=r/(1-r)\]

\[E(X_w )=r^2/(1-r)\]

\[E(X_s )=r\]

\[P(\text{Server is busy})=r=\text{fractions of customers waiting for service}.\]

Example 3. Consider messages arriving to a communication center at random times with an average of 5 messages per minute. The messages are transmitted through a single channel in the order they were received. On average, it takes 10 seconds to transmit a message. Model the system with an $M/M/1$ queue.

What is the probability that the center is busy?
1. How often does the number of messages in the center exceeds 2?
2. What is the expected waiting time for each message before transmission?
3. On average, how many messages are waiting for transmission at each moment?
4. What is the average time for a message to be in the center?
5. On average, how many messages are being transmitted at each moment?

Solution.

We are given $\lambda_A=5$ messages/min and $\mu_S=10$ seconds = 1/6 min. $\lambda_S=6$ messages/min.

The probability that the center is busy is the same as $r$, the utilization, which is $\lambda_A/\lambda_S =5/6$.
We need to find $P(X>2)$. Since $P(X≤x)=1-r^(x+1)$, $P(X>2)=1-P(X≤2)=1-r^3=1-(5/6)^3=91/216=0.4213.$
We need to find $E(W)$. $E(W)=(μ_S r)/(1-r)=((1/6)(5/6))/(1-5/6)=5/6$ min, or 50 sec.
We need to find $E(X_w)$. $E(X_w )=r^2/(1-r)=(5/6)^2/(1-5/6)=25/6=4.17$ messages
We need to find $E(R)$. $E(R)=μ_S/(1-r)=(1/6)/(1-5/6)=1$ min
We need to find $E(X_s)$. Since $E(X_s )=r$, the answer is $5/6$ or 0.83 message.