Lecture for NBC (R and Probability Theory)

Author

Daiju Aiba

Published

August 6, 2024

1 Probability Theory

1.1 Definition of probability function

Here, I would like to give several important theorems and propositions about probability theory. The probability function P(\cdot) must satisfy the following postulates:

  • If A is any event in the sample space \Omega, then 1≥P(A)≥0
  • P(\Omega)=1
  • Let A_1, \ldots, A_k be events in \Omega that are mutually exclusive, i.e., A_i \cap A_j = \emptyset for i \neq j. Then, P(A_1 \cup \ldots \cup A_k) = \sum_{i=1}^{k} P(A_i)
Terminologies of Probability Theory

(Chance or random) experiment: a planned operation carried out under controlled conditions whose results are not predetermined (e.g., throwing a die or flipping a coin).

Outcome: a result of an experiment.

Equally likely: each outcome occurs with the same probability.

  • For example, if we flip a coin, heads and tails occur with equal probability.

Event: any combination of outcomes.

  • The probability that event A occurs is written as P(A).
  • For example, in the experiment of throwing one die,
  • Event A:  “A roll of the die is one” ⇒ P(A) = 1/6
  • Event B:  “A roll of the

Sample space: the set of all possible outcomes.

  • If two dice are thrown, the sample space consists of the 36 combinations of outcomes of the two dice

  • Sample space is often denoted by \Omega

  • If Event A is included in Event B, we can denote it as A\subset B
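
The terms above can be illustrated with a minimal R sketch. It enumerates the sample space for two dice and computes the probability of one event as favorable outcomes divided by total outcomes, assuming every outcome is equally likely. The event "the two dice sum to 7" and all variable names are chosen here only for illustration.

```r
# Enumerate the sample space for two six-sided dice
omega <- expand.grid(die1 = 1:6, die2 = 1:6)
n_outcomes <- nrow(omega)            # number of equally likely outcomes

# Hypothetical event A: "the two dice sum to 7"
event_a <- omega$die1 + omega$die2 == 7

# Probability = favorable outcomes / total outcomes
p_a <- sum(event_a) / n_outcomes

print(n_outcomes)
print(p_a)
```

The same pattern works for any event that can be written as a logical condition on the rows of omega.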

1.2 Consequence of the postulates

If the function satisfies those postulates, then we obtain the following properties.

Complement rule

  • P(A^c)=1-P(A)

Null set

  • P(∅)=0

Addition rule of probability

  • P(A∪B)=P(A)+P(B)-P(A∩B)

Multiplication rule of probability

  • P(A∩B)=P(A)P(B|A)

Conditional probability

  • The probability of event A given that event B has occurred, denoted by P(A|B), is defined as P(A|B)=P(A∩B)/P(B), provided P(B)>0
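
These rules can be checked numerically on a small example. The sketch below uses one fair die with two events chosen only for illustration (A: the roll is even, B: the roll is at least 4); the helper prob() is a hypothetical function that counts outcomes and assumes they are equally likely.

```r
# Sample space of one fair die; outcomes are assumed equally likely
omega <- 1:6
prob <- function(event) length(event) / length(omega)  # hypothetical helper

a <- c(2, 4, 6)   # event A: the roll is even
b <- c(4, 5, 6)   # event B: the roll is at least 4

p_a <- prob(a)
p_b <- prob(b)
p_a_and_b <- prob(intersect(a, b))
p_a_or_b <- prob(union(a, b))

# Conditional probabilities from the definition
p_a_given_b <- p_a_and_b / p_b
p_b_given_a <- p_a_and_b / p_a

# Each comparison should return TRUE
print(all.equal(prob(setdiff(omega, a)), 1 - p_a))      # complement rule
print(all.equal(p_a_or_b, p_a + p_b - p_a_and_b))       # addition rule
print(all.equal(p_a_and_b, p_a * p_b_given_a))          # multiplication rule
```

Changing a and b to other subsets of omega lets you test the rules on other events.
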
Exercise 1

A cell phone company found that 75% of all customers want text messaging on their phones, 80% want photo capability, and 65% want both. What is the probability that a customer will want at least one of these?

Exercise 2

In Exercise 1, we noted that a cell phone company found that 75% of all customers want text messaging on their phones, 80% want photo capability, and 65% want both.

What is the probability that a person who wants text messaging also wants photo capability? (P(A|B))

What is the probability that a person who wants photo capability also wants text messaging?

Exercise 3 “How to Ask a sensitive question without biases”

Suppose a survey is carried out in order to learn the true distribution of answers to the following question.

Question A. Have you ever lied on an employee application?

But this question is quite sensitive, and we might expect that some respondents will not answer it honestly. To overcome this bias, each respondent is asked to flip a coin and, if it lands tails, to answer Question B instead of Question A.

Question B. Is the last digit of your My Number odd?

The survey enumerator does not know which question each respondent answers. The last digit of a My Number is theoretically random, so we expect respondents to answer “yes” to Question B with probability 50%. In the survey results, 37% of the answers were “yes”. What is the probability that a respondent who was answering Question A replied “yes”?

Hint:
We can define the events in the survey as follows:

E: The respondent answered yes

A: The respondent answered Question A

B: The respondent answered Question B

Then, we can write P(E) as P(E)=P(E|A)P(A)+P(E|B)P(B)

So, what is P(E|A)?
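
To see why this survey design works, the following R sketch simulates it. The number of respondents and the true rate of "yes" to Question A (0.30) are hypothetical values chosen only for illustration, not the figures from the exercise. The last step rearranges the formula in the hint to recover P(E|A) from the observed share of "yes" answers.

```r
set.seed(1)                       # fixed seed so the sketch is reproducible
n <- 100000                       # hypothetical number of respondents
true_rate <- 0.30                 # hypothetical P(E|A), for illustration only

heads <- rbinom(n, 1, 0.5) == 1          # heads: Question A, tails: Question B
yes_a <- rbinom(n, 1, true_rate) == 1    # honest answer to Question A
yes_b <- rbinom(n, 1, 0.5) == 1          # last digit is odd with probability 0.5
answered_yes <- ifelse(heads, yes_a, yes_b)

# The enumerator only observes the overall share of "yes", an estimate of P(E)
p_e <- mean(answered_yes)

# Rearranging P(E) = P(E|A)P(A) + P(E|B)P(B) with P(A) = P(B) = P(E|B) = 0.5
p_e_given_a <- (p_e - 0.5 * 0.5) / 0.5
print(p_e_given_a)                # should be close to true_rate
```

No individual answer reveals which question was asked, yet the aggregate still identifies P(E|A).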

1.3 Discrete Probability

In discrete probability with equally likely outcomes, the probability of an event is defined as the ratio of the number of favorable outcomes to the total number of outcomes. For example, if we throw a die, the probability of getting a 1 is 1/6.

Terminologies of Probability Theory

Random variable

A variable that takes on numerical values determined by the outcomes in the sample space of a random experiment.

  • We denote a random variable by uppercase X

  • We denote its realized values by lowercase x.

Discrete random variable

It can take on only a finite (or countably infinite) number of values.

  • e.g., the number of people in a class, the number of sales of cars

Continuous random variable

It can take any value in an interval.

  • e.g., temperature, income of family, amount of sales.

1.4 Continuous Probability

The total assets of Cambodian commercial banks approximately follow an exponential distribution (data are from the Supervision Annual Report).

library(readxl)
library(ggplot2)

file_path <- "Data/NBC_Bank.xlsx"
data <- read_excel(file_path, sheet = "Sheet1", range = "A3:DB756")

p <- ggplot(data, aes(x = total_asset)) + geom_histogram()
p 
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 1 row containing non-finite outside the scale range
(`stat_bin()`).

In contrast, the growth rates of each bank's total assets approximately follow a normal distribution. The growth rate can be calculated as follows.

Growth Rate_t = \frac{Asset_t-Asset_{t-1}}{Asset_{t-1}}*100

library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
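# Compute the period-over-period growth rate (%) within each bank (ID);
# this assumes rows are ordered by time within each ID
data2 <- data %>%
  group_by(ID) %>%
  mutate(growth_rate = (total_asset - lag(total_asset)) / lag(total_asset) * 100) %>%
  # Keep growth rates up to 100%; rows with NA growth (first period per bank) are dropped
  filter(growth_rate <= 100)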

p <- ggplot(data2, aes(x = growth_rate)) + geom_histogram()
p 
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Alternatively, if you take the natural logarithm of total assets, the distribution approximately follows the normal distribution.

p <- ggplot(data, aes(x = log(total_asset))) + geom_histogram()
p 
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 1 row containing non-finite outside the scale range
(`stat_bin()`).

Geometric growth rate and exponential growth rate

The geometric growth rate is defined as follows.

1+r = \frac{x_t}{x_{t-1}} (= e^{\rho})

In the equation above, r is the geometric growth rate. It can also be expressed as e^{\rho}-1, where \rho is called the exponential growth rate. Taking the natural logarithm of 1+r gives the exponential growth rate:

Exponential Growth Rate_t = \rho = log(\frac{x_t}{x_{t-1}})

The exponential growth rate has transitivity, a mathematical property that is useful for analysis.

Transitivity means that chained indices are identical to the corresponding direct indices. For example,

log(\frac{x_t}{x_{t-2}}) = log(\frac{x_t}{x_{t-1}}) + log(\frac{x_{t-1}}{x_{t-2}})

This means that the sum of the monthly growth rates over the past 12 months equals the annual growth rate. (Unfortunately, the simple sum of geometric growth rates does not have this property.) This property is useful and necessary for time series analysis, such as the analysis of inflation rates.
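
The transitivity property can be verified with a short R sketch. The series x below is hypothetical and chosen only for illustration. The sketch compares the sum of per-period rates with the rate computed directly from the first and last values.

```r
x <- c(100, 102, 105, 103, 108)   # hypothetical series, for illustration only
n <- length(x)

geometric <- x[-1] / x[-n] - 1    # r_t = x_t / x_{t-1} - 1
exponential <- diff(log(x))       # rho_t = log(x_t / x_{t-1})

# Growth over the whole period, computed directly from the endpoints
total_geometric <- x[n] / x[1] - 1
total_exponential <- log(x[n] / x[1])

# Exponential rates add up to the direct rate; geometric rates do not
print(c(sum(exponential), total_exponential))
print(c(sum(geometric), total_geometric))
```

The first pair of printed values agrees, while the second pair differs slightly because geometric rates compound rather than add.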

2 Sample Distribution

In this section, we are going to see how sample statistics can be characterized using probability theory. Let the random variables X_1, X_2, \ldots, X_n denote a random sample from a population with mean \mu and variance \sigma^2.

The sample mean is defined as follows:

\bar{X} = \frac{1}{n} (X_1 + X_2 + \ldots + X_n) = \frac{1}{n} \sum_{i=1}^{n} X_i

The mean and variance of the sample mean are as follows:

Mean of sample mean

E[\bar{X}] = E\left[\frac{1}{n} (X_1 + X_2 + \ldots + X_n)\right] = \frac{1}{n} (\mu + \mu + \ldots + \mu) = \mu

This result means that the sample mean is an unbiased estimator of the population mean.

Variance of sample mean

E[(\bar{X} - \mu)^2] = \text{Var}(\bar{X}) = \text{Var}\left(\frac{1}{n} X_1 + \ldots + \frac{1}{n} X_n\right) = n \cdot \frac{1}{n^2} \text{Var}(X_i) = \frac{\sigma^2}{n}

Since the variance of the sample mean is \frac{\sigma^2}{n}, it decreases as the sample size n increases. This means that the sample mean represents the population mean more precisely as the sample size gets larger. We say that the sample mean (the estimator) converges in probability to the population mean (the population parameter).

Using R, I demonstrate how the sample mean behaves. I draw 10,000 samples of size n from a normal population with mean 50 and standard deviation 10, and calculate the sample mean of each. The figures below show the simulation results for different sample sizes n. You can see that the sample mean represents the population mean more accurately as n increases.


library(ggplot2)
library(gridExtra)

number_of_samples <- 10000
sample_means <- numeric(number_of_samples)
sample_size <- 100  # Sample size n; vary it to see how the spread of the sample mean changes

mu <- 50
sigma <- 10

# Generate sample means
for (i in 1:number_of_samples) {
  sample <- rnorm(sample_size, mean = mu, sd = sigma)
  sample_means[i] <- mean(sample)
}

# Create histograms and density plots
hist_plot <- ggplot(data = data.frame(sample_means), aes(x = sample_means)) +
  geom_histogram(bins = 36, color = "black", fill = "lightblue") +
  theme_minimal() +
  theme(axis.text = element_text(size = 14),
        axis.title = element_text(size = 20)) +
  labs(title = "Histogram of Sample Mean",
       x = "Sample Mean",
       y = "Count")

density_plot <- ggplot(data = data.frame(sample_means), aes(x = sample_means)) +
  geom_density(fill = "lightblue") +
  theme_minimal() +
  theme(axis.text = element_text(size = 14),
        axis.title = element_text(size = 20)) +
  labs(title = "Density of Sample Mean",
       x = "Sample Mean",
       y = "Density")

# Arrange the plots side by side
grid.arrange(hist_plot, density_plot, ncol = 2)
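
The simulation above can also be checked numerically rather than visually. The following self-contained sketch compares the empirical mean and variance of the simulated sample means with the theoretical values \mu and \frac{\sigma^2}{n}. The sample size of 25 and the seed are arbitrary choices made here for illustration.

```r
set.seed(123)                     # fixed seed so the sketch is reproducible
mu <- 50
sigma <- 10
n <- 25                           # sample size, an arbitrary choice
number_of_samples <- 10000

# Draw many samples of size n and keep the mean of each
sample_means <- replicate(number_of_samples,
                          mean(rnorm(n, mean = mu, sd = sigma)))

empirical_mean <- mean(sample_means)
empirical_var <- var(sample_means)
theoretical_var <- sigma^2 / n    # variance of the sample mean

print(c(empirical_mean = empirical_mean, mu = mu))
print(c(empirical_var = empirical_var, theoretical_var = theoretical_var))
```

Rerunning with a larger n should shrink both the theoretical and the empirical variance roughly in proportion to 1/n.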

2.1 Central Limit Theorem

Let the random variables X_1, X_2, \ldots, X_n denote a random sample from a population with mean \mu and variance \sigma^2. As n becomes large, the central limit theorem states that the distribution of Z=\frac{\sqrt{n}(\bar{X} - \mu)}{\sigma} approaches the standard normal distribution.

Z \xrightarrow{d} N(0,1)

In the simulation below, we draw samples from the exponential distribution and see how \sqrt{n}(\bar{X}-\mu)/\sigma behaves. The figures show the simulated distribution of \sqrt{n}(\bar{X}-\mu)/\sigma at different sample sizes. As n increases, the distribution approaches N(0,1), even though the population distribution is not normal.

# Load necessary libraries
library(ggplot2)
library(gridExtra)

# Parameters
number_of_samples <- 10000
sample_means <- numeric(number_of_samples)

sample_size <- 1000  # Choose a number to see how central limit theorem works

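mu <- 50
sigma <- mu  # For the exponential distribution, the standard deviation equals the mean

# Generate standardized sample means: sqrt(n) * (mean - mu) / sigma
for (i in 1:number_of_samples) {
  sample <- rexp(sample_size, rate = 1/mu)
  sample_means[i] <- (mean(sample) - mu) / sigma * sqrt(sample_size)
}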

# Create histogram plot
hist_plot <- ggplot(data = data.frame(sample_means), aes(x = sample_means)) +
  geom_histogram(bins = 36, color = "black", fill = "lightblue") +
  theme_minimal() +
  theme(axis.text = element_text(size = 14),
        axis.title = element_text(size = 20)) +
  labs(title = "Histogram of Standardized Sample Mean",
       x = "Standardized Sample Mean",
       y = "Count")

# Create density plot
density_plot <- ggplot(data = data.frame(sample_means), aes(x = sample_means)) +
  geom_density(fill = "lightblue") +
  theme_minimal() +
  theme(axis.text = element_text(size = 14),
        axis.title = element_text(size = 20)) +
  labs(title = "Density of Standardized Sample Mean",
       x = "Standardized Sample Mean",
       y = "Density")

# Arrange the plots side by side
grid.arrange(hist_plot, density_plot, ncol = 2)

2.2 Confidence Interval

Using the central limit theorem, we can calculate an interval within which the mean of the population \mu_X is likely to exist. This interval is called the confidence interval.

For a 100(1-\alpha)\% confidence interval, the lower value can be defined as:

\bar{X} - \frac{1}{\sqrt{n}} \sigma_X \cdot z_{\alpha}

The upper value can be defined as:

\bar{X} + \frac{1}{\sqrt{n}} \sigma_X \cdot z_{\alpha}

where z_{\alpha} is the (1-\alpha/2) quantile of the standard normal distribution.
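
As a minimal sketch, the interval can be computed in R with qnorm(), which returns quantiles of the normal distribution. The data are simulated here, and the population standard deviation sigma_x is assumed to be known; both are assumptions made only for illustration.

```r
set.seed(42)                      # fixed seed so the sketch is reproducible
sigma_x <- 10                     # population standard deviation, assumed known
x <- rnorm(100, mean = 50, sd = sigma_x)   # simulated sample, for illustration
n <- length(x)

alpha <- 0.05                     # gives a 95% confidence interval
z_alpha <- qnorm(1 - alpha / 2)   # (1 - alpha/2) quantile of N(0,1)

# Lower and upper values of the confidence interval
lower <- mean(x) - sigma_x / sqrt(n) * z_alpha
upper <- mean(x) + sigma_x / sqrt(n) * z_alpha

print(c(lower = lower, upper = upper))
```

A smaller alpha gives a larger z_alpha and therefore a wider interval, while a larger n makes the interval narrower.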