R02-STA1511

Random Variable

A Random Variable is a numerical description of the results of an experiment.

Discrete Random Variable

- A discrete random variable is a random variable with distinct (separated) outcomes.

- A discrete random variable typically takes integer values; the set of possible values can be finite or countably infinite.

- Example
  
  X: The number of car accidents in a city.
  
  Y: The number of customers who come to a bank
  

Continuous Random Variable

- A continuous random variable can take any value (including decimal numbers) in a certain interval.

- Example:
  
  X: The depth of drilling to find oil

  Y: The weight of a truck in a truck-weighing station
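
As a rough illustration of the difference between the two kinds of random variable, the sketch below simulates one of each. The Poisson rate of 5 accidents, the uniform depth range of 0 to 3000, and the seed are made-up values chosen only for illustration; they do not come from the course material.

```r
# Illustrative values only (assumptions, not from the course material)
set.seed(1511)

# Discrete: number of car accidents, simulated here as Poisson counts
accidents <- rpois(n = 10, lambda = 5)

# Continuous: drilling depth, simulated here as uniform on an interval
depth <- runif(n = 10, min = 0, max = 3000)

accidents   # whole numbers only
depth       # any value in [0, 3000], decimals included

# Counts equal their rounded values; continuous draws generally do not
all(accidents == round(accidents))
any(depth != round(depth))
```

The choice of Poisson and uniform distributions is arbitrary; any discrete and any continuous distribution would show the same contrast.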

Distribution of Discrete Random Variable

1. Binomial Distribution

Random variable \(X\) is distributed \(X∼b(n,p)\) with mean \(μ=np\) and variance \(σ^{2}=np(1−p)\) if \(X\) is the count of successful events in \(n\) identical and independent Bernoulli trials with constant probability of success \(p\).

  • R function dbinom(x, size, prob) is the probability of x successes in size trials when the probability of success is prob.

  • R function pbinom(q, size, prob, lower.tail) is the cumulative probability at q successes: lower.tail = TRUE gives the left tail P(X <= q), lower.tail = FALSE gives the right tail P(X > q).

  • R function rbinom(n, size, prob) returns n random numbers from the binomial distribution x~b(size,prob).

Example of use:

  1. What is the probability of at most 5 heads in 10 coin flips when the probability of heads is 0.3?
# sum the individual probabilities P(X = 0), ..., P(X = 5)
sum(dbinom(0:5, size = 10, prob = 0.3))
## [1] 0.952651
# same result from the cumulative distribution function
pbinom(q = 5, size = 10, prob = 0.3, lower.tail = TRUE)
## [1] 0.952651
library(dplyr)
library(ggplot2)

data.frame(heads = 0:10, 
           pmf = dbinom(x = 0:10, size = 10, prob = 0.3),
           cdf = pbinom(q = 0:10, size = 10, prob = 0.3, 
                        lower.tail = TRUE)) %>%
  mutate(Heads = ifelse(heads <= 5, "<=5", "other")) %>%
ggplot(aes(x = factor(heads), y = cdf, fill = Heads)) +
  geom_col() +
  geom_text(
    aes(label = round(cdf,2), y = cdf + 0.01),
    position = position_dodge(0.9),
    size = 3,
    vjust = 0
  ) +
  labs(title = "Probability of X <= 5 successes.",
       subtitle = "b(10, .3)",
       x = "Successes (x)",
       y = "probability") 

  2. What is the expected number and variance of heads in 25 coin flips when the probability of heads is 0.3?

# expected number: E(X) = n * p
p <- 0.3
n <- 25

E <- n * p
E
## [1] 7.5
# variance: Var(X) = n * p * q
q <- 1 - p  # probability of failure

variance <- n * p * q
variance
## [1] 5.25
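
The formulas \(μ=np\) and \(σ^{2}=np(1−p)\) can also be checked directly from the probability mass function. The following is a minimal sketch of that check for the same 25 coin flips; the variable names are introduced here for illustration.

```r
n <- 25
p <- 0.3
x <- 0:n                               # every possible number of heads
pmf <- dbinom(x, size = n, prob = p)   # P(X = x) for each x

mean_from_pmf <- sum(x * pmf)                      # E(X) = sum of x * P(X = x)
var_from_pmf  <- sum((x - mean_from_pmf)^2 * pmf)  # Var(X) = E[(X - mu)^2]

all.equal(mean_from_pmf, n * p)            # agrees with np
all.equal(var_from_pmf, n * p * (1 - p))   # agrees with np(1 - p)
```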

2. Poisson Distribution

Random variable \(X\) is distributed \(X∼P(λ)\) with mean \(μ=λ\) and variance \(σ^{2}=λ\) if \(X\) is the number of successes in \(n\) (many) trials when the probability of success \(λ/n\) is small.
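
To see how the Poisson distribution arises as a limit of the binomial, the sketch below compares the two probability mass functions for a large number of trials and a small probability of success. The values \(λ=3\) and \(n=1000\) are assumptions chosen for illustration.

```r
lambda <- 3
n <- 1000          # many trials (assumed value)
p <- lambda / n    # small probability of success
x <- 0:8

comparison <- data.frame(
  x = x,
  binomial = dbinom(x, size = n, prob = p),
  poisson  = dpois(x, lambda = lambda)
)
comparison$abs_diff <- abs(comparison$binomial - comparison$poisson)
comparison

max(comparison$abs_diff)   # small when n is large and p is small
```

Increasing n while holding lambda fixed should shrink the differences further.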

  • R function dpois(x, lambda) is the probability of x successes in a period when the expected number of events is lambda.

  • R function ppois(q, lambda, lower.tail) is the cumulative probability at q successes: lower.tail = TRUE gives the left tail P(X <= q), lower.tail = FALSE gives the right tail P(X > q).

  • R function rpois(n, lambda) returns n random numbers from the Poisson distribution x ~ P(lambda).

  • R function qpois(p, lambda, lower.tail) returns the value (quantile) at the specified cumulative probability (percentile) p.

Example of use:

What is the probability of making 2 to 4 sales in a week if the average sales rate is 3 per week?

# Using exact probability
dpois(x = 2, lambda = 3) +
  dpois(x = 3, lambda = 3) +
  dpois(x = 4, lambda = 3)
## [1] 0.616115
# using ppois: P(2 <= X <= 4) = P(X <= 4) - P(X <= 1)
ppois(4, lambda = 3, lower.tail = TRUE) - ppois(1, lambda = 3, lower.tail = TRUE)
## [1] 0.616115
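
The quantile function qpois() works in the opposite direction to ppois(). As a small sketch using the same sales rate of 3 per week, the code below finds the smallest number of sales whose cumulative probability reaches 0.90; the 0.90 level is an arbitrary choice for illustration.

```r
lambda <- 3

# Smallest number of sales q with P(X <= q) >= 0.90
q90 <- qpois(p = 0.90, lambda = lambda, lower.tail = TRUE)
q90

# Check the cumulative probabilities on either side of q90
ppois(q90, lambda = lambda)       # at least 0.90
ppois(q90 - 1, lambda = lambda)   # below 0.90
```

Because the Poisson distribution is discrete, the cumulative probability at the returned quantile is usually larger than p rather than exactly equal to it.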

Distribution of Continuous Random Variable

1. The F Distribution

The F distribution has numerous applications. The F test is used to test whether two populations have equal variances, \(H_{0}: \sigma_A^{2} = \sigma_B^{2}\).

Like the chi-square distribution, the F distribution takes only non-negative values and is not symmetric. There is a different F distribution for each pair of degrees of freedom associated with \(s_A^{2}\) and \(s_B^{2}\).

  • R function df(x, df1, df2) is the density of F at x when the degrees of freedom are df1 and df2.

  • R function pf(q, df1, df2, lower.tail) is the cumulative probability at value q: lower.tail = TRUE gives the left tail P(F <= q), lower.tail = FALSE gives the right tail P(F > q).

  • R function qf(p, df1, df2, lower.tail) is the value of x at cumulative probability p (lower.tail = TRUE).

  • R function rf(n, df1, df2) returns n random numbers from the F distribution.

library(dplyr)
library(ggplot2)
library(tidyr)
data.frame(f = 0:1000 / 100) %>% 
           mutate(df_10_20 = df(x = f, df1 = 10, df2 = 20),
                  df_05_10 = df(x = f, df1 = 5, df2 = 10)) %>%
  gather(key = "df", value = "density", -f) %>%
ggplot() +
  geom_line(aes(x = f, y = density, color = df)) +
  labs(title = "F at Various Degrees of Freedom",
       x = "F",
       y = "Density") 
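
A sketch of how pf() and qf() might be used in an F test of equal variances is given below. The sample variances and sample sizes are hypothetical numbers invented for illustration, and the variable names are not from the course material.

```r
# Hypothetical sample summaries (assumed for illustration)
s2_A <- 4.0; n_A <- 11   # sample variance and sample size, group A
s2_B <- 2.0; n_B <- 21   # sample variance and sample size, group B

f_stat <- s2_A / s2_B    # ratio of sample variances
df1 <- n_A - 1           # numerator degrees of freedom
df2 <- n_B - 1           # denominator degrees of freedom

# Right-tail probability P(F > f_stat) under H0
p_right <- pf(q = f_stat, df1 = df1, df2 = df2, lower.tail = FALSE)
p_right

# Critical value that leaves 5% in the right tail
f_crit <- qf(p = 0.05, df1 = df1, df2 = df2, lower.tail = FALSE)
f_crit

# qf() undoes pf(): this recovers f_stat
f_back <- qf(p = p_right, df1 = df1, df2 = df2, lower.tail = FALSE)
f_back
```

For raw data rather than summary statistics, base R's var.test() carries out the same comparison of two variances.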

2. Chi-Square Distribution

The chi-square distribution has numerous applications. The chi-square test of population variance tests whether a hypothesized population variance is plausible. The chi-square goodness of fit test tests goodness of fit in categorical data analysis, and the chi-square test of independence tests whether two categorical variables are independent.

  • R function dchisq(x, df) is the density of χ2 at x when the degrees of freedom is df.

  • R function pchisq(q, df, lower.tail) is the cumulative probability at value q: lower.tail = TRUE gives the left tail P(χ2 <= q), lower.tail = FALSE gives the right tail P(χ2 > q).

  • R function rchisq(n, df) returns n random numbers from the chi-square distribution.

  • R function qchisq(p, df, lower.tail) is the value of x at cumulative probability p (lower.tail = TRUE).

library(dplyr)
library(ggplot2)
library(tidyr)

data.frame(chisq = 0:7000 / 100) %>% 
           mutate(df_05 = dchisq(x = chisq, df = 5),
                  df_15 = dchisq(x = chisq, df = 15),
                  df_30 = dchisq(x = chisq, df = 30)) %>%
  gather(key = "df", value = "density", -chisq) %>%
ggplot() +
  geom_line(aes(x = chisq, y = density, color = df)) +
  labs(title = "Chi-Square at Various Degrees of Freedom",
       x = "Chi-square",
       y = "Density") 
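
The functions qchisq() and pchisq() are inverses of each other, which the short sketch below illustrates for 5 degrees of freedom at the 0.95 level (both values chosen only as an example). It also checks numerically that the mean of a chi-square distribution equals its degrees of freedom.

```r
df <- 5

# Critical value with 5% of the distribution to its right
crit <- qchisq(p = 0.95, df = df, lower.tail = TRUE)
crit

# pchisq() undoes qchisq(): the left-tail probability at crit is 0.95
p_back <- pchisq(q = crit, df = df, lower.tail = TRUE)
p_back

# Mean of the distribution: integrate x * density over [0, Inf)
chisq_mean <- integrate(function(x) x * dchisq(x, df = df),
                        lower = 0, upper = Inf)$value
chisq_mean   # approximately equal to df
```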

3. Normal Distribution

  • R function dnorm(x, mean, sd) is the density at x when the mean is mean and the standard deviation is sd.

  • R function pnorm(q, mean, sd, lower.tail) is the cumulative probability at value q: lower.tail = TRUE gives the left tail P(X <= q), lower.tail = FALSE gives the right tail P(X > q).

  • R function rnorm(n, mean, sd) returns n random numbers from the normal distribution \(X∼N(\mu, \sigma^{2})\).

  • R function qnorm(p, mean, sd, lower.tail) is the value of x at cumulative probability p (lower.tail = TRUE).

Example of use:

  1. IQ scores are distributed \(X∼N(100,144)\). What is the probability that a randomly selected person’s IQ is below 90?
mean1 = 100
sd1 = sqrt(144)
x1 = 90

# exact
pnorm(q = x1, mean = mean1, sd = sd1, lower.tail = TRUE)
## [1] 0.2023284
  2. IQ scores are distributed \(X∼N(100,16^{2})\). What is the probability that a randomly selected person’s IQ is between 92 and 114?
my_mean = 100
my_sd = 16
my_x_l = 92
my_x_h = 114
# exact
probab2<-pnorm(q = my_x_h, mean = my_mean, sd = my_sd, lower.tail = TRUE) -
  pnorm(q = my_x_l, mean = my_mean, sd = my_sd, lower.tail = TRUE)

probab2
## [1] 0.5006755
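
Two further points may be worth a quick check in R: a normal probability can be computed either directly or by standardising to \(Z=(X−μ)/σ\), and qnorm() reverses pnorm(). The sketch below reuses the \(N(100,144)\) setting from the first example; the 0.90 level is an arbitrary choice for illustration.

```r
mu <- 100
sigma <- 12   # sqrt(144), as in the first example

# Standardising: P(X < 90) equals P(Z < (90 - mu) / sigma)
z <- (90 - mu) / sigma
p_direct   <- pnorm(q = 90, mean = mu, sd = sigma)
p_standard <- pnorm(q = z)   # defaults: mean = 0, sd = 1
all.equal(p_direct, p_standard)

# qnorm() goes the other way: the IQ score below which 90% of people fall
iq_90 <- qnorm(p = 0.90, mean = mu, sd = sigma)
iq_90
pnorm(q = iq_90, mean = mu, sd = sigma)   # recovers 0.90
```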

EXERCISE

  1. Use this data (download here) to make a box plot, a pie chart, and a histogram.

  2. Let y be a normal random variable with \(\mu=500\) and \(\sigma^{2}=100\) . Find the following probabilities:

  • P(500<y<665)

  • P(y>665)

  • P(y<500)

  3. Suppose the probability that a drug produces a certain side effect is p = 0.3% and n = 3000 patients in a clinical trial receive the drug. What is the probability that 4 people experience the side effect?

  4. The average number of car accidents in a city per month is 5.

    • What is the probability that there will be at most 2 accidents for the next month?

    • What is the expected value of the number of accidents in 1 year?