2022-06-01

Probability

  • Most of the statistical functionality in base R is collected in the stats package.

  • It provides simple functions which compute descriptive measures and facilitate computations involving a variety of probability distributions.

  • stats is part of the base distribution of R (installed by default), so there is no need to run install.packages("stats") or library("stats"). View the package's help index as follows:

library(help = "stats") 
  • Let's refresh some core concepts of probability theory using R (drawing random numbers, computing densities, probabilities, quantiles, and the like).

Random Variables

  • The mutually exclusive results of a random process are called the outcomes - only one of the possible outcomes can be observed.

  • The probability of an outcome is the proportion of times the outcome occurs in the long run (if the experiment is repeated many times).

  • The set of all possible outcomes of a random process is called the sample space.

  • An event is a subset of the sample space and consists of one or more outcomes.

  • Random variables can be discrete (taking on a countable set of values, e.g., 0 and 1) or continuous.
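As an illustration of the long-run-frequency idea above, we can simulate many repetitions of a random process and check that the observed proportion approaches the true probability (a minimal sketch; the object names are arbitrary):

# simulate 10,000 fair coin flips
set.seed(123)
flips <- sample(c("H", "T"), 10000, replace = TRUE)
# the long-run proportion of heads should be close to 0.5
mean(flips == "H")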

Probability Distributions (Discrete)

Example: rolling a die once.

  • This is nothing but randomly selecting a sample of size 1 from a set of numbers which are mutually exclusive outcomes.

  • The sample space is {1,2,3,4,5,6}, and we can think of many different events, e.g., 'the observed outcome lies between 1 and 3'.

  • Let's roll a die once:

sample(1:6, 1) 
## [1] 2

Probability Distributions (Discrete)

  • The probability distribution of a discrete random variable is the list of all possible values of the variable and their probabilities, which sum to 1.

  • The cumulative probability distribution function gives the probability that the random variable is less than or equal to a particular value.

Probability Distribution

# generate the vector of probabilities 
probability <- rep(1/6, 6) 
# plot the probabilities 
plot(probability, xlab = "outcomes", main = "Probability Distribution") 

Cumulative Probability Distribution

# generate the vector of cumulative probabilities 
cum_probability <- cumsum(probability) # cumulative sums of the vector probability
# plot the probabilities 
plot(cum_probability, xlab = "outcomes", main = "Cumulative Prob Dist") 

Continuous Random Variables

  • On the other hand, a continuous random variable can assume a continuum of values.

  • The probability density function (PDF), f(y), is a non-negative continuous function such that the area under f(y) between any two points a and b is the probability that Y assumes a value between a and b.

  • The total area under f(y) must always equal 1.
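We can verify this numerically in R: the area under a density between two points, computed with base R's integrate(), should match the probability obtained from the CDF (a small sketch using the standard normal):

# area under the standard normal PDF between a = -1 and b = 1
integrate(dnorm, lower = -1, upper = 1)
# the same probability from the CDF: about 0.683
pnorm(1) - pnorm(-1)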

The Normal Distribution

  • Arguably the most important probability distribution is the normal distribution.

  • Normal distributions are symmetric and bell-shaped. A normal distribution is characterized by its mean \(\mu\) and its standard deviation \(\sigma\) - \(N(\mu,\sigma)\).

  • The normal distribution has the PDF:

\[ f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(x-\mu)^{2}}{2\sigma^{2}}\right) \]

  • For the standard normal distribution we have \(\mu=0\) and \(\sigma=1\). Standard normal variates are often denoted by Z.
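As a sanity check, dnorm() should agree with this formula evaluated by hand (a minimal sketch; the test point x = 1 is arbitrary):

# evaluate the N(0,1) PDF manually at x = 1
x <- 1; mu <- 0; sigma <- 1
(1 / (sqrt(2 * pi) * sigma)) * exp(-(x - mu)^2 / (2 * sigma^2))
# compare with the built-in density function
dnorm(x, mean = mu, sd = sigma)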

The Normal Distribution

# create a sequence of points and evaluate the N(0,1) density at each
data <- seq(-3.5, 3.5, length.out = 100)
dens <- dnorm(data)  # avoid the name "dt", which would mask the t-density stats::dt
dens
##   [1] 0.000873 0.001115 0.001417 0.001793 0.002256 0.002826
##   [7] 0.003521 0.004365 0.005385 0.006610 0.008074 0.009812
##  [13] 0.011864 0.014275 0.017090 0.020358 0.024131 0.028459
##  [19] 0.033396 0.038995 0.045305 0.052373 0.060243 0.068949
##  [25] 0.078520 0.088974 0.100317 0.112541 0.125626 0.139533
##  [31] 0.154206 0.169572 0.185540 0.201999 0.218821 0.235862
##  [37] 0.252962 0.269949 0.286640 0.302845 0.318370 0.333023
##  [43] 0.346612 0.358957 0.369888 0.379250 0.386911 0.392758
##  [49] 0.396705 0.398693 0.398693 0.396705 0.392758 0.386911
##  [55] 0.379250 0.369888 0.358957 0.346612 0.333023 0.318370
##  [61] 0.302845 0.286640 0.269949 0.252962 0.235862 0.218821
##  [67] 0.201999 0.185540 0.169572 0.154206 0.139533 0.125626
##  [73] 0.112541 0.100317 0.088974 0.078520 0.068949 0.060243
##  [79] 0.052373 0.045305 0.038995 0.033396 0.028459 0.024131
##  [85] 0.020358 0.017090 0.014275 0.011864 0.009812 0.008074
##  [91] 0.006610 0.005385 0.004365 0.003521 0.002826 0.002256
##  [97] 0.001793 0.001417 0.001115 0.000873

The Normal Distribution

# draw a plot of the N(0,1) PDF
curve(dnorm(x),
      xlim = c(-3.5, 3.5),
      ylab = "Density", 
      main = "Standard Normal Density Function")

The Normal Distribution

# plot the standard normal CDF
curve(pnorm(x), 
      xlim = c(-3.5, 3.5), 
      ylab = "Probability", 
      main = "Standard Normal Cumulative Distribution Function")

More about the normal distribution

Learn more about pnorm(), dnorm(), qnorm(), and rnorm() on this website.
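A brief illustration of all four functions, using the standard normal defaults:

dnorm(0)      # density at 0: the peak of the bell curve, about 0.399
pnorm(1.96)   # P(Z <= 1.96), about 0.975
qnorm(0.975)  # quantile: the z-value with 97.5% of the mass below it, about 1.96
rnorm(3)      # draw three standard normal random numbers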

Moments

  • Moments provide important summaries of various aspects of distributions.
  • We focus on four important moments: mean, variance, skewness, and kurtosis.

Mean

  • For a discrete random variable, the expected value is computed as a weighted average of its possible outcomes, where the weights are the corresponding probabilities.

  • The mean measures the location, or central tendency, of y.

  • In the dice example, the random variable, D say, takes on 6 possible values \[d_{1}=1,d_{2}=2, ..., d_{6}=6.\]

  • The expected value: \[ E(D) = \frac{1}{6} \sum_{i=1}^{6}d_{i} = 3.5\]

  • For continuous random variables the idea is similar, but instead of summing over all possible values we integrate (see the sketch below).
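Both computations can be checked directly in R (a small sketch; the anonymous function is the integrand \(x f(x)\) for the standard normal):

# expected value of a die roll: a weighted average of the outcomes
sum((1:6) * (1/6))
# continuous analogue: E(Z) for the standard normal, approximately 0
integrate(function(x) x * dnorm(x), lower = -Inf, upper = Inf)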

Moments - Mean

  • An example of sampling with replacement is rolling a die three times in a row.
# set seed for reproducibility
set.seed(1) # be able to get the same result - can be any number
# rolling a dice three times in a row
sample(1:6, 3, replace = TRUE)
## [1] 1 4 1
# getting the mean of the sample
mean(sample(1:6, 3, replace = TRUE)) 
## [1] 3.33
  • As we increase the number of times we roll the die, the sample mean converges to its expected value - try it (see the sketch below)!
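One way to try it (a sketch of the suggested experiment; the numbers of rolls are arbitrary):

# the sample mean approaches E(D) = 3.5 as the number of rolls grows
set.seed(1)
mean(sample(1:6, 100, replace = TRUE))
mean(sample(1:6, 10000, replace = TRUE))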

Moments - Variance

  • Other moment measures are the variance and the standard deviation. Both are measures of the dispersion of a random variable around its mean.

\[ \sigma^{2}_{Y} = Var(Y) = E[(Y-\mu_{Y})^{2}] = \sum_{i=1}^{k}(y_{i}-\mu_{Y})^{2}p_{i} \] The standard deviation of Y is \(\sigma_{Y}\), the square root of the variance. The units of the standard deviation are the same as the units of Y.
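For the die example, the variance formula can be evaluated directly (a small sketch; each probability is 1/6):

# Var(D) for a fair die: squared deviations from the mean, weighted by probabilities
outcomes <- 1:6
mu <- sum(outcomes * (1/6))      # 3.5
sum((outcomes - mu)^2 * (1/6))   # 35/12, about 2.92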

Moments - Skewness

  • The skewness of y is its expected cubed deviation from its mean, scaled by \(\sigma^{3}\):

\[ S = \frac{E(y-\mu)^{3}}{\sigma^{3}} \]

  • It measures the amount of asymmetry in a distribution. The larger its absolute value, the more asymmetric the distribution.

  • A large positive value indicates a long right tail, and a large negative value indicates a long left tail.

  • A zero value indicates symmetry around the mean (as for the normal distribution).
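To see skewness in action, we can apply the formula to simulated samples (a sketch; skew() is an illustrative helper, not part of base R, and using sd() here is a large-sample approximation):

# skewness via the moment formula: E[(y - mu)^3] / sigma^3
skew <- function(y) mean((y - mean(y))^3) / sd(y)^3
set.seed(42)
skew(rnorm(1e5))  # symmetric distribution: close to 0
skew(rexp(1e5))   # long right tail: clearly positive (about 2)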

Moments - Kurtosis

  • The kurtosis of y is the expected fourth power of the deviation of y from its mean, scaled by \(\sigma^{4}\):

\[ K = \frac{E(y-\mu)^{4}}{\sigma^{4}} \]

  • It measures the thickness of the tails of a distribution.

  • K>3 indicates "fat tails" or leptokurtosis, relative to the normal distribution.

  • Extreme events are more likely to occur than would be the case under normality.
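The same idea works for kurtosis (a sketch; kurt() is an illustrative helper, and the t-distribution with 5 degrees of freedom has true kurtosis 9):

# kurtosis via the moment formula: E[(y - mu)^4] / sigma^4
kurt <- function(y) mean((y - mean(y))^4) / sd(y)^4
set.seed(42)
kurt(rnorm(1e5))       # normal: close to 3
kurt(rt(1e5, df = 5))  # fat tails: well above 3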

Statistics

  • So far we have reviewed aspects of a known population distribution of random variables.

  • However, most of the time we have a sample of data drawn from an unknown population distribution, and we want to learn from the sample:

\[ \{y_{t}\}_{t=1}^{T} \sim f(y) \]

  • To do so, we use various estimators.

  • We can obtain estimators by replacing population expectations with sample averages.

Sample Estimators

  • The sample average is simply the arithmetic average:

\[ \bar{y} = \frac{1}{T}\sum_{t=1}^{T}y_{t} \]

  • It provides an empirical measure of the location of y

  • The sample variance is the average squared deviation from the sample mean:

\[ \hat{\sigma}^{2} = \frac{\sum_{t=1}^{T}(y_{t}-\bar{y})^2}{T} \]

Sample Estimators

  • Sample variance:

\[ \hat{\sigma}^{2} = \frac{\sum_{t=1}^{T}(y_{t}-\bar{y})^2}{T} \]

  • We usually use a different version of \(\hat{\sigma}^{2}\), which corrects for the 1 degree of freedom used in the estimation of \(\bar{y}\), to obtain an unbiased estimator of \(\sigma^{2}\):

\[ S^{2} = \frac{\sum_{t=1}^{T}(y_{t}-\bar{y})^2}{T-1} \]
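Note that base R's var() already applies the T - 1 correction, which we can verify (a quick sketch on simulated data):

set.seed(1)
y <- rnorm(50)
sum((y - mean(y))^2) / (length(y) - 1)  # S^2, the unbiased estimator
var(y)                                  # identical: var() divides by T - 1
sum((y - mean(y))^2) / length(y)        # the biased version dividing by T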

Sample Estimators

  • The sample skewness is:

\[ \hat{S} = \frac{1}{T} \times \frac{\sum_{t=1}^{T}(y_{t}-\bar{y})^3}{\hat{\sigma}^{3}} \]

  • It provides an empirical measure of the amount of asymmetry in the distribution of y.

  • The sample kurtosis:

\[ \hat{K} = \frac{1}{T} \times \frac{\sum_{t=1}^{T}(y_{t}-\bar{y})^4}{\hat{\sigma}^{4}} \]

  • It provides an empirical measure of the fatness of the tails of the distribution of y relative to a normal distribution.
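Both estimators take only a few lines to code by hand (a sketch; samp_skew() and samp_kurt() are illustrative helpers using the biased \(\hat{\sigma}\) that divides by T):

samp_skew <- function(y) {
  n <- length(y)                       # T in the formulas above
  s <- sqrt(sum((y - mean(y))^2) / n)  # biased standard deviation estimate
  sum((y - mean(y))^3) / n / s^3
}
samp_kurt <- function(y) {
  n <- length(y)
  s <- sqrt(sum((y - mean(y))^2) / n)
  sum((y - mean(y))^4) / n / s^4
}

Package functions may scale these differently; in particular, the kurtosis() used in the FB example below reports excess kurtosis (K - 3) by default, which is why its value can be negative.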

Stocks Prices

# the tidyquant package contains the FANG dataset of daily historical stock prices
library(tidyquant)
library(dplyr)    # filter(), select(), summarize()
library(ggplot2)  # ggplot() for the histogram at the end

data1 <- FANG
head(data1)
## # A tibble: 6 × 8
##   symbol date        open  high   low close  volume adjusted
##   <chr>  <date>     <dbl> <dbl> <dbl> <dbl>   <dbl>    <dbl>
## 1 FB     2013-01-02  27.4  28.2  27.4  28    6.98e7     28  
## 2 FB     2013-01-03  27.9  28.5  27.6  27.8  6.31e7     27.8
## 3 FB     2013-01-04  28.0  28.9  27.8  28.8  7.27e7     28.8
## 4 FB     2013-01-07  28.7  29.8  28.6  29.4  8.38e7     29.4
## 5 FB     2013-01-08  29.5  29.6  28.9  29.1  4.59e7     29.1
## 6 FB     2013-01-09  29.7  30.6  29.5  30.6  1.05e8     30.6

Stocks Prices - Example

# Let's check the 4 moments of the FB stock
# first, filter for FB stocks and keep the adjusted price
fb <- filter(FANG, symbol == "FB") %>%
  select(adjusted)

# mean
avg <- fb %>%
  summarize(Mean = mean(adjusted))
avg
## # A tibble: 1 × 1
##    Mean
##   <dbl>
## 1  77.5
# variance
vr <- fb %>%
  summarize(Variance = var(adjusted))
vr
## # A tibble: 1 × 1
##   Variance
##      <dbl>
## 1     972.

Stocks Prices - Example

# kurtosis
kut <- fb %>%
  summarize(Kurtosis = kurtosis(adjusted))
kut
## # A tibble: 1 × 1
##   Kurtosis
##      <dbl>
## 1   -0.998
# skewness
skw <- fb %>%
  summarize(Skewness = skewness(adjusted))
skw
## # A tibble: 1 × 1
##   Skewness
##      <dbl>
## 1   -0.115
  • Do FB stock prices follow a normal distribution?

Stocks Prices - Example

ggplot(fb, aes(x=adjusted)) + geom_histogram()
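A quick follow-up check is a normal Q-Q plot from base R; strong curvature away from the reference line suggests the prices are not normally distributed:

# Q-Q plot of adjusted prices against normal quantiles
qqnorm(fb$adjusted)
qqline(fb$adjusted)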