2022-06-01

Probability

  • Most of the statistical functionality in base R is collected in the stats package.

  • It provides simple functions which compute descriptive measures and facilitate computations involving a variety of probability distributions.

  • stats is part of the base distribution of R (installed by default), so there is no need to run install.packages("stats") or library("stats"). View the package's help index as follows:

library(help = "stats") 
  • Let's refresh some core concepts of probability theory using R (drawing random numbers, computing densities, probabilities, quantiles, and the like).

Random Variables

  • The mutually exclusive results of a random process are called the outcomes - only one of the possible outcomes can be observed.

  • The probability of an outcome is the proportion of times the outcome occurs in the long run (if the experiment is repeated many times).

  • The set of all possible outcomes of a random process is called the sample space.

  • An event is a subset of the sample space and consists of one or more outcomes.

  • Random variables can be discrete (taking on a countable set of values, e.g., 0 and 1) or continuous.
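As an illustration of the long-run-frequency idea above, we can simulate many repetitions of a random process and check that the observed proportion approaches the true probability (a minimal sketch; the object names are arbitrary):

# simulate 10,000 fair coin flips
set.seed(123)
flips <- sample(c("H", "T"), 10000, replace = TRUE)
# the long-run proportion of heads should be close to 0.5
mean(flips == "H")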

Probability Distributions (Discrete)

Example: rolling a die once.

  • This is nothing but randomly selecting a sample of size 1 from a set of numbers which are mutually exclusive outcomes.

  • The sample space is {1,2,3,4,5,6}, and we can think of many different events, e.g., 'the observed outcome lies between 1 and 3'.

  • Let's roll a die once:

sample(1:6, 1) 
## [1] 2

Probability Distributions (Discrete)

  • The probability distribution of a discrete random variable is the list of all possible values of the variable and their probabilities, which sum to 1.

  • The cumulative probability distribution function gives the probability that the random variable is less than or equal to a particular value.

Probability Distribution

# generate the vector of probabilities 
probability <- rep(1/6, 6) 
# plot the probabilities 
plot(probability, xlab = "outcomes", main = "Probability Distribution") 

Cumulative Probability Distribution

# generate the vector of cumulative probabilities 
cum_probability <- cumsum(probability) # cumulative sums of the vector probability
# plot the probabilities 
plot(cum_probability, xlab = "outcomes", main = "Cumulative Prob Dist") 

Continuous Random Variables

  • On the other hand, a continuous random variable can assume a continuum of values.

  • The probability density function (PDF), f(y), is a non-negative continuous function such that the area under f(y) between any two points a and b is the probability that Y assumes a value between a and b.

  • The total area under f(y) must always equal 1.
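We can verify this numerically in R: the area under a density between two points, computed with base R's integrate(), should match the probability obtained from the CDF (a small sketch using the standard normal):

# area under the standard normal PDF between a = -1 and b = 1
integrate(dnorm, lower = -1, upper = 1)
# the same probability from the CDF: about 0.683
pnorm(1) - pnorm(-1)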

The Normal Distribution

  • Arguably the most important probability distribution is the normal distribution.

  • Normal distributions are symmetric and bell-shaped. A normal distribution is characterized by its mean \(\mu\) and its standard deviation \(\sigma\) - \(N(\mu,\sigma)\).

  • The normal distribution has the PDF:

\[ f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(x-\mu)^{2}}{2\sigma^{2}}\right) \]

  • For the standard normal distribution we have \(\mu=0\) and \(\sigma=1\). Standard normal variates are often denoted by Z.
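As a sanity check, dnorm() should agree with this formula evaluated by hand (a minimal sketch; the test point x = 1 is arbitrary):

# evaluate the N(0,1) PDF manually at x = 1
x <- 1; mu <- 0; sigma <- 1
(1 / (sqrt(2 * pi) * sigma)) * exp(-(x - mu)^2 / (2 * sigma^2))
# compare with the built-in density function
dnorm(x, mean = mu, sd = sigma)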

The Normal Distribution

# create a sequence of points and evaluate the N(0,1) density at each
data <- seq(-3.5, 3.5, length.out = 100)
dens <- dnorm(data)  # avoid the name "dt", which would mask the t-density stats::dt
dens
##   [1] 0.000873 0.001115 0.001417 0.001793 0.002256 0.002826
##   [7] 0.003521 0.004365 0.005385 0.006610 0.008074 0.009812
##  [13] 0.011864 0.014275 0.017090 0.020358 0.024131 0.028459
##  [19] 0.033396 0.038995 0.045305 0.052373 0.060243 0.068949
##  [25] 0.078520 0.088974 0.100317 0.112541 0.125626 0.139533
##  [31] 0.154206 0.169572 0.185540 0.201999 0.218821 0.235862
##  [37] 0.252962 0.269949 0.286640 0.302845 0.318370 0.333023
##  [43] 0.346612 0.358957 0.369888 0.379250 0.386911 0.392758
##  [49] 0.396705 0.398693 0.398693 0.396705 0.392758 0.386911
##  [55] 0.379250 0.369888 0.358957 0.346612 0.333023 0.318370
##  [61] 0.302845 0.286640 0.269949 0.252962 0.235862 0.218821
##  [67] 0.201999 0.185540 0.169572 0.154206 0.139533 0.125626
##  [73] 0.112541 0.100317 0.088974 0.078520 0.068949 0.060243
##  [79] 0.052373 0.045305 0.038995 0.033396 0.028459 0.024131
##  [85] 0.020358 0.017090 0.014275 0.011864 0.009812 0.008074
##  [91] 0.006610 0.005385 0.004365 0.003521 0.002826 0.002256
##  [97] 0.001793 0.001417 0.001115 0.000873

The Normal Distribution

# draw a plot of the N(0,1) PDF
curve(dnorm(x),
      xlim = c(-3.5, 3.5),
      ylab = "Density", 
      main = "Standard Normal Density Function")

The Normal Distribution

# plot the standard normal CDF
curve(pnorm(x), 
      xlim = c(-3.5, 3.5), 
      ylab = "Probability", 
      main = "Standard Normal Cumulative Distribution Function")

More about the normal distribution

Learn more about pnorm(), dnorm(), qnorm(), and rnorm() on this website.
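A brief illustration of all four functions, using the standard normal defaults:

dnorm(0)      # density at 0: the peak of the bell curve, about 0.399
pnorm(1.96)   # P(Z <= 1.96), about 0.975
qnorm(0.975)  # quantile: the z-value with 97.5% of the mass below it, about 1.96
rnorm(3)      # draw three standard normal random numbers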

Moments

  • Moments provide important summaries of various aspects of distributions.
  • We focus on four important moments: mean, variance, skewness, and kurtosis.

Mean

  • For a discrete random variable, the expected value is computed as a weighted average of its possible outcomes, where the weights are the corresponding probabilities.

  • The mean measures the location, or central tendency, of y.

  • In the dice example, the random variable, D say, takes on 6 possible values \[d_{1}=1,d_{2}=2, ..., d_{6}=6.\]

  • The expected value: \[ E(D) = \frac{1}{6} \sum_{i=1}^{6}d_{i} = 3.5\]

  • For continuous random variables the idea is similar, but instead of summing over all possible values we integrate (see the sketch below).
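Both computations can be checked directly in R (a small sketch; the anonymous function is the integrand \(x f(x)\) for the standard normal):

# expected value of a die roll: a weighted average of the outcomes
sum((1:6) * (1/6))
# continuous analogue: E(Z) for the standard normal, approximately 0
integrate(function(x) x * dnorm(x), lower = -Inf, upper = Inf)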

Moments - Mean

  • An example of sampling with replacement is rolling a die three times in a row.
# set seed for reproducibility
set.seed(1) # be able to get the same result - can be any number
# rolling a dice three times in a row
sample(1:6, 3, replace = TRUE)
## [1] 1 4 1
# getting the mean of the sample
mean(sample(1:6, 3, replace = TRUE)) 
## [1] 3.33
  • As we increase the number of times we roll the die, the sample mean converges to its expected value - try it (see the sketch below)!
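One way to try it (a sketch of the suggested experiment; the numbers of rolls are arbitrary):

# the sample mean approaches E(D) = 3.5 as the number of rolls grows
set.seed(1)
mean(sample(1:6, 100, replace = TRUE))
mean(sample(1:6, 10000, replace = TRUE))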

Moments - Variance

  • Other moment measures are the variance and the standard deviation. Both are measures of the dispersion of a random variable around its mean.

\[ \sigma^{2}_{Y} = Var(Y) = E[(Y-\mu_{Y})^{2}] = \sum_{i=1}^{k}(y_{i}-\mu_{Y})^{2}p_{i} \] The standard deviation of Y is \(\sigma_{Y}\), the square root of the variance. The units of the standard deviation are the same as the units of Y.
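For the die example, the variance formula can be evaluated directly (a small sketch; each probability is 1/6):

# Var(D) for a fair die: squared deviations from the mean, weighted by probabilities
outcomes <- 1:6
mu <- sum(outcomes * (1/6))      # 3.5
sum((outcomes - mu)^2 * (1/6))   # 35/12, about 2.92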

Moments - Skewness

  • The skewness of y is its expected cubed deviation from its mean, scaled by \(\sigma^{3}\):

\[ S = \frac{E(y-\mu)^{3}}{\sigma^{3}} \]

  • It measures the amount of asymmetry in a distribution. The larger its absolute value, the more asymmetric the distribution.

  • A large positive value indicates a long right tail, and a large negative value indicates a long left tail.

  • A zero value indicates symmetry around the mean (as for the normal distribution).
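To see skewness in action, we can apply the formula to simulated samples (a sketch; skew() is an illustrative helper, not part of base R, and using sd() here is a large-sample approximation):

# skewness via the moment formula: E[(y - mu)^3] / sigma^3
skew <- function(y) mean((y - mean(y))^3) / sd(y)^3
set.seed(42)
skew(rnorm(1e5))  # symmetric distribution: close to 0
skew(rexp(1e5))   # long right tail: clearly positive (about 2)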

Moments - Kurtosis

  • The kurtosis of y is the expected fourth power of the deviation of y from its mean, scaled by \(\sigma^{4}\):

\[ K = \frac{E(y-\mu)^{4}}{\sigma^{4}} \]

  • It measures the thickness of the tails of a distribution.

  • K>3 indicates "fat tails" or leptokurtosis, relative to the normal distribution.

  • Extreme events are more likely to occur than would be the case under normality.
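The same idea works for kurtosis (a sketch; kurt() is an illustrative helper, and the t-distribution with 5 degrees of freedom has true kurtosis 9):

# kurtosis via the moment formula: E[(y - mu)^4] / sigma^4
kurt <- function(y) mean((y - mean(y))^4) / sd(y)^4
set.seed(42)
kurt(rnorm(1e5))       # normal: close to 3
kurt(rt(1e5, df = 5))  # fat tails: well above 3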

Statistics

  • So far we have reviewed aspects of a known population distribution of random variables.

  • However, most of the time we have a sample of data drawn from an unknown population distribution, and we want to learn from the sample:

\[ \{y_{t}\}_{t=1}^{T} \sim f(y) \]

  • To do so, we use various estimators.

  • We can obtain estimators by replacing population expectations with sample averages.

Sample Estimators

  • The sample average is simply the arithmetic average:

\[ \bar{y} = \frac{1}{T}\sum_{t=1}^{T}y_{t} \]

  • It provides an empirical measure of the location of y

  • The sample variance is the average squared deviation from the sample mean:

\[ \hat{\sigma}^{2} = \frac{\sum_{t=1}^{T}(y_{t}-\bar{y})^2}{T} \]

Sample Estimators

  • Sample variance:

\[ \hat{\sigma}^{2} = \frac{\sum_{t=1}^{T}(y_{t}-\bar{y})^2}{T} \]

  • We usually use a different version of \(\hat{\sigma}^{2}\), which corrects for the 1 degree of freedom used in the estimation of \(\bar{y}\), to obtain an unbiased estimator of \(\sigma^{2}\):

\[ S^{2} = \frac{\sum_{t=1}^{T}(y_{t}-\bar{y})^2}{T-1} \]
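Note that base R's var() already applies the T - 1 correction, which we can verify (a quick sketch on simulated data):

set.seed(1)
y <- rnorm(50)
sum((y - mean(y))^2) / (length(y) - 1)  # S^2, the unbiased estimator
var(y)                                  # identical: var() divides by T - 1
sum((y - mean(y))^2) / length(y)        # the biased version dividing by T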

Sample Estimators

  • The sample skewness is:

\[ \hat{S} = \frac{1}{T} \times \frac{\sum_{t=1}^{T}(y_{t}-\bar{y})^3}{\hat{\sigma}^{3}} \]

  • It provides an empirical measure of the amount of asymmetry in the distribution of y.

  • The sample kurtosis:

\[ \hat{K} = \frac{1}{T} \times \frac{\sum_{t=1}^{T}(y_{t}-\bar{y})^4}{\hat{\sigma}^{4}} \]

  • It provides an empirical measure of the fatness of the tails of the distribution of y relative to a normal distribution.
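Both estimators take only a few lines to code by hand (a sketch; samp_skew() and samp_kurt() are illustrative helpers using the biased \(\hat{\sigma}\) that divides by T):

samp_skew <- function(y) {
  n <- length(y)                       # T in the formulas above
  s <- sqrt(sum((y - mean(y))^2) / n)  # biased standard deviation estimate
  sum((y - mean(y))^3) / n / s^3
}
samp_kurt <- function(y) {
  n <- length(y)
  s <- sqrt(sum((y - mean(y))^2) / n)
  sum((y - mean(y))^4) / n / s^4
}

Package functions may scale these differently; in particular, the kurtosis() used in the FB example below reports excess kurtosis (K - 3) by default, which is why its value can be negative.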

Stocks Prices

# the tidyquant package contains the FANG dataset of daily historical stock prices
library(tidyquant)
library(dplyr)    # filter(), select(), summarize()
library(ggplot2)  # ggplot() for the histogram at the end

data1 <- FANG
head(data1)
## # A tibble: 6 × 8
##   symbol date        open  high   low close  volume adjusted
##   <chr>  <date>     <dbl> <dbl> <dbl> <dbl>   <dbl>    <dbl>
## 1 FB     2013-01-02  27.4  28.2  27.4  28    6.98e7     28  
## 2 FB     2013-01-03  27.9  28.5  27.6  27.8  6.31e7     27.8
## 3 FB     2013-01-04  28.0  28.9  27.8  28.8  7.27e7     28.8
## 4 FB     2013-01-07  28.7  29.8  28.6  29.4  8.38e7     29.4
## 5 FB     2013-01-08  29.5  29.6  28.9  29.1  4.59e7     29.1
## 6 FB     2013-01-09  29.7  30.6  29.5  30.6  1.05e8     30.6

Stocks Prices - Example

# Let's check the 4 moments of the FB stock
# first, filter for FB stocks and keep the adjusted price
fb <- filter(FANG, symbol == "FB") %>%
  select(adjusted)

# mean
avg <- fb %>%
  summarize(Mean = mean(adjusted))
avg
## # A tibble: 1 × 1
##    Mean
##   <dbl>
## 1  77.5
# variance
vr <- fb %>%
  summarize(Variance = var(adjusted))
vr
## # A tibble: 1 × 1
##   Variance
##      <dbl>
## 1     972.

Stocks Prices - Example

# kurtosis
kut <- fb %>%
  summarize(Kurtosis = kurtosis(adjusted))
kut
## # A tibble: 1 × 1
##   Kurtosis
##      <dbl>
## 1   -0.998
# skewness
skw <- fb %>%
  summarize(Skewness = skewness(adjusted))
skw
## # A tibble: 1 × 1
##   Skewness
##      <dbl>
## 1   -0.115
  • Do FB stock prices follow a normal distribution?

Stocks Prices - Example

ggplot(fb, aes(x=adjusted)) + geom_histogram()
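A quick follow-up check is a normal Q-Q plot from base R; strong curvature away from the reference line suggests the prices are not normally distributed:

# Q-Q plot of adjusted prices against normal quantiles
qqnorm(fb$adjusted)
qqline(fb$adjusted)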