Probability basics

Events

Probability is the likelihood of an outcome. e.g. {any combination of two dice}
Event is a collection of outcomes. e.g. {both dice show the same face}
The outcome space is all the possible outcomes. e.g. {all the possible outcomes die show}

Probability formulas

binomial variable

\[ P(E) = \frac{number\ of\ outcomes\ in\ E}{number\ of\ possible\ outcomes} \]

continuous varable

\[ P\begin{pmatrix}a\leq X \leq b \end{pmatrix} = \int_a^b f(x) dx \] Probabilities of continuous random variables (X) are the area under the curve. The probability of any value is always zero. when X = k,

\[ P\begin{pmatrix}X = k \end{pmatrix} = 0 \]

Calculation of probability (operations)

union probability, addition rule + \[ \begin{aligned} &P(A \cup B)=P(A)+P(B)-P(A \cap B) \\ &P(A \cup B)=P(A)+P(B) \quad \text { if } \mathrm{A} \text { and } \mathrm{B} \text { are mutually exclusive } \end{aligned} \] joint probability, multiple rule x \[ \begin{aligned} &P(A \cap B)=P(A \mid B) P(B)=P(B \mid A) P(A) \\ &P(A \cap B)=P(A) P(B) \quad \text { if } \mathrm{A} \text { and } \mathrm{B} \text { are independent } \end{aligned} \] Marginal Probability is without reference to any other event or events \[ P(A) or P(B) \]

conditional Probability \[ P(A\mid B)=\dfrac{P(A \: \cap\: B)}{P(B)} \] p-values are conditional probabilities.

Bayes’s theorem

multiple law \[ P(A \cap B) = P(A\ |\ B) P(B) = P(B\ |\ A) P(A) \]
bayes’s formula

\[ P(B_j\ |\ A) = \frac{P(A\ |\ B_j) P(B_j)}{P(A)}\ \]

law of total probability \(P(A)\)

\[ P(A) = P(A\ |\ B_1) P(B_1) + \cdots + P(A\ |\ B_n) P(B_n).\notag \]

Random variables and distribution functions

Random variable takes on different values determined by chance. we can use random variables’ mathematical (distribution) function to find their probability.

probability mass function (PMF, discrete), e.g. Binomial Distribution. upon some conditions are satisfied, the sampling distribution of the sample proportion is approximately normal.
Probability Density Function (PDF, continuous), e.g. normal distribution, t, chi-squre, f…
Cumulative Distribution Function (CDF).

Probability distribution

joint distribution, discrete variables \[ P\left(X=x_{i}, Y=y_{j}\right)=p_{i j}, i, j=1,2, \ldots \] Marginal distribution, discrete variables \[ P\left(X=x_{i}\right)=\sum_{j=1}^{\infty} p_{i j}=p_{i}, i=1,2, \ldots \] conditional probability, discrete variables \[ P\left(Y=y_{j} \mid X=x_{i}\right)=\frac{p_{i j}}{p_{i}}, j=1,2, \ldots \]

Conditional expectation

Conditional expectation is the mathematical expectation of a conditional distribution.

The discrete variable

\[ E(Y|X_i)=\sum_{i=1}^{N}{(Y_i|X_i)}\cdot p(Y_i|X_i) \]

The continuous variable

\[ E(Y|X)=\int{(Y|X)}\cdot g(Y|X)dY \]

expectation formula, discrete variable \[ \mu=E(X)=\sum x_if(x_i) \]

Conditional variance

variance formula, discrete variable \[ \sigma^2=\text{Var}(X)=\sum (x_i-\mu)^2f(x_i) \]

Conditional expectation and conditional variance exist and can be estimated by regression models.

Sampling

We make inferences about the population based on the sample (inference) after summarizing data (description).

The error is resulting from using a sample characteristic (statistic) to estimate a population characteristic (parameter).

Standard Error \[ SD(\bar{X})=SE(\bar{X})=\dfrac{\sigma}{\sqrt{n}} \]

Central limit theorem and law of large numbers

For a large sample size, x mean is approximately normally distributed, regardless of the distribution of the population one samples from. so, the population parameter can be estimated using the sample.
With large samples, The mean of the sampling distribution is very close to the population mean.

Confidence interval

The higher the confidence level, the wider the width of the interval and thus the poorer the precision.

\[ \text{point estimate }\pm M\times \hat{SE}(\text{estimate}) \] the margin of error \[ E=z_{\alpha/2}\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}} \]

Introduction to Hypothesis Testing

-Set up the hypotheses and decide on the significance level.

-Construct and calculate the test statistic.

-Calculate probability value (p-value).

-Make a decision and state an overall conclusion.

Probability R practice

pros: don’t need complicated probability theory, easy (simulation)
cons: hard to get the exact solution

Integrate

integrate(function(x){x^2},0,2)$value

## [1] 2.666667

Derivation

fxy = expression(2*x^2+y+3*x*y^2)
dxy = deriv(fxy, c("x", "y"), func = TRUE)
dxy(1,2)

## [1] 16
## attr(,"gradient")
##       x  y
## [1,] 16 13

Create random variables with specific distributions

 dnorm(0)# density at a number

## [1] 0.3989423

 pnorm(1.28)# cumulative possibility

## [1] 0.8997274

 qnorm(0.95)#  quantile

## [1] 1.644854

 rnorm(10)# random numbers

##  [1] -0.88088511  0.49039771  1.07386519  1.16713589 -1.01172581  0.70453443
##  [7] -0.84719449 -0.53398655  0.03326851  1.60801428

using covariance matrix to generate Gaussian multiple variables

 library(MASS)
 Sigma <- matrix(c(10,3,3,2),2,2)
 mvrnorm(n=20, rep(0, 2), Sigma)

##             [,1]        [,2]
##  [1,]  1.5187331  0.62621665
##  [2,]  1.1050862 -0.49826048
##  [3,] -1.9292320  0.04056535
##  [4,] -1.4664689  1.57938123
##  [5,]  0.2476428 -0.02949476
##  [6,]  4.1485181  1.22342349
##  [7,]  3.9925071  1.94807455
##  [8,] -1.2674697 -1.35126479
##  [9,] -1.6721104 -0.97817620
## [10,]  0.3848645 -0.19130117
## [11,]  1.3799726 -0.36814938
## [12,]  2.4352271  2.12898251
## [13,] -3.1106685 -0.95956219
## [14,]  1.3631895  1.59052657
## [15,] -1.6499431  0.22257692
## [16,]  3.2571223  0.80981086
## [17,]  2.1421184  0.96733243
## [18,] -3.5372718 -1.16167758
## [19,]  0.2839805  1.75396696
## [20,] -0.6402299 -2.44016838

Prob function

pnorm(1.96,    0,1)

## [1] 0.9750021

qnorm(0.025,    0,1)

## [1] -1.959964

pchisq(3.84,1,lower.tail=F)

## [1] 0.05004352

mean(rchisq(10000,1)>3.84) #simulation

## [1] 0.0482

Vector and operations

seq(1,10,  2)

## [1] 1 3 5 7 9

x=rep(1:3,6)
x

##  [1] 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3

y=rep(1:3, each = 6)
y

##  [1] 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3

x+y

##  [1] 2 3 4 2 3 4 3 4 5 3 4 5 4 5 6 4 5 6

x-y

##  [1]  0  1  2  0  1  2 -1  0  1 -1  0  1 -2 -1  0 -2 -1  0

x*y

##  [1] 1 2 3 1 2 3 2 4 6 2 4 6 3 6 9 3 6 9

x/y

##  [1] 1.0000000 2.0000000 3.0000000 1.0000000 2.0000000 3.0000000 0.5000000
##  [8] 1.0000000 1.5000000 0.5000000 1.0000000 1.5000000 0.3333333 0.6666667
## [15] 1.0000000 0.3333333 0.6666667 1.0000000

x%*%y

##      [,1]
## [1,]   72

Select and substitute elements of vector

x[c(2,3)]

## [1] 2 3

x[-1]

##  [1] 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3

x[x>2]

## [1] 3 3 3 3 3 3

x[x==2]

## [1] 2 2 2 2 2 2

# substitute 
x[x == 2] <- 0.5
x

##  [1] 1.0 0.5 3.0 1.0 0.5 3.0 1.0 0.5 3.0 1.0 0.5 3.0 1.0 0.5 3.0 1.0 0.5 3.0

Matrix and operations

matrix(1:10,2,5)

##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    3    5    7    9
## [2,]    2    4    6    8   10

matrix(1:10,5,2)

##      [,1] [,2]
## [1,]    1    6
## [2,]    2    7
## [3,]    3    8
## [4,]    4    9
## [5,]    5   10

a <- matrix(12:20,3,3)
a[2,]

## [1] 13 16 19

a[,2]

## [1] 15 16 17

a[-2,]

##      [,1] [,2] [,3]
## [1,]   12   15   18
## [2,]   14   17   20

a[,-2]

##      [,1] [,2]
## [1,]   12   18
## [2,]   13   19
## [3,]   14   20

a[2,1]=21
a

##      [,1] [,2] [,3]
## [1,]   12   15   18
## [2,]   21   16   19
## [3,]   14   17   20

Compute inverse, determinant and eigen values of matrix

a<-matrix(c(11,21,31,21,32,43,12,32,54),3,3)
solve(a)

##            [,1]      [,2]       [,3]
## [1,] -1.9775281  3.471910 -1.6179775
## [2,]  0.7977528 -1.247191  0.5617978
## [3,]  0.5000000 -1.000000  0.5000000

det(a)

## [1] -178

solve(a)*det(a)

##      [,1] [,2] [,3]
## [1,]  352 -618  288
## [2,] -142  222 -100
## [3,]  -89  178  -89

t(a)

##      [,1] [,2] [,3]
## [1,]   11   21   31
## [2,]   21   32   43
## [3,]   12   32   54

eigen(a)

## eigen() decomposition
## $values
## [1] 91.6892193  5.6541299 -0.3433491
## 
## $vectors
##            [,1]       [,2]       [,3]
## [1,] -0.2573423 -0.7530908 -0.9049786
## [2,] -0.5253459 -0.1712782  0.3538153
## [3,] -0.8110405  0.6352306  0.2362806

Dataframe

name<-c('A','B','C')
chinese<-c(92,96,95)
math<-c(86, 85, 92)
score<-data.frame(name, chinese, math)
score

##   name chinese math
## 1    A      92   86
## 2    B      96   85
## 3    C      95   92

score[2]

##   chinese
## 1      92
## 2      96
## 3      95

score$math

## [1] 86 85 92

Solve problems using simulation

for loop

sim<-10000
p<-numeric(sim)
# numeric=NULL
for (i in 1:sim){
  p[i]<-  abs(mean(rnorm(10,20,sqrt(3)))-mean(rnorm(15,20,sqrt(3))))<0.1
}

mean(p)

## [1] 0.1079

using replicate

mean(replicate(10000,abs(mean(rnorm(10,20,sqrt(3)))-mean(rnorm(15,20,sqrt(3))))<0.1))

## [1] 0.1175

using apply function

A<-matrix(rnorm(250000,  20,sqrt(3)),10000,25)
head(A)

##          [,1]     [,2]     [,3]     [,4]     [,5]     [,6]     [,7]     [,8]
## [1,] 19.38996 17.17169 21.68928 21.53605 20.65358 22.17426 21.38341 21.84597
## [2,] 24.19254 16.41728 17.38523 19.59159 21.04825 21.72394 19.09154 20.05932
## [3,] 18.78697 19.26176 21.30118 17.42785 23.48094 21.43805 21.67202 19.02939
## [4,] 18.28735 20.88258 18.94152 22.03726 17.28934 19.28584 19.93725 21.10955
## [5,] 24.01023 23.40581 20.63495 17.79877 23.24974 20.12610 16.93461 20.70022
## [6,] 17.85909 20.47354 25.19677 21.99099 17.31272 19.55983 20.59263 19.42591
##          [,9]    [,10]    [,11]    [,12]    [,13]    [,14]    [,15]    [,16]
## [1,] 18.51713 20.84713 19.57668 21.89467 17.67141 21.88368 22.82527 20.65207
## [2,] 19.05246 15.62194 19.61946 21.59534 23.62048 18.71575 18.46270 19.78524
## [3,] 23.72749 20.49721 21.37010 22.58976 20.56298 19.15932 21.93338 21.54959
## [4,] 23.46802 22.11435 18.20004 19.95149 18.15870 22.65712 21.45752 19.10825
## [5,] 19.29525 19.80707 18.80652 18.82406 19.71545 18.61493 19.71839 18.63235
## [6,] 21.56731 20.30297 20.69190 17.73491 21.27104 16.61442 19.46909 20.05049
##         [,17]    [,18]    [,19]    [,20]    [,21]    [,22]    [,23]    [,24]
## [1,] 19.64516 19.75267 21.85729 20.35406 18.78100 21.15610 17.62467 23.93520
## [2,] 19.32782 18.94165 24.30256 20.28542 21.06995 19.14494 20.84978 20.83973
## [3,] 19.14005 18.49508 17.92618 19.37619 19.72210 19.60864 21.69689 20.09254
## [4,] 17.61147 21.91340 17.84243 19.13915 20.35991 21.18536 22.54633 20.92910
## [5,] 20.90751 18.38613 21.18468 18.76978 20.82689 22.03421 17.90602 19.40751
## [6,] 19.25169 16.92567 20.75857 17.92523 18.96401 18.55858 20.27221 20.00757
##         [,25]
## [1,] 21.18032
## [2,] 18.92239
## [3,] 21.01482
## [4,] 22.71097
## [5,] 20.41447
## [6,] 19.36075

f<-function(x)   {abs(mean(x[1:10])-mean(x[11:25]))}
# solve the mean by apply

mean(apply(A,1,f)>0.1)

## [1] 0.8859

using probability method

 pnorm(0.1,0,sqrt(0.5))-pnorm(-0.1,0,sqrt(0.5))

## [1] 0.1124629

Permutations and combinations

choose(10,2)

## [1] 45

# combn(10,2)
factorial(10)

## [1] 3628800

prod(1:10)

## [1] 3628800

Search value position in vector

a<-c(1,2,3,5,0,9)
which(a==min(a))

## [1] 5

sum(a)

## [1] 20

unique(a)

## [1] 1 2 3 5 0 9

length(a)

## [1] 6

min(a)

## [1] 0

max(a)

## [1] 9

all(c(3,4) %in% a)

## [1] FALSE

Solve directly and optimize

plot and find the range of solve

f<-function(x){x^2-exp(x)} 
uniroot(f,c(-0.8,-0.6))  $root

## [1] -0.7034781

f2<-function(x){abs(x^2-exp(x))} 
optimize   (f2,lower=-0.8,upper=-0.6)$minimum

## [1] -0.703479

Calculate probability using simulation method

questions: randomly select 3 numbers out of 1:10, the sum is 9.

badge<-1:10 
sim<-10000 
p<-numeric(sim) 

for (i in 1:sim){ a<-sample(badge,3,replace=F) 
p[i]<-sum(a)==9 }

mean(p)

## [1] 0.0278

questions: eat three flavors tangyuan.

Tangyuan<-c(rep('A',8),rep('B',8),rep('C',8))
sim<-10000
p<-numeric(sim)
# how to do it according to the condition
for (i in 1:sim){
  a<-sample(Tangyuan,24,replace=F)
  p[i]<-(length(unique(a[1:6]))==3)&(length(unique(a[7:12]))==3)&(length(unique(a[13:18]))==3)&(length(unique(a[19:24]))==3)
}
mean(p)

## [1] 0.4916

question: select 2 balls when they are the same color.

box1<-c(rep('white',5), rep("black",11), rep('red',8))
box2<-c(rep('white',10), rep("black",8), rep('red',6))
sim<-10000
p<-numeric(sim)

for (i in 1:sim){
  a<-sample(box1, 1)
  b<-sample(box2, 1)
  p[i]<- a==b
}
mean(p)

## [1] 0.3242

select after putting them back

box<-c(rep("white",4),rep("red",2))
sim<-10000
t<-numeric(sim)

for (i in 1:sim){
  a<-sample(box, 2 ,replace=T)
  #  there are two white balls
  t[i]<-length(a[a=="white"])==2
}
mean(t)

## [1] 0.4539

question: two students have the same birthday out of 30 students

n<-30
sim<-10000
t<-numeric(sim)

for (i in 1:sim){
  a<-sample(1:365, n, replace=T)
  t[i]<-n-length(unique(a))
}
1-mean(t==0)

## [1] 0.709

#  probability
1-prod(365:(365-30+1))/365^30

## [1] 0.7063162

An event is a set of outcomes. You can describe certain events using random variables (x, distribution). the distribution of random variable function.

Discrete random variable

question: choose correct one out of four answers

x<-0:5
y<-dbinom(x,5,1/4)
plot(x,y,col=2,type='h')

using plot the probability of shooting is 0.02, what is the most likelihood of hit with 400 shootings

k<-0:400
p<-dbinom(k,400,0.02)
plot(k,p,type='h',col=2)

plot(k,p,type='h',col=2,xlim=c(0,20))

dbinom(7,400,0.02)

## [1] 0.1406443

dbinom(8,400,0.02)

## [1] 0.1410031

Exponent distribution

question: lifetime of a light (lamda=1/2000)

integrate(dexp,rate=1/2000,1000,Inf)$value

## [1] 0.6065307

f<-function(x){dexp(x,rate=1/2000)}
integrate(f,1000,Inf) $value

## [1] 0.6065307

1-pexp(1000,rate=1/2000)

## [1] 0.6065307

mean(rexp(10000,rate=1/2000)>1000)

## [1] 0.6059

Normal distribution plot

continue variable

x<-seq(-3,3,0.01)
plot(x, dnorm(x,mean=0, sd=2),type="l",xlab="x",ylab = "f(x)", col=1,lwd=2,ylim=c(0,1))   #density function

lines(x, dnorm(x,mean=0, sd=1),lty=2, col=2,lwd=2)

lines(x, dnorm(x,mean=0, sd=0.5), lty=3,col=3,lwd=2)

exbeta<-c(expression(paste(mu,"=0,", sigma,"=2")), expression(paste(mu,"=0,",sigma,"=1")), expression(paste(mu,"=0,", sigma,"=0.5")))
legend("topright", exbeta, lty = c(1, 2,3),col=c(1,2,3),lwd=2)

x<-seq(-3,3,0.01)
plot(x, dnorm(x,mean=-1, sd=1),type="l",xlab="x",ylab = "f(x)", col=1,lwd=2,ylim=c(0,0.6))
lines(x, dnorm(x,mean=0, sd=1),lty=2, col=2,lwd=2)
lines(x, dnorm(x,mean=1, sd=1), lty=3,col=3,lwd=2)
exbeta<-c(expression(paste(mu,"=-1,", sigma,"=1")), expression(paste(mu,"=0,",sigma,"=1")), expression(paste(mu,"=1,", sigma,"=1")))
legend("topright", exbeta, lty = c(1, 2,3),col=c(1,2,3),lwd=2)

question: solve sigma using nomoral distribution

sigma<-1
repeat{
sigma<-sigma+0.01
if (pnorm(200,160,sigma)-pnorm(120,160,sigma)<0.80) break
}
sigma

## [1] 31.22

# alternative 
sigma<-1
while( pnorm(200,160,sigma)-pnorm(120,160,sigma)>=0.80){sigma<-sigma+0.01}
sigma

## [1] 31.22

Distribution of random variable function

qestion: x^2 and 2x distributions

x<-c(-1,0,1,2,2.5)
weight<-c(0.2,0.1,0.1,0.3,0.3)
toss<-sample(x,10000,replace=T,weight)
table(toss^2)/length(toss^2)

## 
##      0      1      4   6.25 
## 0.0981 0.2995 0.2998 0.3026

table(2*toss)/length(2*toss)

## 
##     -2      0      2      4      5 
## 0.2031 0.0981 0.0964 0.2998 0.3026

quetsion: continous vairable density

x <- seq(0,5,0.01)
truth<-rep(0,length(x))
truth[0<=x&x<1]<-2/3
truth[1<=x&x<2]<-1/3

plot(density(abs(runif(1000000,-1,2))),main=NA, ylim=c(0,1),lwd=3,lty=3)

lines(x,truth,col="red",lwd=2)
legend("topright",c("True Density","Estimated Density"), col=c("red","black"),lwd=3,lty=c(1,3))

Join and margin probability

question: x is randomly selected from 1:4, y randomly select from x

p<-function(x,y) {
sim<-10000
t<-numeric(sim)
for (i in 1:sim) {
a<-sample(1:4,1)
b<-sample(1:a,1)
t[i]<-(a==x)&(b==y) }
mean(t)
}
PF<-matrix(0,4,4)
for (i in 1:4) {
for (j in 1:4) {
PF[i,j]<-p(i, j) } }
PF

##        [,1]   [,2]   [,3]   [,4]
## [1,] 0.2412 0.0000 0.0000 0.0000
## [2,] 0.1264 0.1267 0.0000 0.0000
## [3,] 0.0893 0.0827 0.0844 0.0000
## [4,] 0.0629 0.0645 0.0612 0.0594

apply(PF,1,sum)

## [1] 0.2412 0.2531 0.2564 0.2480

apply(PF,2,sum)

## [1] 0.5198 0.2739 0.1456 0.0594

Multiple random variables plots

2 discrete variables distribution

x<-sample(1:4, 10000, replace=T, prob=c(1/4, 1/4, 1/4, 1/4))
y<-numeric(10000)
for(i in 1:10000) {
if(x[i]==1) {y[i]<-sample(1:4,1,replace=T, prob=c(1,0,0,0))}
if(x[i]==2) {y[i]<-sample(1:4,1,replace=T, prob=c(1/2,1/2,0,0))}
if(x[i]==3) {y[i]<-sample(1:4,1,replace=T, prob=c(1/3,1/3,1/3,0))}
if(x[i]==4) {y[i]<-sample(1:4,1,replace=T, prob=c(1/4,1/4,1/4,1/4))}
}
z1<-x+y
table(z1)/length(z1)

## z1
##      2      3      4      5      6      7      8 
## 0.2463 0.1292 0.2081 0.1469 0.1471 0.0622 0.0602

z2<-x*y
table(z2)/length(z2)

## z2
##      1      2      3      4      6      8      9     12     16 
## 0.2463 0.1292 0.0815 0.1894 0.0841 0.0632 0.0839 0.0622 0.0602

z3<-pmax(x,y)
table(z3)/length(z3)

## z3
##      1      2      3      4 
## 0.2463 0.2558 0.2495 0.2484

z4<-x/y
table(z4)/length(z4)

## z4
##                1 1.33333333333333              1.5                2 
##           0.5170           0.0622           0.0841           0.1924 
##                3                4 
##           0.0815           0.0628

two normal distributions X～N(0,1),Y～N(0,1), Z=X+Y, therefre Z~N(0,2)

Z<-function(n){
x<-seq(-4,4,0.01)
truth<-dnorm(x,0,sqrt(2))

plot(density(rnorm(n)+rnorm(n)),main="Density Estimate of the Normal Addition Model",ylim=c(0,0.4),lwd=2,lty=2)

lines(x,truth,col="red",lwd=2)
legend("topright",c("True","Estimated"),col=c("red","black"),lwd=2,lty=c(1,2))
}

Z(10000)

Generate a circle using simulated random dots

D={(x,y)|x^2 + y^2 <= 1}

x<-runif(10000,-1,1)
y<-runif(10000,-1,1)

a<-x[x^2+y^2<=1]
b<-y[x^2+y^2<=1]
plot(a,b,col=4)

oval

a<-3
b<-1
x<-runif(10000,-a,a)
y<-runif(10000,-b,b)
x1<-x[x^2/a^2+y^2/b^2<=1]
y1<-y[x^2/a^2+y^2/b^2<=1]
plot(x1,y1,col=3)

Expectation

discrete variable question: the benefit of products is different.

sim<-10000
t<-numeric(sim)

for (i in 1:sim) {
Y<-1500
X<-rexp(1,rate=1/10)

Y[1<X&X<=2]<-2000
Y[2<X&X<=3]<-2500
Y[3<X]<-3000
t[i]<-Y
}

mean(t)

## [1] 2731.7

continue variable

Central Limit Theorem

###Central Limit Theorem for Expotential distribution
layout(matrix(c(1,3,2,4 ),ncol=2))
r<-1000
lambda<-1/100

for (n in c(1,5,10,30)){
mu<-1/lambda
xbar<-numeric(r)
sxbar<-1/(sqrt(n)*lambda)

for(i in 1:r){
xbar[i]<-mean(rexp(n,rate=lambda))
}

hist(xbar,prob=T,main=paste('SampDist.Xbar,n=',n),col=gray(.8))
Npdf<-dnorm(seq(mu-3*sxbar,mu+3*sxbar,0.01),mu,sxbar)
lines(seq(mu-3*sxbar,mu+3*sxbar,0.01),Npdf,lty=2,col=2)
box()
}

#####The central limit theorem for uniform distribution
layout(matrix(c(1,3,2,4),ncol=2))
r<-10000
mu<-5
sigma<-10/sqrt(12)
for (n in c(1,5,10,30)){
xbar<-numeric(r)
sxbar<-sigma/sqrt(n)
for (i in 1:r){
xbar[i]<-mean(runif(n,0,10))}
hist(xbar,prob=T,main=paste('SampDist.Xbar,n=',n),col=gray(0.8),ylim=c(0,1/(sqrt(1*pi)*sxbar)))
XX<-seq(mu-3*sxbar,mu+3*sxbar,0.01)
Npdf<-dnorm(XX,mu,sxbar)
lines(XX,Npdf,lty=2,col=2)
box()}

Law of large numbers

N <- 5000
set.seed(123) 

x <- sample(1:10, N,    replace = T)
s <- cumsum(x)    
r.avg <- s/(1:N)

options(scipen = 10)  

plot(r.avg, ylim=c(1, 10), type = "l", xlab = "Observations"
     ,ylab = "Probability", lwd = 2)
lines(c(0,N), c(5.5,5.5),col="red", lwd = 2)

Empirical distribution

x<-c(-2,-1.2,1.5,2.3,3.5)
plot(ecdf(x),col=2)
abline(v=0,col=3)

question: three numbers are from N(2,9), and what is the prob of their mean >3?

A<-matrix(rnorm(30000,2,3),10000,3)
mean(apply(A,1,mean)>3)

## [1] 0.2789

Maximum likelihood estimate

question: eatimate mean and variance

sample<-c(1.38, 3.96, -0.16, 8.12, 6.30, 2.61, -1.35, 0.03, 3.94, 1.11)
n<-length(sample)
muhat<-mean(sample)
sigsqhat<-sum((sample-muhat)^2)/n
muhat

## [1] 2.594

sigsqhat

## [1] 8.133884

loglike<-function(theta){
a<--n/2*log(2*pi)-n/2*log(theta[2])-sum((sample-theta[1])^2)/(2*theta[2])
return(-a)
}
optim(c(2,2),loglike,method="BFGS")$par

## [1] 2.593942 8.130340

t distribution, F distribution plots, and common distributions

n<-30
x<-seq(-6,6,0.01)
y<-seq(-6,6,1)

Truth<-df(x,1,n)

plot(density(rt(10000,n)^2),main="PDF",ylim=c(0,1),lty=2,xlim=c(-6,6)) #simulation 

lines(x, dt(x,n), col=3) #t dist
 
lines(x, dchisq(x,2), col=4) #chisq dist
 
lines(x,Truth,col=2) #f dist
abline (v=0 ,col=7)  

 
points(y,dbinom(y, size=12, prob=0.2),col=1) #binomial dist
points(y,dpois(y, 6),col=2) #poisson dist

lines(y,dunif(-6:6,min=-6,max=6 ),col=5) #uniform

Probability basics

Daniel

2022-05-20