Probability Distributions

A probability distribution describes how the values of a random variable is distributed. For example, the collection of all possible outcomes of a sequence of coin tossing is known to follow the binomial distribution. Whereas the means of sufficiently large samples of a data population are known to resemble the normal distribution. Since the characteristics of these theoretical distributions are well understood, they can be used to make statistical inferences on the entire data population as a whole.

Probability Density Functions

For a discrete distribution, the probability density function (pdf) is the relative likelihood that the continuous random variable X will have some speci ed value (k). Since for continuous distributions the probability at a single point is infnitesemally small, and usually treated as zero, it is not equivalent to P(X = k).Rather, the density function is the height of a density curve (such as the well-known Gaussian bell-curve) at point k.

Some of the more common probability distributions available in R are given below.

fx4

The functions available for each distribution follow this format:

fx5

For every distribution there are four commands. The commands for each distribution are prepended with a letter to indicate the functionality:

To get a full list of the distributions available in R you can use the following command:

help(Distributions)

Normal Distribution basics

The normal distribution is defined by the following probability density function, where \(\mu\) is the population mean and \(\sigma^2\) is the variance.

fx1

If a random variable X follows the normal distribution, then we write:

fx2

In particular, the normal distribution with \(\mu = 0\) and \(\sigma = 1\) is called the standard normal distribution, and is denoted as \(\text~\)N(0,1). It can be graphed as follows.

fx3

The normal distribution is important because of the Central Limit Theorem, which states that the population of all possible samples of size n from a population with mean \(\mu\) and variance \(\sigma^2\) approaches a normal distribution with mean \(\mu\) and \(\sigma^2/n\) when n approaches infinity.

We look at some of the basic operations associated with normal probability distributions.

There are four functions that can be used to generate the values associated with the normal distribution. You can get a full list of them and their options using the help command:

help(Normal)

The first function we look at it is dnorm. Given a set of values it returns the height of the probability distribution at each point (Calculates the normal probability density at x (can be a vector). If you only give the points it assumes you want to use a mean of zero and standard deviation of one. Useful for plotting the normalcurve over the histogram. There are options to use different values for the mean and standard deviation, though:dnorm(x,mean,sd)

dnorm(0)
## [1] 0.3989423
dnorm(0)*sqrt(2*pi)
## [1] 1
dnorm(0,mean=4)
## [1] 0.0001338302
dnorm(0,mean=4,sd=10)
## [1] 0.03682701
v <- c(0,1,2)
dnorm(v)
## [1] 0.39894228 0.24197072 0.05399097
#Plotting the Bell Curve
x = seq(-5,5,by=.01)
y = dnorm(x)
plot(x,y,type="l",lwd=2,col="red")
abline(h=0)
abline(v=0)

#plotting the density function of N(2,0.25)
x<-seq(0,4,0.1)
plot(x,dnorm(x,2,0.25),type="l",lwd=3,col="darkgreen")

y <- dnorm(x,mean=2.5,sd=0.1)
plot(x,y,pch=19)

The second function we examine is pnorm–cumulative densities. Given a number or a list it computes the probability that a normally distributed random number will be less than that number which baiscally means that it calculates the area under the normal curve below q. This function also goes by the rather ominous title of the “Cumulative Distribution Function (CDF).” :pnorm(q,mean,sd)

pnorm(0)
## [1] 0.5
pnorm(1)
## [1] 0.8413447
pnorm(0,mean=2)
## [1] 0.02275013
pnorm(0,mean=2,sd=3)
## [1] 0.2524925
v <- c(0,1,2)
pnorm(v)
## [1] 0.5000000 0.8413447 0.9772499
#right tail p value 1-pnorm(z)---rejection area
1-pnorm(1.96)  #or 
## [1] 0.0249979
1-pnorm(1.96,mean=0,sd=1) # same as above or
## [1] 0.0249979
pnorm(1.96,mean=0,sd=1,lower.tail = FALSE) #  probabilities are P[X > x]same as above
## [1] 0.0249979
# to get the non rejection region
pnorm(1.96,mean=0,sd=1) # probabilities are P[X ??? x]
## [1] 0.9750021
#?pnorm
#similarly for the lower tail p values pnorm(-z)
pnorm(-1.96)
## [1] 0.0249979
pnorm(-1.96,mean=0,sd=1) # same as above 
## [1] 0.0249979
x <- seq(-20,20,by=.1)
y <- pnorm(x)
plot(x,y)

y <- pnorm(x,mean=3,sd=4)
plot(x,y)

  • let’s see some examples of the pnorm()

Let X be the mass of the horses, distributed as: \[X \text~ N(1000,50^2)\]

#P(X<=975) : 
pnorm(975,mean=1000,sd=50)
## [1] 0.3085375
#P(X>=975)=1-P(X<=975):
1-pnorm(975,mean=1000,sd=50)
## [1] 0.6914625
#P(940<=X<=1030) = P(X<=1030) - P(X<=940):
pnorm(1030,mean=1000,sd=50) -pnorm(940,mean=1000,sd=50)
## [1] 0.6106772
#P(X>=1030) 
pnorm(1030,mean=1000,sd=50,lower=FALSE)
## [1] 0.2742531

The next function we look at is qnorm which is the quantile function–inverse of pnorm. The idea behind qnorm is that you give it a probability, and it returns the number whose cumulative distribution matches the probability. For example, if you have a normally distributed random variable with mean zero and standard deviation one, then if you give the function a probability it returns the associated Z-score:qnorm(prob,mean,sd)

qnorm(0.5)
## [1] 0
qnorm(0.5,mean=1)
## [1] 1
qnorm(0.5,mean=1,sd=2)
## [1] 1
qnorm(0.5,mean=2,sd=2)
## [1] 2
qnorm(0.5,mean=2,sd=4)
## [1] 2
qnorm(0.75,mean=5,sd=2)
## [1] 6.34898
#to claculate the critical quantile values
#at 95%-two tailed tests
qnorm(0.975,mean=0,sd=1)
## [1] 1.959964
#one tailed on right
qnorm(0.95,mean=0,sd=1)
## [1] 1.644854
#left one tailed
qnorm(0.05,mean=0,sd=1)
## [1] -1.644854
#to claculate the 95% Confidence Interval
qnorm(c(0.025,0.975))
## [1] -1.959964  1.959964
x <- seq(0,1,by=.05)
y <- qnorm(x)
plot(x,y,pch=19,col="darkred")

y <- qnorm(x,mean=3,sd=2)
plot(x,y,pch=19,col="darkblue")

y <- qnorm(x,mean=3,sd=0.1)
plot(x,y,pch=19,col="darkorange")

The last function we examine is the rnorm function–randomdeviates, which can generate random numbers whose distribution is normal. The argument that you give it is the number of random numbers that you want, and it has optional arguments to specify the mean and standard deviation:rnorm(n,mean,sd) where n is sample size

rnorm(4)
## [1] -1.5066905  0.7173715  1.9380778  1.6004275
rnorm(4,mean=3)
## [1] 3.633810 2.234031 3.206232 3.102865
rnorm(4,mean=3,sd=3)
## [1] 1.508864 2.507345 4.042416 5.481711
y <- rnorm(200)
hist(y,col="steelblue")

y <- rnorm(200,mean=-2)
hist(y,col="gold")

y <- rnorm(200,mean=-2,sd=4)
hist(y,col="salmon")

#plotting the Q-Q plot
qqnorm(y)
qqline(y)

Fun exercise:

With these functions, I can do some fun plotting. I created a sequence of values from 4 to 4, and then calculate both the standard normal PDF and the CDF of each of those values. I also generated 1000 random draws from the standard normal distribution. I then plotted these next to each other. Whenever you use probability functions, you should, as a habit, remember to set the seed. Setting the seed means locking in the sequence of “random” (they are pseudorandom) numbers that R gives you, so you can reproduce your work later on.

set.seed(3000)
xseq<-seq(-4,4,.01)
densities<-dnorm(xseq, 0,1)
cumulative<-pnorm(xseq, 0, 1)
randomdeviates<-rnorm(1000,0,1)
 
par(mfrow=c(1,3), mar=c(3,4,4,2))

plot(xseq, densities, col="darkgreen",xlab="", ylab="Density", type="l",lwd=2, cex=2, main="PDF of Standard Normal", cex.axis=.8)

plot(xseq, cumulative, col="darkorange", xlab="", ylab="Cumulative Probability",type="l",lwd=2, cex=2, main="CDF of Standard Normal", cex.axis=.8)

hist(randomdeviates, main="Random draws from Std Normal", cex.axis=.8, xlim=c(-4,4))

#I can plot a histogram of y and add a curve() function to the plot using the mean and standard deviation of y as the parameters:
par(mfrow=c(1,1), mar=c(5.1,4.1,4.1,2.1))# making the plot window to default size
xseq<-seq(-4,4,.01)
y<-2*xseq + rnorm(length(xseq),0,5.5)
hist(y, prob=TRUE, ylim=c(0,.06), breaks=20,col="orange")
curve(dnorm(x, mean(y), sd(y)), add=TRUE, col="darkblue", lwd=2)

Plotting the critical Regions

#One-tailed test
xval <- seq(-3.2, 3.2, length = 1000)
yval<- dnorm(xval)

plot(xval, yval, type = "l", axes = TRUE, frame = FALSE, lwd = 3, xlab = "", ylab = "")
x <- seq(qnorm(.95), 3.2, length = 100)
polygon(c(x, rev(x)),c(dnorm(x), rep(0, length(x))), col = "salmon")
#?polygon
text(mean(x), mean(dnorm(x))+.02, "5%", cex = 1)
text(qnorm(.95), .01, "1.645", cex = 1)

#two tailed

val <- seq(-3.2, 3.2, length = 1000)
yval<- dnorm(xval)
plot(xval, yval, type = "l", axes = TRUE, frame = FALSE, lwd = 3, xlab = "", ylab = "")
x <- seq(qnorm(.975), 3.2, length = 100)
polygon(c(x, rev(x)),c(dnorm(x), rep(0, length(x))), col = "salmon")
text(mean(x), mean(dnorm(x))+.025, "2.5%", cex = 1)
text(qnorm(.975)-0.1, .01, "1.96", cex = 1)
x <- seq(-3.2, qnorm(.025),length = 100)
polygon(c(x, rev(x)),c(dnorm(x), rep(0, length(x))), col = "salmon")
text(mean(x), mean(dnorm(x))+.025, "2.5%", cex = 1)
text(qnorm(.025)+0.06, .015, "-1.96", cex = 1)
text(0, dnorm(0) / 5, "95%", cex = 1)

Visualisation of one tailed and two tailed tests:

#one tailed test plotting
bt<-seq(60,120,1)
plot(bt,dnorm(bt,90,10),type="l",main="One tailed test")
#pnorm(72,90,10) = 0.03593032
abline(v=72)
cord.x<-c(60,seq(60,72,1),72)
cord.y<-c(0,dnorm(seq(60,72,1),90,10),0)
polygon(cord.x,cord.y,col="skyblue")
text(65,0.005,"blue area is p=0.0359",cex=0.7)

#Two tailed test plotting
bt<-seq(60,120,1)
plot(bt,dnorm(bt,90,10),type="l",main="Two tailed test")
#pnorm(72,90,10) =0.03593032
abline(v=72)
abline(v=108)
cord.x<-c(60,seq(60,72,1),72)
cord.y<-c(0,dnorm(seq(60,72,1),90,10),0)
polygon(cord.x,cord.y,col="skyblue")
cord.x1<-c(108,seq(108,120,1),120)
cord.y1<-c(0,dnorm(seq(108,120,1),90,10),0)
polygon(cord.x1,cord.y1,col="skyblue")
cord.x2<-c(72,seq(72,108,1),108)
cord.y2<-c(0,dnorm(seq(72,108,1),90,10),0)
polygon(cord.x2,cord.y2,col="red")
text(64.5,0.005,paste("p=",round(pnorm(72,90,10),3)))
text(116,0.005,paste("p=",round(pnorm(72,90,10),3)))
text(90,0.02,"p=0.928")


Assignment 3

Q: 1) Plot a histogram with a density line for a sample generated out of a normal distribution with mean= 70 and standard deviation = 3 ; where size = 10

Show that as sample size increases (from 10 to 100 to 1000, 10000), mean of the sample approaches population mean. Show all the plots in same graph (hint : use mfrow command)

par(mfrow=c(2,2))
set.seed(007)  #We set the seed so that we get the same result everytime we run the code

s1<-rnorm(10, 70, 3)  #here we are sampling  a size of 10 from randomly distributed population of mean 70 and sd of 3
hist(s1,prob=T,col = "steelblue")
lines(density(s1),col="red",lwd=2)
abline(v=mean(s1),col="darkgreen",lwd=3) #drawing the mean as a green line of the sample
text(75,0.18,paste("N=10 mean =", 
                  round(mean(s1), 1)),cex=0.7)

s2<-rnorm(100, 70, 3) 
hist(s2,prob=T,col = "steelblue")
lines(density(s2),col="red",lwd=2)
abline(v=mean(s2),col="darkgreen",lwd=3)
text(76,0.12,paste("N=100 mean =", 
                   round(mean(s2), 1)),cex=0.7)

s3<-rnorm(1000, 70, 3) 
hist(s3,prob=T,col = "steelblue")
lines(density(s3),col="red",lwd=2)
abline(v=mean(s3),col="darkgreen",lwd=3)
text(75,0.12,paste("N=1000 mean =", 
                   round(mean(s3), 1)),cex=0.7)

s4<-rnorm(10000, 70, 3) 
hist(s4,prob=T,col = "steelblue")
lines(density(s4),col="red",lwd=2)
abline(v=mean(s4),col="darkgreen",lwd=3)
text(78,0.11,paste("N=10000 mean =", 
                   round(mean(s4), 1)),cex=0.6)

Q:2) Suppose two samples (n1, n2) are drawn from same population mean= 70 and standard deviation = 3 ; n1= 50 & n2 = 1000

  • Prove that var(n1) > var(n2)
set.seed(10)
sn1<-rnorm(50,70,3)
sn2<-rnorm(1000,70,3)
var_n1<-3^2/50
var_n2<-3^2/1000
var_n1>var_n2
## [1] TRUE

Hence variance of smaller is higher than variance of bigger sample

Since all distributions have means and variances, the distribution of \(\bar{x}\) must also have a mean and a variance, denoted by \(\mu_{\bar{x}}\) & \(\sigma_{\bar{x}}^2\)

These quantities are given by the following simple expressions:

fx6

  • The formula \(\mu_{\bar{x}} = \mu\) shows that the sample mean is unbiased for the population mean. (“The mean of the mean is the mean”).

and variance of the mean is:

fx7

  • What inference can be drawn about the relationship between sample size and variance? Prove it with varying sample size
set.seed(10)
sn1<-rnorm(50,70,3)
sn2<-rnorm(1000,70,3)
sn3<-rnorm(10000,70,3)
var_n1<-3^2/50
var_n2<-3^2/1000
var_n3<-3^2/10000

var_n1>var_n2
## [1] TRUE
var_n1>var_n3
## [1] TRUE
var_n2>var_n3
## [1] TRUE

Inference:

More importantly,\(\sigma_{\bar{x}}\) increases with the population standard deviation,and decreases as the sample size increases.

For a given value of \(\sigma\), the larger the sample size, the more tightly distributed \(\bar{X}\) is around \(\mu\) (and therefore the higher the probability that \(\bar{X}\) will be close to \(\mu\)).


devtools::session_info()
## Session info --------------------------------------------------------------
##  setting  value                       
##  version  R version 3.2.1 (2015-06-18)
##  system   x86_64, mingw32             
##  ui       RTerm                       
##  language (EN)                        
##  collate  English_India.1252          
##  tz       Asia/Calcutta               
##  date     2015-10-24
## Packages ------------------------------------------------------------------
##  package   * version date       source        
##  devtools    1.9.1   2015-09-11 CRAN (R 3.2.2)
##  digest      0.6.8   2014-12-31 CRAN (R 3.2.0)
##  evaluate    0.8     2015-09-18 CRAN (R 3.2.2)
##  htmltools   0.2.6   2014-09-08 CRAN (R 3.2.1)
##  knitr       1.11    2015-08-14 CRAN (R 3.2.2)
##  magrittr    1.5     2014-11-22 CRAN (R 3.2.1)
##  memoise     0.2.1   2014-04-22 CRAN (R 3.2.0)
##  rmarkdown   0.8.1   2015-10-10 CRAN (R 3.2.2)
##  stringi     0.5-5   2015-06-29 CRAN (R 3.2.1)
##  stringr     1.0.0   2015-04-30 CRAN (R 3.2.1)
##  yaml        2.1.13  2014-06-12 CRAN (R 3.2.1)

If you want to view the document in the browser, follow the link here: Assignment3_CAEA_SavitaKulkarni_R_Prob Distr.