Note: this lab is based on a lab from https://data.library.virginia.edu/understanding-empirical-cumulative-distribution-functions/
Imagine a simple experiment, say flipping a coin 3 times. Here are all the possible outcomes, where H = heads and T = tails: HHH, HHT, HTH, THH, HTT, THT, TTH, TTT. Counting the number of heads in each outcome, we can quickly generate the probabilities of 0, 1, 2, or 3 heads in R using the dbinom function:
dbinom(0:3, size = 3, prob = 0.5)
## [1] 0.125 0.375 0.375 0.125
The function used to generate these probabilities is often referred to as the “density” function, hence the “d” in front of “binom”. Distributions that generate probabilities for discrete values, such as the binomial in this example, are sometimes called “probability mass functions” or PMFs. Distributions that generate probabilities for continuous values, such as the Normal, are sometimes called “probability density functions”, or PDFs. However, in R, regardless of PMF or PDF, the function that generates the probabilities is known as the “density” function.
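One way to see the PMF/PDF distinction: for a discrete distribution the “d” function returns an actual probability, while for a continuous one it returns a density, which is not a probability and can even exceed 1. A quick illustration (the narrow sd of 0.1 is chosen just to make the point):
dbinom(1, size = 3, prob = 0.5)   # a true probability
## [1] 0.375
dnorm(0, mean = 0, sd = 0.1)      # a density, not a probability (here > 1)
## [1] 3.989423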
We can quickly visualize this probability distribution with the barplot function:
barplot(dbinom(x = 0:3, size = 3, prob = 0.5), names.arg = 0:3)
These probabilities accumulate as we move from left to right along the x-axis of the probability distribution. The distribution of these accumulating probabilities is known as the cumulative distribution. Again, there is a function in R that generates these probabilities for us: instead of a “d” in front of “binom”, we put a “p”.
The resulting plot (drawn below) is sometimes called a step plot: as soon as you hit a point on the x-axis, you “step” up to the next probability. The probability of 0 or less is 0.125, hence the horizontal line from x = 0 to x = 1 at height 0.125. At 1 we step up to 0.5, because the probability of 1 or less is 0.5. And so forth. At 3, we have a dot at 1: the probability of 3 or fewer is a certainty, since we’re guaranteed to get 3 or fewer successes.
pbinom(0:3, size = 3, prob = 0.5)
## [1] 0.125 0.500 0.875 1.000
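These cumulative probabilities are simply the running sums of the dbinom probabilities, which we can confirm with cumsum:
cumsum(dbinom(0:3, size = 3, prob = 0.5))
## [1] 0.125 0.500 0.875 1.000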
# plotting cdf
plot(0:3, pbinom(0:3, size = 3, prob = 0.5),
     ylim = c(0, 1),
     xaxt = 'n', pch = 19,
     ylab = 'Prob', xlab = 'x')
axis(side = 1, at = 0:3, labels = 0:3)
segments(x0 = c(0, 1, 2), y0 = c(0.125, 0.5, 0.875),
         x1 = c(1, 2, 3), y1 = c(0.125, 0.5, 0.875))
Now let’s look at a continuous distribution: the standard Normal. We can draw its density curve with the curve function. The first argument, dnorm(x), is essentially the math formula that draws the line. Notice the “d” in front of “norm”; this is the “density” function. The defaults of the dnorm function are mean = 0 and standard deviation = 1. The from and to arguments say to draw this curve using values of x ranging from -3 to 3.
curve(dnorm(x), from = -3, to = 3)
The curve is a smooth line, which means it’s a probability distribution for all real numbers. The area under the curve is 1 because it’s a probability distribution.
Imagine reaching into this distribution and drawing a number. What’s the probability of getting 1.1? It’s essentially 0. Why? Because there’s a \(\frac{1}{\infty}\) chance of selecting it. Why is \(\infty\) in the denominator? Because there is an infinite number of possibilities. The y-axis values don’t represent probability but rather “density”: the density is essentially the probability of a small range of values divided by the width of that range. Remember, we don’t use normal distributions (or any continuous distribution) to get exact probabilities; we use them to get probabilities for a range of values.
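We can check that interpretation numerically with the cumulative distribution function pnorm (introduced just below): the probability of a narrow interval around 1.1, divided by the interval’s width, is close to the density at 1.1. The width 0.01 is an arbitrary small choice:
(pnorm(1.1 + 0.005) - pnorm(1.1 - 0.005)) / 0.01   # probability / width
dnorm(1.1)                                         # both are approximately 0.2179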
For example, what is the probability that x is less than or equal to -1? For this we can use the pnorm function, which is the cumulative distribution function for the normal distribution.
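Evaluating pnorm at -1 gives the probability in question:
pnorm(-1)
## [1] 0.1586553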
mosaic::plotDist('norm', groups = x < -1,
type='h', col = c('pink', 'green'))
The plot shows the Normal density, but the shaded region to the left of -1 represents a cumulative probability: its area is 0.1586553, the probability we draw a value of -1 or less from this distribution. Recall we used the cumulative distribution function to get this value. To visualize all the cumulative probabilities for the standard normal distribution, we can again use the curve function, but this time with pnorm.
Also, if we look at -1 on the x-axis, go straight up to the line, and then go directly left to the y-axis, we should land on 0.1586553. We can add these reference lines to the plot using segments:
curve(pnorm(x), from = -3, to = 3)
segments(x0 = -1, y0 = 0,
         x1 = -1, y1 = pnorm(-1), col = 'red')
segments(x0 = -1, y0 = pnorm(-1),
         x1 = -3, y1 = pnorm(-1), col = 'blue')
The cumulative distributions we explored above were based on theory. We used the binomial and normal cumulative distributions, respectively, to calculate probabilities and visualize the distribution. In real life, however, the data we collect or observe does not come from a theoretical distribution. We have to use the data itself to create a cumulative distribution.
An empirical cumulative distribution function (also called the empirical distribution function, ECDF, or just EDF) and a cumulative distribution function are basically the same thing: they are both probability models for data. However, while a CDF is a hypothetical model of a distribution, the ECDF models empirical (i.e. observed) data. To put this another way, the ECDF is the probability distribution you would get if you sampled from your sample, instead of the population. Let’s say you have a set of experimental (observed) data \(x_1, x_2, \ldots, x_n\). The ECDF will give you the fraction of sample observations less than or equal to a particular value of x.
More formally, if you have the order statistics (\(y_1 < y_2 < \ldots < y_n\)) from an observed random sample \(x_1, \ldots, x_n\), then the empirical distribution function is defined as an average of indicator variables, \(F_{n}(t)=\frac{1}{n}\sum_{j=1}^{n}I(x_j\le t)\), where \(I\) is the indicator function (1 if the condition holds, 0 otherwise). The formula is easier to work with than it looks. The following example shows how you can use the formula to generate an ECDF for your experimental data, and how the ECDF can be used as a comparison against a hypothetical distribution.
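To make the formula concrete, here is a minimal sketch of it in R; the helper name my_ecdf is ours, not part of base R. The fraction of observations at or below t is just the mean of the indicator x <= t (here applied to the same five values used in the toy example below):
my_ecdf <- function(x, t) mean(x <= t)   # F_n(t): proportion of x values <= t
my_ecdf(c(0, 4, 2, 2, 1), 2)             # 4 of the 5 values are <= 2
## [1] 0.8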
We can do this in R with the ecdf function. ECDF stands for “Empirical Cumulative Distribution Function”. Just as pbinom and pnorm were the cumulative distribution functions for our theoretical distributions, ecdf creates a cumulative distribution function for our observed data. Let’s try this out with the rock data set that comes with R.
The rock data set contains measurements on 48 rock samples from a petroleum reservoir. It contains 4 variables: area, peri, shape, and perm. We’ll work with the area variable, which is the total area of pores in each sample.
The ecdf function works on numeric vectors, which are often columns of numbers in a data frame. Before handing it the area column of the rock data frame, let’s build intuition with a small vector of five values, working out the probabilities and cumulative probabilities by hand:
a <- c(0, 4, 2, 2, 1)       # five observed values
b <- sort(a)
u <- unique(b)              # only 4 unique values; 2 appears twice
posbl.val <- as.integer(u)              # possible values: 0, 1, 2, 4
Prob <- c(1/5, 1/5, 2/5, 1/5)           # probability of each unique value
CumProb <- c(1/5, 2/5, 4/5, 5/5)        # running sums of Prob
d <- as.data.frame(rbind(posbl.val, Prob, CumProb))
d
## V1 V2 V3 V4
## posbl.val 0.0 1.0 2.0 4.0
## Prob 0.2 0.2 0.4 0.2
## CumProb 0.2 0.4 0.8 1.0
plot(ecdf(a))
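To confirm the hand computation, we can evaluate the function that ecdf returns at each possible value; it reproduces the CumProb row:
ecdf(a)(posbl.val)
## [1] 0.2 0.4 0.8 1.0
Now let’s apply the same idea to the area column of the rock data frame.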
Emp.cdf<-ecdf(rock$area)
plot(Emp.cdf)
Looking at the plot we can see the estimated probability that the area of a sample is less than or equal to 8000 is about 0.6. But we don’t have to rely on eye-balling the graph. We have a function! We can use it to get a more precise estimate. Just give it a number within the range of the x-axis and it will return the cumulative probability. We can get estimated probabilities that area is less than or equal to 4000, 6000, and 8000.
Emp.cdf(8000)
## [1] 0.625
Emp.cdf(c(4000, 6000, 8000))
## [1] 0.1250000 0.3333333 0.6250000
There is also a summary method for ECDF functions. It returns a summary of the unique values of the observed data. Notice it’s similar to the traditional summary method for numeric vectors, but the result is slightly different since it’s summarizing the unique values instead of all values.
summary(Emp.cdf)
## Empirical CDF: 47 unique values with summary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1016 5292 7416 7173 8871 12210
summary(rock$area)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1016 5305 7487 7188 8870 12210
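Where did the 47 unique values come from? The jump points of the step function are available via the knots function:
length(knots(Emp.cdf))   # number of unique area values among the 48 samples
## [1] 47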
We can superimpose a theoretical cumulative distribution over the ECDF. This may help us assess whether or not we can assume our data could be modeled with a particular theoretical distribution. For example, could our data be thought of as having been sampled from a Normal distribution? Below we plot the step function and then overlay a cumulative Normal distribution using the mean and standard deviation of our observed data.
plot(ecdf(rock$area))
curve(pnorm(x, mean(rock$area), sd(rock$area)),
      from = 0, to = 12000, add = TRUE, col = 'red', lwd = 2)
The lines seem to overlap quite a bit, suggesting the data could be approximated with a Normal distribution. We can also compare estimates from our ECDF with a theoretical CDF. We saw that the probability that area is less than or equal to 8000 is about 0.625. How does that compare to a Normal cumulative distribution with a mean and standard deviation of rock$area?
pnorm(8000, mean = mean(rock$area), sd = sd(rock$area))
## [1] 0.6189223
That’s quite close!
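For a more formal check, one option is a Kolmogorov-Smirnov test, which measures the largest vertical gap between the ECDF and the theoretical CDF. A minimal sketch; note that because the mean and sd are estimated from the same data, the reported p-value is only approximate:
ks.test(rock$area, "pnorm", mean = mean(rock$area), sd = sd(rock$area))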
Suppose we generate a chi-square random variable using the transformation method: if \(Z \sim N(0,1)\), then \(V = Z^2\) follows \(\chi^2_{1}\). The sample data \(V\) can be compared with the \(\chi^2_{1}\) distribution using a Q-Q plot; if the sample distribution really is chi-square, the plot should be roughly linear. Alternatively, we can superimpose the \(\chi^2_{1}\) density, along with an empirical density estimate, on a histogram of \(V\) to make the comparison graphically.
par(mfrow = c(2, 2))
set.seed(12345698)
n1 <- 1000
Z <- rnorm(n1)            # standard normal draws
V <- Z^2                  # transformed sample; should follow chi-square(1)
W <- rchisq(n1, df = 1)   # random draws from chi-square(1)
# ppoints() generates the probabilities at which qqplot and qqnorm
# evaluate the inverse distribution (quantile) function
W1 <- qchisq(ppoints(n1), df = 1)
qqplot(W, V, xlab = "Chi-Squared(1) using rchisq", ylab = "Sampled Chi-square")
abline(0, 1)              # the line y = x is added for reference
qqplot(W1, V, xlab = "Chi-Squared(1) using qchisq",
       ylab = "Sampled Chi-square", main = "Q-Q plot vs. theoretical quantiles")
abline(0, 1)
# histogram of V with chi-square densities (1 df vs. 2 df) superimposed
hist(V, prob = TRUE, main = "Histogram with 1 vs. 2 df")
curve(dchisq(x, df = 1), col = 'green', add = TRUE)
curve(dchisq(x, df = 2), col = 'red', add = TRUE)
# histogram with the chi-square(1) density and an empirical density estimate;
# it may be easier to compare the theoretical curve to a density estimate
# than to the histogram bars
hist(V, prob = TRUE, main = "Hist with emp density and chisq(1)")
curve(dchisq(x, df = 1), col = 'green', add = TRUE)
lines(density(V), col = 'red')
Another graphical assessment of normality is the Q-Q plot, which can also be easily done in R using the qqnorm and qqline functions. The idea is that if the points fall along the diagonal line, then we have good evidence the data are plausibly normal. Applied to the rock area data, this plot again suggests the data could be well approximated with a Normal distribution. For more on Q-Q plots, see our article https://data.library.virginia.edu/understanding-q-q-plots/
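A minimal sketch, assuming we are still assessing rock$area (the variable examined above):
qqnorm(rock$area)
qqline(rock$area)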