require(moaic)
require(dplyr)
The e.c.d.f. (empirical cumulative distribution function) Fn is a step function with jumps i/n at observation values, where i is the number of tied observations at that value. Missing values are ignored. or observations x= (x1,x2, … xn), Fn is the fraction of observations less or equal to t, i.e.,
\[ Fn(t) = #{xi <= t}/n = 1/n sum(i=1,n) Indicator(xi <= t) \]
X<-c(3.7,2.7,3.3,1.3,2.2,3.1)
n<-length(X)
Fn<-ecdf(X) #empirical cdf
Fn
## Empirical CDF
## Call: ecdf(X)
## x[1:6] = 1.3, 2.2, 2.7, ..., 3.3, 3.7
plot(Fn)
a<-Fn(X) # returns the percentile for x
a
## [1] 1.0000000 0.5000000 0.8333333 0.1666667 0.3333333 0.6666667
Defining Quantiles. We (and R) define quantiles for the set of values (3.7, 2.7, 3.3, 1.3, 2.2, 3.1) as follows: First sort the values into order (1.3, 2.2, 2.7, 3.1, 3.3, 3.7) and then associate the ordered values with sample fractions equally spaced from zero to one.
Sample fraction: 0 .2 .4 .6 .8 1
Quantile: 1.3 2.2 2.7 3.1 3.3 3.7
The other quantiles can be obtained by linear interpolation. Read the article posted on blackboard ofr further information.
#par(mfrow=c(2,2))
xaxis.values<-(1:n-1)/(n-1) # you can use below or sinmply spell out the sequence for x-axis
xaxis.values
## [1] 0.0 0.2 0.4 0.6 0.8 1.0
xaxis.values1<-(0:n)/n # you can use this its going ot round it up to same as above
xaxis.values1
## [1] 0.0000000 0.1666667 0.3333333 0.5000000 0.6666667 0.8333333 1.0000000
q.emperical<-quantile(X) #empirical quantile of x
q.emperical
## 0% 25% 50% 75% 100%
## 1.300 2.325 2.900 3.250 3.700
p.emperical<-quantile(X, seq(0,1,by=.1)) # if you want to see percentiles, change the probability sequence, i.e., second argument
p.emperical
## 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
## 1.30 1.75 2.20 2.45 2.70 2.90 3.10 3.20 3.30 3.50 3.70
# looking at quantiles visually
fig1<-plot(xaxis.values,sort(X), type="b",xlab="Sample Fraction",ylab="Quantile", main="Computing the median and quantile")
Quantile Quantile plots (QQplots) allow us to compare the quantiles of two sets of numbers they can also be used to compare distribution of one set of random values (emperical distribution) with some theoretical distribution this kind of comparison is much more detailed than a simple comparison of means or median of a distribution but we need more observations for such comparisons. Comparing empirical and theoretical quantiles without using qqplot by using plot function
set.seed(123456789)
y<-rnorm(n,mean(X),var(X))
q.theroetical<-quantile(y) #theoretical quantile of the normal distribution
q.theroetical
## 0% 25% 50% 75% 100%
## 1.551473 2.197435 2.633718 3.072814 3.772186
X<-c(3.7,2.7,3.3,1.3,2.2,3.1) # Observed Sample
q.emperical<-quantile(X) # empirical Quantile, quanrtiles of the observed sample
q.emperical
## 0% 25% 50% 75% 100%
## 1.300 2.325 2.900 3.250 3.700
plot(q.theroetical,q.emperical, xlab="Emperical Quantiles", ylab="Theroetical Quantiles", main="Q-QPlot using plot function")
abline(a=0,b=1,lty="dotted") #adding a reference line
xlimit<-range(0,X)
xlimit
## [1] 0.0 3.7
ylimit<-range(0,y)
ylimit
## [1] 0.000000 3.772186
# Now we produce Q-Q-Plot with expanded limits and mark in the line y=x
qqplot(X,y, xlim=xlimit, ylim=ylimit, xlab="Empirical Quantiles", ylab="Theoretical Quantiles", main="QQPlot using QQPlot function")
abline(a=0,b=1,lty="dotted")
Using QQnorm, if we decide to compare empirical with theoretical quantiles. If the values being plotted resemble a sample from a normal distribution, they will lie on straight line with intercept =mean of the values and slope =SD of values
qqnorm(X,main="QQPlot using QQnorm to compare observed random variable with normal")
qqline(X) # adding straight line to the plot, the line passes through the first and third quartiles
Suppose we generate Chi-square random Variable using transformation method Z~ \[ N(0,1) \] then V=Z^2~ \[ \chi_{(1)}^{2} \]. The sample data (V) can be compared with the Chisq(1) distribution using a QQPlot, if the sample distirbution is Chisq, then the plot should be linear.
par(mfrow=c(2,2))
set.seed(123456989)
n1<-1000
Z<-rnorm(n1)
V<-Z^2
W<-rchisq(n1,df=1)
#W
W1<-qchisq(ppoints(n1),df=1) #ppoints() is used in qqplot and qqnorm to generate the set of probabilities at which to evaluate the inverse distribution.
#V
qqplot(W,V, xlab="Chi-Squared(1) using rchisq", ylab="Sampled Chisquare")
abline(0,1)
qqplot(W1,V, xlab="Chi-Squared(1) using qchisq", ylab="Sampled Chisquare", main="QQPlot of the auto ordered sample")
abline(0,1) # the line X=q is added for refrence
# plotting histogram and superimposing chisquared
hist(V, prob=TRUE,main="Histogram with 2 vs. 1 df")
curve( dchisq(x, df=1), col='green', add=TRUE)
curve( dchisq(x, df=2), col='red', add=TRUE )
# plotting empirical density and superimposing chisquared
hist(V,prob=TRUE, main="Hist with emp density and chisq(1)")
curve( dchisq(x, df=1), col='green', add=TRUE)
lines( density(V), col='orange') #It may be easier to compare the therotical curve to a density estimate rather than