Computer Lab 1: Distributions in R

Note: This link is very useful to find R Markdown cheats and hacks.

Problem nº1:

Instructions: Select the proper x-values and plot the density function and the distribution function of a normal distribution with mean equal to 10 and variance equal to 16.

Observations: Whenever I’m asked to select the proper values for my x vector, I’m being asked to apply a property of the normal distribution: practically all of its values are found within 4 to 5 standard deviations of the mean (about 99.7% already lie within 3).

In this case, the mean is \(\mu=10\) and the variance is \(\sigma^2=16\), so the standard deviation is \(\sigma=\sqrt{16}=4\). From here I can assume that practically all of my data (meaning the x vector I’m being asked to find) lies between \(10-5\cdot 4=10-20=-10\) (lower bound) and \(10+5\cdot 4=10+20=30\) (upper bound): \(x\in[-10,30]\).

x<-seq(-10,30,by=0.5)
mean<-10
stdev<-sqrt(16)
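As a quick check on this choice of range, the sketch below recomputes the bounds and measures how much probability the interval captures. The names `mu`, `sigma` and `k` are illustrative assumptions and are not part of the lab code.

```r
# Illustrative names (mu, sigma, k), assumed for this sketch only.
mu <- 10            # mean of the normal distribution
sigma <- sqrt(16)   # standard deviation = square root of the variance
k <- 5              # number of standard deviations to cover on each side

lower <- mu - k * sigma
upper <- mu + k * sigma

# Probability that a N(mu, sigma^2) variable falls inside [lower, upper]
coverage <- pnorm(upper, mu, sigma) - pnorm(lower, mu, sigma)
print(c(lower = lower, upper = upper, coverage = coverage))
```

With `k` set to 5 the interval leaves out well under one millionth of the probability, so the plots lose nothing visible at the edges.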

And now for the density function:

plot(x,dnorm(x,mean,stdev),type="l",col="red",xlab="selected x values",ylab="density function of x")

Finally, for the distribution function:

plot(x,pnorm(x,mean,stdev),type="l", col="blue", xlab="selected x values", ylab="distribution function of x")
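The two plots are linked: the distribution function at a point is the area under the density to the left of that point. A minimal sketch of that relationship follows, assuming an arbitrary evaluation point `q` and the illustrative names `mu` and `sigma` (none of these come from the lab code).

```r
# Illustrative check that pnorm() is the integral of dnorm().
mu <- 10           # assumed name, chosen so base::mean is not masked
sigma <- sqrt(16)
q <- 12            # arbitrary evaluation point (an assumption)

# Numerically integrate the density from -Inf up to q
area <- integrate(dnorm, lower = -Inf, upper = q, mean = mu, sd = sigma)$value

# Distribution function evaluated at the same point
cdf <- pnorm(q, mean = mu, sd = sigma)

print(c(area = area, cdf = cdf))
```

The two numbers should agree up to the numerical integration error.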

Problem nº2:

Instructions: Try to mimic the plot that appears in Wikipedia for the F-distribution.

Comments: Below is some information regarding the density, distribution function, quantile function and random generation function for the F distribution with df1 and df2 degrees of freedom (and optional non-centrality parameter ncp).

Functions:
- df(x, df1, df2, ncp, log = FALSE)
- pf(q, df1, df2, ncp, lower.tail = TRUE, log.p = FALSE)
- qf(p, df1, df2, ncp, lower.tail = TRUE, log.p = FALSE)
- rf(n, df1, df2, ncp)
Arguments:
- x, q are the quantile vectors.
- p is the probability vector.
- n is the number of observations. If length(n) > 1, the length is taken to be the number required.
- df1, df2 are the degrees of freedom. Inf is allowed.
- ncp is the non-centrality parameter. If omitted the central F is assumed.
- log, log.p are logical variables: if TRUE, probabilities p are given as log(p).
- lower.tail is also a logical variable: if TRUE (default), probabilities are P[X ≤ x], otherwise, P[X > x].
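The following sketch shows how the four functions and their arguments relate to each other. The values of `q`, `nu1` and `nu2` and the seed are arbitrary choices made for illustration.

```r
# Arbitrary illustrative values, not taken from the lab
q <- 2
nu1 <- 5
nu2 <- 2

p <- pf(q, nu1, nu2)                          # P[X <= q]
q_back <- qf(p, nu1, nu2)                     # qf() inverts pf(), so this recovers q
upper <- pf(q, nu1, nu2, lower.tail = FALSE)  # P[X > q], equal to 1 - p
log_dens <- df(q, nu1, nu2, log = TRUE)       # log of the density at q

set.seed(123)                                 # assumed seed, only for reproducibility
sample_f <- rf(1000, nu1, nu2)                # 1000 random draws from F(nu1, nu2)

print(c(p = p, q_back = q_back, upper = upper, log_dens = log_dens))
```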

Once we know this we can start to mimic the plot. In the example plot, x goes from 0 to 5 and several df values are used. The F-distribution shown in the example is a central F-distribution, so we can omit ncp, and there is no need to specify log because we do not want the densities on a logarithmic scale.

x<-seq(0,5,by=0.01)
#red plot:
d11<-1
d21<-1
#black plot:
d12<-2
d22<-1
#blue plot:
d13<-5
d23<-2
#green plot:
d14<-10
d24<-1
#gray plot:
d15<-100
d25<-100

Now for the plot:

plot(x,df(x,d11,d21),col="red",type="l",lwd=1,xlim=c(0,5),ylim=c(0,2.5))
lines(x,df(x,d12,d22),col="black",lwd=1)
lines(x,df(x,d13,d23), col="blue", lwd=1)
lines(x,df(x,d14,d24), col="green", lwd=1)
lines(x,df(x,d15,d25), col="grey", lwd=1)
legend(2.4,2.5, legend=c("d1=1 & d2=1", "d1=2 & d2=1","d1=5 & d2=2","d1=10 & d2=1","d1=100 & d2=100"),col=c("red","black","blue","green","grey"),lty=1)
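As an alternative, the same figure can be sketched more compactly by storing the degrees of freedom in vectors and drawing all the curves at once with `matplot()`. The vector names and the choice to start the grid at 0.01 (to avoid the infinite density at 0 when d1 = 1) are assumptions of this sketch, not part of the original solution.

```r
# Hypothetical vectorised version of the plot above
x_grid <- seq(0.01, 5, by = 0.01)   # starts at 0.01 to avoid Inf at x = 0
d1_all <- c(1, 2, 5, 10, 100)       # numerator degrees of freedom
d2_all <- c(1, 1, 2, 1, 100)        # denominator degrees of freedom
cols <- c("red", "black", "blue", "green", "grey")

# One column of densities per (d1, d2) pair
dens <- mapply(function(a, b) df(x_grid, a, b), d1_all, d2_all)

matplot(x_grid, dens, type = "l", lty = 1, col = cols,
        ylim = c(0, 2.5), xlab = "x", ylab = "density")
legend("topright", legend = paste0("d1=", d1_all, " & d2=", d2_all),
       col = cols, lty = 1)
```

Adding another curve then only requires appending one value to each of the three vectors.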

Problem nº3:

Instructions: Find the median of an F-distribution with 10 dof in the numerator and 10 dof in the denominator. Note: it will be helpful to use the ’q_(__)’ function.

Here is a very useful guide for the ‘qf()’ function:

Now, in this case, the median is the value that splits the distribution into two halves of equal probability, meaning the area under the density curve to the left of the median is 0.5.

area_value<-0.5
d1<-10
d2<-10
x<-qf(area_value,d1,d2)
print(x)

## [1] 1

As we can see in the R chunk above, the median of an F distribution with 10 d.o.f. in the numerator and 10 d.o.f. in the denominator is equal to 1.
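This result is not specific to 10 degrees of freedom. If X follows an F distribution with equal numerator and denominator degrees of freedom, then 1/X follows the same distribution, so the median must be 1. The sketch below checks this for a few arbitrarily chosen values (the vector `dofs` is an illustrative assumption).

```r
# Arbitrary degrees of freedom used for illustration
dofs <- c(1, 5, 10, 50)

# Median of F(d, d) for each d; qf() is vectorised over df1 and df2
medians <- qf(0.5, dofs, dofs)

# Converse check: the area to the left of 1 under F(10, 10)
area_left_of_one <- pf(1, 10, 10)

print(medians)
print(area_left_of_one)
```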

Problem nº4:

Instructions: Create a sample of 10,000 datapoints that follow a \(\chi^2\) distribution with 20 degrees of freedom by summing up random samples from a normal distribution. Compare the following:

- the empirical distribution function of these datapoints (red), their theoretical distribution function (blue) and the empirical distribution of samples generated using rchisq (dark red)
- the density function of a normal distribution with the same mean and variance (black).

Theoretical background: We know from the lectures that the sum of n squared independent standard normal variables follows a chi-squared distribution with n degrees of freedom. To build the empirical dataset, we will generate 20 standard normal samples, square them, and then add them up.

Method to generate the empirical data: First, we need to generate the 20 standard normal samples using the ’rnorm(__)’ function with the following parameters:

n=10000
mean=0
sd=1

The easiest way to work with the samples and operate on them is to store them in a matrix.

We need to create a matrix with as many rows as there are data points in each sample and as many columns as we have samples. Filling the matrix with the generated values gives us 20 standard normal samples, one per column.

Once this matrix has been built, the first step is to multiply the matrix by itself with the ’*’ operator in R, which multiplies element by element. Now we have 20 squared standard normal samples.

Finally, we will add these up row by row to create the chi-squared random variable we have been asked to build. We will use the ’rowSums(__)’ function.

datapoints<-rnorm(10000*20,mean=0,sd=1)

A <- matrix(datapoints,nrow = 10000,ncol = 20)

A_squared <- A*A

chi_squared <- rowSums(A_squared)
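Before plotting, a quick numerical check can confirm that the construction behaves as expected: a chi-squared variable with 20 degrees of freedom has mean 20 and variance 40. The sketch below repeats the construction under an assumed seed and with illustrative names (`n_points`, `dof`, `chi_check`) so that it can be run on its own.

```r
set.seed(2021)      # assumed seed, only to make the check reproducible
n_points <- 10000   # number of datapoints
dof <- 20           # degrees of freedom = number of squared normals per datapoint

# One standard normal sample per column, squared element by element, summed by row
z <- matrix(rnorm(n_points * dof), nrow = n_points, ncol = dof)
chi_check <- rowSums(z^2)

# Sample moments, to be compared with the theoretical 20 and 40
print(c(mean = mean(chi_check), variance = var(chi_check)))
```

The sample values will not match exactly, but with 10,000 datapoints they should be close to the theoretical ones.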

Comparing the empirical results to the theoretical ones: When we are asked to plot the density function of a normal distribution with the same mean and variance, we are being asked for a normal distribution with mean \(\mu=n=20\) and variance \(\sigma^2=2n=40\), where \(n\) is the number of d.o.f. of the chi-squared distribution. This information is contained in the lectures.

#creating the theoretical values:
x<-seq(0,100,by=0.01)#we need this vector to plot the densities against, it has 10,001 elements.
x1<-rchisq(10000,20)
plot(density(chi_squared),type="l",col="red",main="Comparison of several forms of the chi-squared distribution")
lines(x,dchisq(x,20),col="blue")
lines(density(x1),col="darkred")
lines(x,dnorm(x,20,sqrt(40)), col="black")
legend(35,0.07, legend=c("empirical chi-squared", "theoretical chi-squared","empirical chi-squared using rchisq","normal with mean=n and variance=2n"),col=c("red","blue","darkred","black"),lty=1)
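The chunk above compares densities. Since the instructions also mention the empirical distribution function, a possible sketch of that comparison using `ecdf()` and `pchisq()` is given below. The seed, the grid and the names `chi_emp`, `chi_r` and `max_gap` are assumptions made so that the sketch runs on its own.

```r
set.seed(2021)   # assumed seed, only for reproducibility
dof <- 20

# Empirical sample built from squared standard normals, as in the method above
chi_emp <- rowSums(matrix(rnorm(10000 * dof), ncol = dof)^2)
# Empirical sample generated directly with rchisq
chi_r <- rchisq(10000, dof)

grid <- seq(0, 60, by = 0.1)   # assumed evaluation grid

plot(ecdf(chi_emp), col = "red", do.points = FALSE, verticals = TRUE,
     main = "Empirical and theoretical distribution functions",
     xlab = "x", ylab = "F(x)")
lines(grid, pchisq(grid, dof), col = "blue")
lines(ecdf(chi_r), col = "darkred", do.points = FALSE, verticals = TRUE)

# Largest vertical gap between the empirical and theoretical curves on the grid
max_gap <- max(abs(ecdf(chi_emp)(grid) - pchisq(grid, dof)))
print(max_gap)
```

A small `max_gap` indicates that the hand-built sample follows the theoretical distribution function closely.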

Problem nº5:

Instructions: Create a sample of 10,000 datapoints that follow a Student’s t-distribution with 20 degrees of freedom.

Theoretical background: We know from the lectures that a Student’s t-distributed variable is the quotient of a standard normal variable and the square root of an independent chi-squared variable divided by its d.o.f.

We will take the same approach as in the previous problem and will compare the resulting sample with the theoretical density given by R’s ‘dt()’ function.

z<-rnorm(10000)
chi<-rchisq(10000,20)
student<-z/(sqrt(chi/20))
x<-seq(-50,50,by=0.01)
plot(density(student),col="red",type="l", main="Empirical and theoretical Student's t")
lines(x,dt(x,20),col="blue")
legend(3,0.4,legend=c("empirical Student's t","theoretical Student's t"),col=c("red","blue"),lty=1)
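Besides the visual comparison, the sample can be checked numerically against the theoretical quantiles from `qt()` and against the theoretical variance of a Student’s t with \(\nu\) degrees of freedom, which is \(\nu/(\nu-2)\) for \(\nu>2\). In this sketch the chi-squared part is built from squared normals as in Problem nº4; the seed and the names `student_check`, `emp_q` and `theo_q` are illustrative assumptions.

```r
set.seed(2021)   # assumed seed, only for reproducibility
dof <- 20

z <- rnorm(10000)                                         # standard normal numerator
chi <- rowSums(matrix(rnorm(10000 * dof), ncol = dof)^2)  # independent chi-squared with dof d.o.f.
student_check <- z / sqrt(chi / dof)                      # Student's t by construction

probs <- c(0.025, 0.5, 0.975)
emp_q <- quantile(student_check, probs)   # empirical quantiles of the sample
theo_q <- qt(probs, dof)                  # theoretical quantiles

print(rbind(empirical = emp_q, theoretical = theo_q))
print(c(sample_var = var(student_check), theoretical_var = dof / (dof - 2)))
```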

Mariana Fuentes Miralles

25/2/2021
