Part One: Independent or Dependent Variables?

1. Use R to answer this set of questions. First, generate 1000 random numbers from the Uniform(0,1) distribution and call it X. Then generate another 1000 random numbers from the Uniform(0,1) distribution and call it Y. Let U = X - Y and V = X + Y.

1a. Are X and Y independent? Use a scatterplot and contour plot to support your answer.
library(MASS)
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.0.6     v dplyr   1.0.3
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## x dplyr::select() masks MASS::select()
set.seed(123)
X <- runif(1000, 0, 1)
Y <- runif(1000, 0, 1)
U <- X - Y
V <- X + Y

plot(X, Y)

XY.kde <- kde2d(X, Y)
contour(XY.kde)

It looks like X and Y are independent. The scatterplot fills the unit square evenly with no trend or clustering, and the contour plot shows no ridge or tilt, so knowing the value of X gives no information about Y.

1b. Are U and V independent? Use a scatterplot and contour plot to support your answer.
UV.kde <- kde2d(U, V)


plot(U, V)

contour(UV.kde)

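U and V are not independent. The scatterplot shows no linear trend, but the points fill a diamond (the unit square rotated by 45 degrees) rather than a rectangle, and the contour plot follows the same diamond-shaped outline. The possible range of U depends on V: when V is near 0 or 2, U must be near 0, while at V = 1, U can fall anywhere between -1 and 1. U and V are uncorrelated, but a lack of correlation does not imply independence.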
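
A quick numeric check can complement the plots. The sketch below is meant to be run on its own; the names `x_chk`, `y_chk`, `u_chk`, and `v_chk` are hypothetical and chosen so the objects above are not overwritten. It compares the correlation of U and V with the spread of U inside two slices of V. If U and V were independent, the spread of U would be about the same in every slice.

```r
# Hypothetical names so that X, Y, U, V from the report are left untouched.
set.seed(1)
x_chk <- runif(1000)
y_chk <- runif(1000)
u_chk <- x_chk - y_chk
v_chk <- x_chk + y_chk

# Linear association: close to 0, because Cov(U, V) = Var(X) - Var(Y) = 0.
uv_cor <- cor(u_chk, v_chk)

# Spread of U within two slices of V.
u_range_low_v <- diff(range(u_chk[v_chk < 0.5]))
u_range_mid_v <- diff(range(u_chk[v_chk > 0.75 & v_chk < 1.25]))

# Support constraint: |U| <= min(V, 2 - V), up to floating-point error.
inside_diamond <- all(abs(u_chk) <= pmin(v_chk, 2 - v_chk) + 1e-12)

c(cor = uv_cor, range_low_v = u_range_low_v, range_mid_v = u_range_mid_v)
inside_diamond
```

A near-zero correlation together with a spread that changes across slices is the signature of variables that are uncorrelated but dependent.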

2. Use R to answer this set of questions. First, generate 1000 random numbers from the Normal(0, 1) distribution and call it Z1. Then generate another 1000 random numbers from the Normal(0,1) distribution and call it Z2. Let U2 = Z1-Z2 and V2 = Z1 + Z2.

2a. Are Z1 and Z2 independent? Use a scatterplot and a contour plot to support your answer.

Z1<-rnorm(1000, 0, 1)
Z2<-rnorm(1000, 0, 1)
U2<- Z1-Z2
V2<- Z1+Z2

plot(Z1, Z2)

Z1Z2.kde <- kde2d(Z1, Z2)
contour(Z1Z2.kde)

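These plots suggest that Z1 and Z2 are independent. The scatterplot is a round cloud centered at the origin and the contours are roughly circular with no tilt. Unlike the uniform example, the points concentrate near the center, but that reflects the bell shape of each normal variable, not a stronger form of independence.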

2b. Are U2 and V2 independent? Use a scatterplot and a contour plot to support your answer.

plot(U2, V2)

U2V2.kde<-kde2d(U2, V2)

contour(U2V2.kde)

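U2 and V2 also appear to be independent. The scatterplot is again a round cloud with roughly circular contours and no clear pattern; it is just more spread out than the plot of Z1 and Z2.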
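
The visual impression can be backed by a short calculation. This is a sketch that assumes only what the setup states: Z1 and Z2 are independent Normal(0, 1) variables.

```latex
\begin{align*}
\operatorname{Cov}(U_2, V_2)
  &= \operatorname{Cov}(Z_1 - Z_2,\; Z_1 + Z_2) \\
  &= \operatorname{Var}(Z_1) + \operatorname{Cov}(Z_1, Z_2)
     - \operatorname{Cov}(Z_2, Z_1) - \operatorname{Var}(Z_2) \\
  &= 1 - 1 = 0 .
\end{align*}
```

Because (U2, V2) is a linear transformation of jointly normal variables, the pair is itself jointly normal, and for jointly normal variables zero covariance implies independence. The same covariance calculation gives 0 for the uniform case in question 1, but there the pair is not jointly normal, so zero covariance alone does not settle the question.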

Part Two: Exploring Distributions

1. We will use R to explore the chi-squared distribution for this set of questions.

a. Simulation 1:

i. Generate 1000 random numbers from the χ^2(1) distribution and call it X.
X <- rchisq(1000, 1)
ii. Create a density estimation of the pdf of X. Then make a histogram of X with density on the vertical axis, and graph both the histogram and density estimation on the same plot.
plot(density(X))
hist(X, freq = FALSE, add = TRUE)
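
Drawing the density first means the axis limits are set by the density curve, so tall histogram bars can be cut off. One possible alternative is sketched below: draw the histogram first with a y-axis limit that covers both the bars and the curve, then add the density estimate. The helper `overlay_hist_density` and the sample `x_demo` are hypothetical names, not part of the assignment.

```r
# Hypothetical helper: histogram on the density scale with a kernel density
# estimate drawn on top. The y-axis limit covers both the bars and the curve.
overlay_hist_density <- function(x, main = "Histogram with density estimate") {
  d <- density(x)
  h <- hist(x, plot = FALSE)
  hist(x, freq = FALSE, main = main,
       ylim = c(0, max(d$y, h$density)))
  lines(d, lwd = 2)
  invisible(list(density = d, hist = h))
}

set.seed(1)
x_demo <- rchisq(1000, df = 1)
res <- overlay_hist_density(x_demo)

# Optional: add the theoretical chi-squared(1) density for reference.
curve(dchisq(x, df = 1), from = 0.01, add = TRUE, lty = 2)
```

The same helper could be reused for the t and F samples later in the report.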

iii. Find the mean and variance of X and guess the formulas for the mean and variance of the χ^2(1) distribution. Attach your plot with the guessed formulas.
mean(X)
## [1] 0.9853859
var(X)
## [1] 2.12063

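The sample mean is the sum of the values of X divided by the sample size n, and the sample variance is \[s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2\] Here the sample mean is about 1 and the sample variance is about 2. My guess is that a χ^2(k) distribution has mean k and variance 2k, which gives mean 1 and variance 2 for χ^2(1).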

b. Simulation 2: Repeat steps (i) through (iii) generating data from the χ^2(10) distribution.

x_10<- rchisq(1000, 10)
plot(density(x_10))
hist(x_10, freq = FALSE, add = TRUE)

mean(x_10)
## [1] 10.02632
var(x_10)
## [1] 19.01786
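
With 10 degrees of freedom the sample mean is near 10 and the sample variance near 20, which matches the known result that a chi-squared distribution with k degrees of freedom has mean k and variance 2k. The sketch below checks that pattern over a few more values of k. The particular values in `dfs` and the larger sample size `n_draws` are arbitrary choices for illustration.

```r
set.seed(1)
dfs <- c(1, 2, 5, 10, 30)   # degrees of freedom to try (arbitrary choices)
n_draws <- 1e5              # larger than 1000 to reduce simulation noise

moments <- t(sapply(dfs, function(k) {
  draws <- rchisq(n_draws, df = k)
  c(df = k, sample_mean = mean(draws), sample_var = var(draws))
}))

# Compare against the formulas: mean = k, variance = 2k.
cbind(moments, formula_mean = dfs, formula_var = 2 * dfs)
```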

2. We will use R to explore the Student’s t distribution for this set of questions.

a. Simulation 1:

i. Generate 1000 random numbers from the Student’s t distribution with one degree of freedom and call it t1.
t1 <- rt(1000, df = 1)
ii. Make a histogram overlaid with the density distribution.
plot(density(t1))
hist(t1, freq = FALSE, add = TRUE)

iii. Generate 1000 random numbers from the Normal(0,1) distribution and call it z. Then generate 1000 random numbers from the χ^2(1) distribution and call it v.
z<- rnorm(1000)
v <- rchisq(1000, 1)
iv. Now take the ratio of the random normal data to the square root of the random chi-squared data divided by the degrees of freedom (in this case it’s 1). Then make a histogram overlaid with a density estimation.
t.ratio <- z / sqrt(v / 1)
plot(density(t.ratio))
hist(t.ratio,freq = FALSE, add = TRUE)

v. Compare the plots in part (ii) and (iv). Explain what you observe.

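Both plots have the same basic shape: a sharp peak near 0 with a few very extreme values in the tails. That is expected, because z/sqrt(v/1) follows the Student’s t distribution with 1 degree of freedom by construction. One plot may look lopsided or more stretched than the other, but that comes from a handful of extreme draws, which differ from sample to sample when the tails are this heavy. It is not skew, since the t distribution is symmetric about 0.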
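
Because a few extreme draws stretch the x-axis, density plots of t(1) samples are hard to compare by eye. The sketch below compares quartiles instead and then zooms in on a central window. The names `t_direct` and `t_built`, the larger sample size, and the window of (-10, 10) are assumptions made for illustration.

```r
set.seed(1)
n_draws <- 1e5
t_direct <- rt(n_draws, df = 1)
t_built  <- rnorm(n_draws) / sqrt(rchisq(n_draws, df = 1) / 1)

probs <- c(0.25, 0.5, 0.75)
quartiles <- rbind(
  direct      = quantile(t_direct, probs),
  constructed = quantile(t_built, probs),
  theoretical = qt(probs, df = 1)   # -1, 0, 1 for t(1), the standard Cauchy
)
quartiles

# Central window only. These are densities of the truncated samples, which is
# fine for comparing the two shapes against each other.
plot(density(t_direct[abs(t_direct) < 10]), main = "t(1): central region")
lines(density(t_built[abs(t_built) < 10]), lty = 2)
```

Quartiles are a more stable summary than the mean or variance here, since neither of those exists for the t distribution with 1 degree of freedom.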

b. Simulation 2:

i. Repeat steps (i) through (v) by generating data from the Student’s t distribution with 30 degrees of freedom and the χ^2(30) distribution.
t2 <- rt(1000, df = 30)
plot(density(t2))
hist(t2, freq = FALSE, add = TRUE)

z2 <- rnorm(1000)
v2 <- rchisq(1000, 30)
t2.ratio <- z2 / sqrt(v2 / 30)
plot(density(t2.ratio))
hist(t2.ratio, freq = FALSE, add = TRUE)
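
One way to make the comparison in step (v) for 30 degrees of freedom is a quantile-quantile plot of the direct t sample against the constructed ratio. This is a sketch that can be run on its own; `t30_direct` and `t30_built` are hypothetical names. If both samples come from the same distribution, the points should fall near the line y = x.

```r
set.seed(1)
t30_direct <- rt(1000, df = 30)
t30_built  <- rnorm(1000) / sqrt(rchisq(1000, df = 30) / 30)

# qqplot() sorts both samples and plots them against each other.
qq <- qqplot(t30_direct, t30_built,
             xlab = "rt(1000, df = 30)", ylab = "z / sqrt(v / 30)")
abline(0, 1, lty = 2)   # reference line y = x
```

With 30 degrees of freedom the tails are much lighter than with 1, so the two samples are easier to compare than in Simulation 1.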

c. Understanding quantiles

i. Do some online research on the meaning of quantiles. For instance, take the 0.95-quantile (also known as the 95th percentile). Explain its meaning in your own words.

Quantiles show levels in a distribution so that a proportion of values lie below that level. Taking the 95th percentile, for example, would show 95% of the data is below that point (in this case, below means to the left of that mark), with 5% remaining above.
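
The definition can be checked numerically. In this sketch, `pnorm` gives the probability below a point, so applying it to the 0.95-quantile should return 0.95; the simulated sample `z_demo` is a hypothetical name used to show the same idea with data.

```r
q95 <- qnorm(0.95)          # 0.95-quantile of the standard normal
pnorm(q95)                  # probability below q95; recovers 0.95

set.seed(1)
z_demo <- rnorm(1e5)
share_below <- mean(z_demo <= q95)   # empirical share below q95, near 0.95
share_below
quantile(z_demo, 0.95)               # sample 0.95-quantile, near q95
```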

ii. In R, the command qnorm(0.95) gives the 0.95-quantile of the standard normal distribution.
qnorm(0.95)
## [1] 1.644854
iii. Similarly, qt(0.95, d) gives the 0.95-quantile of the Student’s t distribution with df=d. Report the 0.95-quantiles of the Student’s t distributions with 1, 2, 3, 10, 20, and 30 degrees of freedom. Explain what you observe. How do they compare to the 0.95-quantile from the standard normal distribution?
# d=1
qt(.95, 1)
## [1] 6.313752
# d=2
qt(.95, 2)
## [1] 2.919986
# d=3
qt(.95, 3)
## [1] 2.353363
# d=10
qt(.95, 10)
## [1] 1.812461
# d=20
qt(.95, 20)
## [1] 1.724718
# d=30
qt(.95, 30)
## [1] 1.697261

I notice that as the degrees of freedom increase, the 0.95-quantile decreases, from 6.31 at d = 1 to 1.70 at d = 30, and gets closer and closer to the standard normal value of 1.645. The t quantiles are always a bit higher than the normal quantile because the t distribution has heavier tails, so you have to go farther out to have 95% of the distribution below that point.
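
The trend across degrees of freedom can be tabulated in one call, since `qt` is vectorized over `df`. In this sketch the values 100 and 1000 are extra degrees of freedom added for illustration; they are not part of the assignment.

```r
dfs <- c(1, 2, 3, 10, 20, 30, 100, 1000)   # last two added for illustration
t_q95 <- qt(0.95, df = dfs)
z_q95 <- qnorm(0.95)

# Gap between each t quantile and the standard normal quantile.
data.frame(df = dfs, t_q95 = t_q95, excess_over_normal = t_q95 - z_q95)
```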

3. We will use R to explore the F distribution for this set of questions.

a. Simulation 1:

i. Generate 1000 random numbers from the F(3, 7) distribution and call it u.
u <- rf(1000, 3, 7)
ii. Make a histogram overlaid with the density distribution.
plot(density(u))
hist(u, freq = FALSE, add = TRUE)

b. Simulation 2:

i. Generate 1000 random numbers from the F(3, 27) distribution and call it v.
v <- rf(1000, 3, 27)
ii. Make a histogram overlaid with the density distribution.
plot(density(v))
hist(v, freq = FALSE, add=TRUE)

c. Compare the distributions from F(3, 7) and F(3, 27). Explain what you observe.

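These two look similar in overall shape: both are right-skewed with most of the values between 0 and 2. They are not identical, though. With only 7 denominator degrees of freedom the right tail is heavier, so F(3, 7) produces more large values than F(3, 27). Increasing the second degrees of freedom pulls the distribution closer to 1 (the mean is d2/(d2 - 2), which is 1.4 for F(3, 7) and 1.08 for F(3, 27)) and thins out the tail.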
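
Differences between the two distributions are easier to see with the theoretical densities than with one random sample each. The sketch below overlays the two curves and compares an upper-tail probability. The cutoff of 4 and the plotting range of 0 to 8 are arbitrary choices for illustration.

```r
# Theoretical densities, so the comparison does not depend on a random sample.
# Note: df() is R's F density function.
curve(df(x, df1 = 3, df2 = 27), from = 0, to = 8, ylab = "density",
      main = "F(3, 7) vs F(3, 27)")
curve(df(x, df1 = 3, df2 = 7), from = 0, to = 8, add = TRUE, lty = 2)
legend("topright", legend = c("F(3, 27)", "F(3, 7)"), lty = c(1, 2))

# Upper-tail probabilities P(F > 4) and means d2 / (d2 - 2).
tail_prob <- c(F_3_7  = pf(4, 3, 7,  lower.tail = FALSE),
               F_3_27 = pf(4, 3, 27, lower.tail = FALSE))
f_mean <- c(F_3_7 = 7 / (7 - 2), F_3_27 = 27 / (27 - 2))
tail_prob
f_mean
```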

d. Use qf(.95, 3, 7) to report the 0.95-quantile of the F(3, 7) distribution. Compare this to the qf(.95, 7, 3). What do you observe?

# F(3, 7)
qf(.95, 3, 7)
## [1] 4.346831
# F(7, 3)
qf(.95, 7, 3)
## [1] 8.886743

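I notice that the 0.95-quantile of F(3, 7) is roughly half that of F(7, 3) (4.35 versus 8.89). The first argument is the numerator degrees of freedom and the second is the denominator degrees of freedom, so swapping them gives a different distribution. With only 3 denominator degrees of freedom, the denominator of the F ratio is often close to 0, which makes the right tail heavier and the upper quantile larger.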
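
The two orderings of the degrees of freedom are linked by a reciprocal relationship: if W follows F(3, 7), then 1/W follows F(7, 3). A minimal sketch of what that implies for quantiles, using hypothetical variable names:

```r
# If W ~ F(3, 7), then 1 / W ~ F(7, 3), so an upper quantile of one is the
# reciprocal of the matching lower quantile of the other.
q_73_upper <- qf(0.95, df1 = 7, df2 = 3)
q_37_lower <- qf(0.05, df1 = 3, df2 = 7)

c(qf_95_7_3 = q_73_upper, reciprocal_of_qf_05_3_7 = 1 / q_37_lower)
all.equal(q_73_upper, 1 / q_37_lower)
```

So the 0.95-quantile of F(7, 3) is tied to the 0.05-quantile of F(3, 7), not to its 0.95-quantile, which is why the two reported values are not simple mirror images of each other.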