Introduction

Your final is due by the end of the last week of class. You should post your solutions to your GitHub account or RPubs. You are also expected to make a short presentation via YouTube and post that recording to the board. This project will show off your ability to understand the elements of the class.

Problem 1

Using R, generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6.

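set.seed(123)
N <- runif(1, 8, 1000)           # pick N at random; any value >= 6 is allowed
n <- 10000                       # number of draws
X <- runif(n, min = 0, max = N)  # lower bound is 0 here, although the task states 1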
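
The task statement asks for values between 1 and N, while the call above starts the range at 0. A minimal sketch of a variant that follows the stated range literally is given here; the names `N_alt`, `n_alt`, and `X_alt` are hypothetical, and the value 100 is only an assumed example of N.

```r
# Hypothetical variant: uniform draws on [1, N], as the task statement words it.
set.seed(123)
N_alt <- 100                      # assumed example value; any N >= 6 is allowed
n_alt <- 10000                    # number of draws
X_alt <- runif(n_alt, min = 1, max = N_alt)

# Every draw should fall inside the requested range.
range(X_alt)
```

The outputs in the rest of this write-up were produced with `min = 0`, so they would differ under this variant.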

Then generate a random variable Y that has 10,000 random normal numbers with mean and standard deviation both equal to \(μ=σ=(N+1)/2\).

set.seed(123)
me <- (N+1)/2
sd <- me
Y <- rnorm(n, mean = me, sd = sd)

Probability

Calculate, at a minimum, probabilities a through c below. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.

(x <- median(X))
## [1] 145.0453
(y <- quantile(Y, 0.25)[[1]])
## [1] 48.85452
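
As a sanity check, the sample estimates can be compared with the theoretical quantiles of the two distributions: the median of a uniform variable on [a, b] is (a+b)/2, and the first quartile of a normal variable is μ + σ·qnorm(0.25). The sketch below regenerates the data with the same calls as above so that it runs on its own; the `*_theory` names are introduced here for illustration only.

```r
# Regenerate the data exactly as in the write-up so the block is self-contained.
set.seed(123)
N <- runif(1, 8, 1000)
n <- 10000
X <- runif(n, min = 0, max = N)
set.seed(123)
me <- (N + 1) / 2
Y <- rnorm(n, mean = me, sd = me)

# Sample estimates used in the write-up.
x <- median(X)
y <- quantile(Y, 0.25)[[1]]

# Theoretical counterparts (names are illustrative).
x_theory <- (0 + N) / 2                      # median of Uniform(0, N)
y_theory <- qnorm(0.25, mean = me, sd = me)  # first quartile of Normal(me, me)

# Side-by-side comparison; the two columns should be close for n = 10,000.
data.frame(sample = c(x = x, y = y), theory = c(x_theory, y_theory))
```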

a. P(X>x | X>y)

\[P(X>x \mid X>y) = \frac{P(X>145.0453 ,\ X>48.85452)}{P(X>48.85452)}\]

(p_a <- sum(X>x & X>y)/n)
## [1] 0.5
(p_xy <- sum(X>y)/n) 
## [1] 0.8318
(p_a/p_xy)
## [1] 0.601106

Given that X is greater than the first quartile of Y, the probability that X is also greater than its own median is about 0.60.
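
The same conditional probability can be read as a proportion within the subset of draws that satisfy the condition. A small sketch, regenerating the data so that it runs on its own (the names `p_ratio` and `p_subset` are illustrative):

```r
# Regenerate the data exactly as in the write-up.
set.seed(123)
N <- runif(1, 8, 1000)
n <- 10000
X <- runif(n, min = 0, max = N)
set.seed(123)
me <- (N + 1) / 2
Y <- rnorm(n, mean = me, sd = me)
x <- median(X)
y <- quantile(Y, 0.25)[[1]]

# Ratio form: P(X > x and X > y) / P(X > y)
p_ratio <- mean(X > x & X > y) / mean(X > y)

# Subset form: keep only draws with X > y, then take the share that also exceeds x
p_subset <- mean(X[X > y] > x)

c(ratio = p_ratio, subset = p_subset)
```

Both forms count the same draws, so they should agree up to floating-point error.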

b. P(X>x , Y>y)

\[P(X>x , Y>y) = P(X>145.0453 ,\ Y>48.85452)\]

(p_b <- sum(X>x & Y>y)/n)
## [1] 0.369

0.369 is the probability that X is greater than its median and, at the same time, Y is greater than its first quartile.

c. P(X<x | X>y)

\[P(X<x \mid X>y) = \frac{P(X<145.0453 ,\ X>48.85452)}{P(X>48.85452)}\]

(p_c <- sum(X<x & X>y)/n)
## [1] 0.3318
(p_xy <- sum(X>y)/n) 
## [1] 0.8318
(p_c/p_xy)
## [1] 0.398894

Given that X is greater than the first quartile of Y, the probability that X is less than its own median is about 0.40.
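
Parts a and c condition on the same event and split it into two complementary outcomes (X is continuous, so a tie with the median has probability zero). The two answers should therefore sum to one, which gives a quick check using the values computed above:

\[P(X<x \mid X>y) = 1 - P(X>x \mid X>y) = 1 - 0.601106 = 0.398894\]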

Investigating the formula

Investigate whether \(P(X>x \text{ and } Y>y)=P(X>x)P(Y>y)\) by building a table and evaluating the marginal and joint probabilities.

# Counts fill column by column: column 1 is X>x, column 2 is X<x
(res <- matrix(c(sum(X>x & Y<y), sum(X>x & Y>y),
                 sum(X<x & Y<y), sum(X<x & Y>y)), ncol = 2, nrow = 2))
##      [,1] [,2]
## [1,] 1310 1190
## [2,] 3690 3810
# Append a column of row totals, then a row of column totals
res <- cbind(res, c(res[1,1] + res[1,2], res[2,1] + res[2,2]))
res <- rbind(res, c(res[1,1] + res[2,1], res[1,2] + res[2,2], res[1,3] + res[2,3]))
results <- as.data.frame(res)
colnames(results) <- c("X>x", "X<x", "total")
rownames(results) <- c("Y<y", "Y>y", "total")

results

Probability matrix

(prob_tab <- results/n)
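
The same contingency table can also be built with base R's `table()`, `addmargins()`, and `prop.table()` instead of assembling the margins by hand. The sketch below is an alternative, not the method used above; it regenerates the data so that it runs on its own, and the factor labels are chosen here for illustration.

```r
# Regenerate the data exactly as in the write-up.
set.seed(123)
N <- runif(1, 8, 1000)
n <- 10000
X <- runif(n, min = 0, max = N)
set.seed(123)
me <- (N + 1) / 2
Y <- rnorm(n, mean = me, sd = me)
x <- median(X)
y <- quantile(Y, 0.25)[[1]]

# Cross-tabulate the two indicator variables (rows: Y, columns: X).
tab <- table(Y = ifelse(Y > y, "Y>y", "Y<y"),
             X = ifelse(X > x, "X>x", "X<x"))

counts <- addmargins(tab)              # counts with row and column totals
probs  <- addmargins(prop.table(tab))  # joint and marginal probabilities

counts
probs
```

`table()` sorts the labels, so the row and column order may differ from the hand-built table, and the totals are labelled `Sum` instead of `total`.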

Check whether \(P(X>x \text{ and } Y>y)=P(X>x)P(Y>y)\).

Right side: \(P(X>x)P(Y>y)\), using the marginal probabilities from the table:

(round(0.5*0.75, 2))
## [1] 0.38

Left side: \(P(X>x \text{ and } Y>y)\) from the table is 0.369, close to the right side's 0.375 (0.38 after rounding).

Since the two values are close, the table suggests that X and Y are independent variables; this is checked formally below.
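
Independence implies the product rule for every cell, not only for the (X>x, Y>y) cell. A sketch that compares all four joint probabilities with the products of their marginals, using the counts printed in the 2×2 table above (the object names are illustrative):

```r
# Counts copied from the 2x2 table printed above (rows: Y<y, Y>y; columns: X>x, X<x).
counts <- matrix(c(1310, 3690, 1190, 3810), nrow = 2,
                 dimnames = list(c("Y<y", "Y>y"), c("X>x", "X<x")))

joint <- counts / sum(counts)                      # observed joint probabilities
expected <- outer(rowSums(joint), colSums(joint))  # products of the marginals

joint
expected
joint - expected   # differences that would be zero under exact independence
```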

Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?

Fisher test

fisher.test(results,simulate.p.value=TRUE)
## 
##  Fisher's Exact Test for Count Data with simulated p-value (based
##  on 2000 replicates)
## 
## data:  results
## p-value = 0.1079
## alternative hypothesis: two.sided

Chi-Square Test

chisq.test(results, correct=TRUE)
## 
##  Pearson's Chi-squared test
## 
## data:  results
## X-squared = 7.68, df = 4, p-value = 0.104
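
The reported df = 4 indicates that the table passed to the tests was 3×3, i.e. that it included the total row and column. A test of independence between the two indicators is normally run on the four inner cells only, which gives df = 1. The sketch below shows that variant using the counts printed earlier; the object names are illustrative, and it is offered as a cross-check, not as the analysis performed above.

```r
# Inner 2x2 counts from the table above, without the total row and column.
inner <- matrix(c(1310, 3690, 1190, 3810), nrow = 2,
                dimnames = list(c("Y<y", "Y>y"), c("X>x", "X<x")))

fisher_inner <- fisher.test(inner)                 # exact test on the 2x2 table
chisq_inner  <- chisq.test(inner, correct = TRUE)  # Yates-corrected chi-square, df = 1

fisher_inner
chisq_inner
```

Because the degrees of freedom differ, the resulting p-values need not match those reported above.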

Comparison

Fisher’s exact test checks the association between two categorical variables and is intended for small cell sizes (expected counts below 5). The chi-square test relies on a large-sample approximation, so it is used when the expected cell counts are large. If the sample is small (or any expected cell count is below 5), Fisher’s exact test should be used; otherwise the two tests give very similar answers. Here every cell count is above 1,000, so the chi-square test is the appropriate choice.
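
The rule of thumb above can be checked directly, since `chisq.test()` returns the expected cell counts under independence. A short sketch on the inner 2×2 counts from the table printed earlier (object names are illustrative):

```r
# Inner 2x2 counts from the table printed earlier (rows: Y<y, Y>y; columns: X>x, X<x).
inner <- matrix(c(1310, 3690, 1190, 3810), nrow = 2,
                dimnames = list(c("Y<y", "Y>y"), c("X>x", "X<x")))

expected_counts <- chisq.test(inner)$expected  # expected counts under independence
expected_counts

# TRUE means no expected count is below 5, so the chi-square approximation is reasonable.
all(expected_counts >= 5)
```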