Your final is due by the end of the last week of class. You should post your solutions to your GitHub account or RPubs. You are also expected to make a short presentation via YouTube and post that recording to the board. This project will show off your ability to understand the elements of the class.
Using R, generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6.
Then generate a random variable Y that has 10,000 random normal numbers with mean and standard deviation \(μ=σ=(N+1)/2\).
At a minimum, calculate probabilities a through c below. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.
## [1] 145.0453
## [1] 48.85452
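The two values printed above are x (the median of X) and y (the first quartile of Y). A minimal sketch of how such values could be produced follows. The choice N = 290 and the seed are assumptions, since the source does not state them, so the numbers it prints will differ from those above.

```r
# Sketch only: N and the seed are assumptions, not values from the source.
set.seed(1)
N <- 290                               # assumed; any N >= 6 meets the assignment
n <- 10000

X <- runif(n, min = 1, max = N)                      # uniform on [1, N]
Y <- rnorm(n, mean = (N + 1) / 2, sd = (N + 1) / 2)  # mu = sigma = (N + 1) / 2

x <- median(X)                         # small x: median of X
y <- unname(quantile(Y, 0.25))         # small y: first quartile of Y

c(x = x, y = y)
```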
\[P(X>x \mid X>y) = \frac{P(X>145.0453,\ X>48.85452)}{P(X>48.85452)}\]
## [1] 0.5
## [1] 0.8318
## [1] 0.601106
0.6 is the probability that X is greater than its median given that X is greater than the first quartile of Y.
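A conditional probability of this kind can be estimated directly from the simulated values as the share of draws satisfying both events divided by the share satisfying the conditioning event. The sketch below is one way to do that; the helper name `cond_prob`, N = 290, and the seed are all assumptions made for illustration.

```r
# Empirical P(A | B) from two logical vectors (hypothetical helper).
cond_prob <- function(A, B) mean(A & B) / mean(B)

set.seed(1)                            # assumed seed
N <- 290                               # assumed N
X <- runif(10000, 1, N)
Y <- rnorm(10000, (N + 1) / 2, (N + 1) / 2)
x <- median(X)
y <- unname(quantile(Y, 0.25))

p_a <- cond_prob(X > x, X > y)         # P(X > x | X > y)
p_c <- cond_prob(X < x, X > y)         # P(X < x | X > y)
c(p_a = p_a, p_c = p_c)
```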
\[P(X>x , Y>y) = P(X>145.0453,\ Y>48.85452)\]
## [1] 0.369
0.369 is the probability that X is greater than its median and, at the same time, Y is greater than its first quartile.
\[P(X<x \mid X>y) = \frac{P(X<145.0453,\ X>48.85452)}{P(X>48.85452)}\]
## [1] 0.3318
## [1] 0.8318
## [1] 0.398894
0.40 is the probability that X is less than its median given that X is greater than the first quartile of Y.
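As a consistency check, the events \(X>x\) and \(X<x\) split the event \(X>y\) into two parts (X is continuous, so \(X=x\) has probability zero). The two conditional probabilities reported above should therefore sum to 1, and the numerators should sum to the shared denominator:

\[P(X>x \mid X>y) + P(X<x \mid X>y) = \frac{P(X>x,\ X>y) + P(X<x,\ X>y)}{P(X>y)} = 1\]

\[0.601106 + 0.398894 = 1, \qquad 0.5 + 0.3318 = 0.8318\]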
Investigate whether \(P(X>x \quad and \quad Y>y)=P(X>x)P(Y>y)\) by building a table and evaluating the marginal and joint probabilities.
# Rows: Y<y, Y>y; columns: X>x, X<x (matrix() fills column by column)
(res <- matrix(c(sum(X>x & Y<y),sum(X>x & Y>y), sum(X<x & Y<y),sum(X<x & Y>y)), ncol = 2, nrow = 2))
## [,1] [,2]
## [1,] 1310 1190
## [2,] 3690 3810
# Append the row totals as a third column
res <- cbind(res,c(res[1,1] + res[1,2], res[2,1] + res[2,2]))
# Append the column totals as a third row
res <- rbind(res,c(res[1,1] + res[2,1], res[1,2] + res[2,2], res[1,3] + res[2,3]))
(results <- as.data.frame(res))
Check whether \(P(X>x \quad and \quad Y>y)=P(X>x)P(Y>y)\). Right side: \(P(X>x)P(Y>y)\), which from the table is
## [1] 0.38
Left side: \(P(X>x \quad and \quad Y>y)\), which from the table is 0.369, close to 0.38.
Since the two values are so close, X and Y appear to be independent variables.
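The same comparison can be made with a labelled table, which makes it easier to see which cell is which. The sketch below uses `table()`, `addmargins()`, and `prop.table()` from base R; N = 290 and the seed are assumptions, so the counts will not match the ones printed above.

```r
set.seed(1)                            # assumed seed
N <- 290                               # assumed N
X <- runif(10000, 1, N)
Y <- rnorm(10000, (N + 1) / 2, (N + 1) / 2)
x <- median(X)
y <- unname(quantile(Y, 0.25))

# 2 x 2 table of counts with labelled rows and columns.
counts <- table(
  X = ifelse(X > x, "X > x", "X <= x"),
  Y = ifelse(Y > y, "Y > y", "Y <= y")
)

addmargins(counts)                     # counts plus row and column totals
probs <- prop.table(counts)            # joint probabilities

joint   <- probs["X > x", "Y > y"]                         # P(X > x, Y > y)
product <- sum(probs["X > x", ]) * sum(probs[, "Y > y"])   # P(X > x) P(Y > y)
c(joint = joint, product = product)
```

Keeping the totals out of `counts` and adding them only for display means the same object can be passed to a test of independence later.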
Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?
##
## Fisher's Exact Test for Count Data with simulated p-value (based
## on 2000 replicates)
##
## data: results
## p-value = 0.1079
## alternative hypothesis: two.sided
##
## Pearson's Chi-squared test
##
## data: results
## X-squared = 7.68, df = 4, p-value = 0.104
“Fisher’s exact test” tests the association between two categorical variables by computing an exact p-value, so it remains valid when cell sizes are small (expected values less than 5). The Chi-square test relies on a large-sample approximation and is used when the cell sizes are expected to be large. If the sample size is small (or expected cell sizes are below 5), Fisher’s exact test should be used. With large cell sizes, the two tests give very similar answers. Here every cell count is above 1,000, so the Chi-square test is appropriate.
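The chi-square output above reports df = 4, which is what `chisq.test` gives for a 3 x 3 input, so the tests appear to have been run on `results` with the totals row and column included. A 2 x 2 table has df = 1. A minimal sketch of running both tests on the four cell counts alone, assuming the counts from the table printed earlier:

```r
# Counts copied from the 2 x 2 table printed earlier (totals excluded).
counts <- matrix(c(1310, 3690, 1190, 3810), nrow = 2,
                 dimnames = list(Y = c("Y < y", "Y > y"),
                                 X = c("X > x", "X < x")))

chi  <- chisq.test(counts)             # Pearson test; continuity correction is the 2 x 2 default
fish <- fisher.test(counts)            # exact test; also reports an odds ratio

chi$parameter                          # degrees of freedom: 1 for a 2 x 2 table
c(chi_p = chi$p.value, fisher_p = fish$p.value)
```

For reference, the 0.01 critical value of the chi-square distribution with 1 degree of freedom is about 6.63, below the statistic of 7.68 reported above, so the conclusion about independence may change once the totals are excluded.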