Using R, generate a random variable \(X\) that has 10,000 random uniform numbers from 1 to \(N\), where \(N\) can be any number of your choosing greater than or equal to 6. Then generate a random variable \(Y\) that has 10,000 random normal numbers with mean and standard deviation \(\mu = \sigma = (N+1)/2\).
library(knitr)       # kable(), for the tables below
library(kableExtra)  # kable_styling() and the %>% pipe

N <- 20

# Random variables X and Y
X <- runif(10000, min = 1, max = N)                  # 10,000 uniform draws on [1, N]
Y <- rnorm(10000, mean = (N + 1)/2, sd = (N + 1)/2)  # 10,000 normal draws with mu = sigma = (N+1)/2
Probability. Calculate, at a minimum, the probabilities (a) through (c) below. Assume the lowercase “x” is estimated as the median of the X variable, and the lowercase “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.
x <- median(X)
y <- quantile(Y,0.25,names=FALSE)
a. \(P(X > x | X > y)\)
Here we evaluate the probability that \(X\) is greater than its median value \(x\), given that we already know \(X\) is greater than \(Y\)’s first quartile \(y\).
We need to find \(P(X > x | X > y) = \frac{P(X > x \text{ and } X > y)}{P(X > y)}\)
a  <- sum(X > y)          # number of draws with X > y
ab <- sum(X > y & X > x)  # number of draws with both X > x and X > y
ab / a                    # estimate of P(X > x | X > y)
## [1] 0.7104291
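Equivalently, the same conditional probability can be read off in one line (this simply re-expresses the ratio above, so it returns the identical value):

mean(X[X > y] > x)  # proportion of the X > y draws that also exceed x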
b. \(P(X > x, Y > y)\)
Since these random variables are independent of one another, we can multiply the probabilities. We know each probability by construction: half of the values in \(X\) lie above the median, and 75% of the values in \(Y\) lie above the first quartile.
\[ P(X > x) = 0.5, \qquad P(Y > y) = 0.75 \]
\[ P(X > x, Y > y) = P(X > x)\,P(Y > y) = (0.5)(0.75) = 0.375 \]
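As a quick empirical sanity check (the estimate is random, but it should land close to 0.375):

mean(X > x & Y > y)  # empirical estimate of P(X > x, Y > y)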
c. \(P(X < x | X > y)\)
Similar to “a” above, we are evaluating a conditional probability. Given that \(X\) is above \(Y\)’s first quartile, what is the probability of \(X\) being below its own median, \(x\)? Since the conditioning event is the same as in “a” and \(X\) is continuous (so \(P(X = x) = 0\)), this probability must be the complement of the answer to “a”.
a  <- sum(X > y)          # number of draws with X > y
ab <- sum(X > y & X < x)  # number of draws with X < x and X > y
ab / a                    # estimate of P(X < x | X > y)
## [1] 0.2895709
Investigate whether \(P(X>x \text{ and } Y>y) = P(X>x)P(Y>y)\) by building a table and evaluating the marginal and joint probabilities.
tableX <- table(X > x)
kable(tableX, format='html', col.names=c("X > x","Count")) %>%
kable_styling("striped", full_width = FALSE)
| X > x | Count |
|-------|------:|
| FALSE |  5000 |
| TRUE  |  5000 |
tableY <- table(Y > y)
kable(tableY, format='html', col.names=c("Y > y","Count")) %>%
kable_styling("striped", full_width = FALSE)
| Y > y | Count |
|-------|------:|
| FALSE |  2500 |
| TRUE  |  7500 |
tableXY <- table(X > x, Y > y)
kable(tableXY, format='html') %>%
kable_styling("striped", full_width = FALSE)
|               | Y > y = FALSE | Y > y = TRUE |
|---------------|--------------:|-------------:|
| X > x = FALSE |          1262 |         3738 |
| X > x = TRUE  |          1238 |         3762 |
Looking at the joint table, the marginal probabilities change very little when the other random variable is taken into consideration: under exact independence we would expect cell counts of 1250, 3750, 1250, and 3750, and the observed counts (1262, 3738, 1238, 3762) are very close. In particular, the empirical joint probability is \(P(X > x \text{ and } Y > y) = 3762/10000 = 0.3762\), nearly identical to the product of the marginals, \((0.5)(0.75) = 0.375\). This supports the belief that these are independent random variables.
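The same comparison can be made directly from the tables already built (a quick sketch: the observed joint proportions should roughly match the outer product of the marginal proportions):

prop.table(tableXY)                            # observed joint proportions
outer(prop.table(tableX), prop.table(tableY))  # product of marginals, i.e. expected under independence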
Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?
Fisher’s Exact Test takes the same approach as eyeballing the marginal totals, but in an (obviously) formal and exact manner: holding the margins of the table fixed, it computes the exact probability of obtaining an arrangement of counts at least as extreme as the one observed, under the null hypothesis of independence.
fisher.test(tableXY)
##
## Fisher's Exact Test for Count Data
##
## data: tableXY
## p-value = 0.5953
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.9361461 1.1243231
## sample estimates:
## odds ratio
## 1.025936
The results show little evidence for rejecting the null hypothesis of independence.
chisq.test(tableXY)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: tableXY
## X-squared = 0.28213, df = 1, p-value = 0.5953
The chi-square test gives the same p-value. This is expected: the chi-square test is a large-sample approximation to the exact test, and with a sample this large the two agree almost perfectly.
As to which test is most appropriate, Fisher’s Exact Test is, well…exact, so it is useful in that regard. It has typically been used with smaller sample sizes, where the chi-square approximation becomes unreliable (a common rule of thumb is that every expected cell count should be at least 5). In this instance, given 10,000 observations per variable, there is no need to worry about sample size, and either test is appropriate. Otherwise, both tests have similar assumptions (e.g., each observation contributes to exactly one cell of the table).
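To illustrate the small-sample point, here is a toy 2×2 table with made-up counts (hypothetical data, not drawn from the variables above). On a table this sparse, chisq.test() warns that its approximation may be incorrect, while fisher.test() still returns an exact p-value.

# Hypothetical sparse table (made-up counts, for illustration only)
toy <- matrix(c(3, 1, 1, 3), nrow = 2)
fisher.test(toy)$p.value  # exact p-value
chisq.test(toy)$p.value   # warns: "Chi-squared approximation may be incorrect"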