Data 605 Final Project

Using R, generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6. Then generate a random variable Y that has 10,000 random normal numbers with a mean and standard deviation of \(\mu=\sigma=(N+1)/2\).

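set.seed(123)  # fix the seed so the results are reproducible

N = 9

X = runif(10000, 1, N)                      # uniform on [1, N]
Y = rnorm(10000, (N+1)/2, (N+1)/2)          # normal with mean = sd = (N+1)/2
x = median(X)                               # x: median of X
y = quantile(Y, 0.25)[[1]]                  # y: first quartile of Y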
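
As a quick sanity check on the simulated values, the sample quantiles can be compared with their theoretical counterparts. The sketch below repeats the setup so that it runs on its own. The names x_theory and y_theory are my own, and it assumes X is treated as a continuous uniform on [1, N], as in the code above.

```r
set.seed(123)
N <- 9
X <- runif(10000, 1, N)
Y <- rnorm(10000, (N + 1) / 2, (N + 1) / 2)
x <- median(X)
y <- quantile(Y, 0.25)[[1]]

# Theoretical median of Uniform(1, N) and theoretical first quartile
# of Normal(mean = (N + 1) / 2, sd = (N + 1) / 2)
x_theory <- qunif(0.5, 1, N)
y_theory <- qnorm(0.25, (N + 1) / 2, (N + 1) / 2)

# Sample estimates next to the theoretical values
round(c(x = x, x_theory = x_theory, y = y, y_theory = y_theory), 3)
```

With 10,000 draws the sample values should sit close to the theoretical ones, which is what justifies reading x as roughly the 50% point of X and y as roughly the 25% point of Y in the parts that follow.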

Probability. Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.

a. P(X>x | X>y)

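The probability that X exceeds its median x (about 5.0), given that X exceeds y, the first quartile of Y. Because y lies below x, the condition only removes the lowest values of X, so the result should be somewhat above 0.5.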

paste0('P(X>x | X>y) = ',mean(X[X>y]>x))

## [1] "P(X>x | X>y) = 0.546567555749891"
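
The conditional probability can also be written as a ratio, P(X>x and X>y) / P(X>y), and compared with the value implied by the uniform distribution. The following is a minimal sketch of that cross-check, repeating the setup so it runs standalone. The variable names p_cond_direct, p_cond_ratio, and p_cond_theory are my own.

```r
set.seed(123)
N <- 9
X <- runif(10000, 1, N)
Y <- rnorm(10000, (N + 1) / 2, (N + 1) / 2)
x <- median(X)
y <- quantile(Y, 0.25)[[1]]

# Direct estimate: restrict to the draws with X > y, then take the share above x
p_cond_direct <- mean(X[X > y] > x)

# Same quantity from the definition P(A | B) = P(A and B) / P(B)
p_cond_ratio <- mean(X > x & X > y) / mean(X > y)

# Value implied by Uniform(1, N): P(X > max(x, y)) / P(X > y)
p_cond_theory <- punif(max(x, y), 1, N, lower.tail = FALSE) /
  punif(y, 1, N, lower.tail = FALSE)

c(direct = p_cond_direct, ratio = p_cond_ratio, theory = p_cond_theory)
```

The first two numbers should agree up to floating-point rounding, and the third should be close to them, since it only swaps the sample proportions for the uniform distribution's probabilities.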

b. P(X>x, Y>y)

The probability that X > 5 and Y > 1.56 both hold. X and Y are generated independently, so this is just P(X>x)P(Y>y). Since x is the median of X and y is the first quartile of Y, this should be about 0.5 × 0.75 = 0.375. Let’s check if that is what we get:

paste0('P(X>x, Y>y) = ',mean(X>x & Y>y))

## [1] "P(X>x, Y>y) = 0.3756"

c. P(X<x | X>y)

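The probability that X falls below its median x (about 5.0), given that X exceeds y, the first quartile of Y. This is the complement of part a, so it should be somewhat below 0.5.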

paste0('P(X<x | X>y) = ',mean(X[X>y]<x))

## [1] "P(X<x | X>y) = 0.453432444250109"
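
Parts a and c condition on the same event, so their probabilities should add up to 1 as long as no sampled value equals x exactly. A small sketch of that check, repeating the setup so it runs standalone (X_given, p_a, p_c, and p_tie are my own names):

```r
set.seed(123)
N <- 9
X <- runif(10000, 1, N)
Y <- rnorm(10000, (N + 1) / 2, (N + 1) / 2)
x <- median(X)
y <- quantile(Y, 0.25)[[1]]

# Keep only the draws that satisfy the condition X > y
X_given <- X[X > y]

p_a   <- mean(X_given > x)   # part a
p_c   <- mean(X_given < x)   # part c
p_tie <- mean(X_given == x)  # share of draws exactly equal to x

# The three shares partition the conditional sample
c(a = p_a, c = p_c, tie = p_tie, total = p_a + p_c + p_tie)
```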

Investigate whether P(X>x and Y>y)=P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities.

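After looking at the next part, I realized that I had misunderstood this question. I have left my first attempt in since the work was already done. It checks whether the joint probability matches the product of the marginal probabilities (the 'marginal' column) at several different values of x and y.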

percent_list = c(0.2,0.4,0.6,0.8)
x_list = round(quantile(X,percent_list), 2)
y_list = round(quantile(Y,percent_list), 2)

prob_table = expand.grid(x_list,y_list)
names(prob_table) = c('x','y')

prob_table$marginal = apply(prob_table, 1, function(pt) round(mean(X>pt['x'])*mean(Y>pt['y']),3))
prob_table$joint = apply(prob_table, 1, function(pt) round(mean(X>pt['x'] & Y>pt['y']), 3))
prob_table

Looks pretty close to me!

I think this is actually what I was supposed to do:

cont_table = matrix(
    c(
        sum(X>x & Y>y),
        sum(X<=x & Y>y),
        sum(X>x & Y<=y),
        sum(X<=x & Y<=y)
    ),
    nrow = 2, ncol = 2, byrow = TRUE,
    dimnames = list(c('Ytrue','Yfalse'), c('Xtrue','Xfalse'))
)
cont_table

##        Xtrue Xfalse
## Ytrue   3756   3744
## Yfalse  1244   1256
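
To read the marginal and joint probabilities off this table, the counts can be turned into proportions and compared with the product of the margins. This is a sketch of one way to do it. The counts are copied from the output above, and the names joint, p_Y, p_X, and expected_joint are my own.

```r
# Counts copied from the contingency table output above
cont_table <- matrix(
  c(3756, 3744,
    1244, 1256),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("Ytrue", "Yfalse"), c("Xtrue", "Xfalse"))
)

joint <- prop.table(cont_table)    # joint probabilities (cells sum to 1)
p_Y <- rowSums(joint)              # marginals: P(Y>y), P(Y<=y)
p_X <- colSums(joint)              # marginals: P(X>x), P(X<=x)
expected_joint <- outer(p_Y, p_X)  # joint probabilities if X and Y were independent

addmargins(joint)                  # joint table with marginal totals
round(joint - expected_joint, 4)   # gap between observed and independent joint
```

Under independence each cell of joint should be close to the matching cell of expected_joint. For the top-left cell this is 0.3756 against 0.75 × 0.5 = 0.375.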

Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?

fisher.test(cont_table)

## 
##  Fisher's Exact Test for Count Data
## 
## data:  cont_table
## p-value = 0.7995
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.9242273 1.1100187
## sample estimates:
## odds ratio 
##   1.012883

chisq.test(cont_table)

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  cont_table
## X-squared = 0.064533, df = 1, p-value = 0.7995

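Both tests give a high p-value (0.7995), so we cannot reject the null hypothesis that the two events are independent. The difference between them is that Fisher’s Exact Test computes an exact p-value from the hypergeometric distribution of the table given its margins, while the Chi Square Test relies on a large-sample approximation that is usually considered reliable only when the expected cell counts are reasonably large (a common rule of thumb is at least 5). As for which is more appropriate, I think both are fine here. The sample is large enough for the chi-square approximation, and although Fisher’s test is more computationally intensive, it still runs in moments in R.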
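
The sample-size point can be checked by looking at the expected cell counts that the chi-square test uses. A minimal sketch, with the counts copied from the contingency table output above (chi and fish are my own names):

```r
# Counts copied from the contingency table output above
cont_table <- matrix(
  c(3756, 3744,
    1244, 1256),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("Ytrue", "Yfalse"), c("Xtrue", "Xfalse"))
)

# chisq.test applies Yates' continuity correction to 2x2 tables by default
chi <- chisq.test(cont_table)
fish <- fisher.test(cont_table)

chi$expected                                        # expected counts under independence
c(chi_square = chi$p.value, fisher = fish$p.value)  # p-values side by side
```

All four expected counts are in the thousands, far above the usual threshold of 5, which is why the approximate and exact tests give practically the same p-value here.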