Problem 1

Using R, generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6. Then generate a random variable Y that has 10,000 random normal numbers with a mean of μ = σ = (N+1)/2, so the standard deviation equals the mean.

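#Fix the seed so the simulation is reproducible
set.seed(999)
#a is the lower bound of the uniform distribution, n is the upper bound N
a = 1
n = 12
X = runif(10000, min = a, max = n)
#Y has mean and standard deviation both equal to (n+1)/2
Y = rnorm(10000, mean = (n+1)/2, sd = (n+1)/2)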

Probability.

5 points. Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.

a. P(X>x | X>y)
b. P(X>x, Y>y)
c. P(X<x | X>y)

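#Standard Deviation for X: (b-a)/sqrt(12) for a uniform distribution on [a, b], here with b = n
X_sd = (n-a)/sqrt(12)
#Standard Deviation for Y
Y_sd = (n+1)/2
Y_mu = (n+1)/2
#To calculate the 1st quartile of Y, we add the product of the z score and sd to the mean of Y.
#qnorm(.25) is negative, so y lies below the mean.
y = Y_mu + qnorm(.25)*Y_sd
#Theoretical median of X, the midpoint of [a, n]
x = ((n-a)/2)+a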
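The problem statement asks for x and y to be estimated from the samples, whereas the values above are theoretical. As a rough cross-check, assuming the same seed and parameters as the simulation above, a sketch like the following compares the two. The names `x_hat` and `y_hat` are introduced here for illustration only.

```r
# Regenerate the samples so this block is self-contained
set.seed(999)
a = 1
n = 12
X = runif(10000, min = a, max = n)
Y = rnorm(10000, mean = (n + 1)/2, sd = (n + 1)/2)

# Theoretical values: median of Uniform(a, n) and 1st quartile of the normal
x = (a + n)/2
y = (n + 1)/2 + qnorm(.25) * (n + 1)/2

# Sample-based estimates (hypothetical names)
x_hat = median(X)
y_hat = unname(quantile(Y, 0.25))

print(c(x = x, x_hat = x_hat))
print(c(y = y, y_hat = y_hat))
```

With 10,000 draws the sample estimates should land close to the theoretical values, so either choice should lead to similar probabilities.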

\[\text{Conditional probability:} \quad P(F \mid E) = \frac{P(F \cap E)}{P(E)}\]

\[\text{Hypergeometric probability mass function:} \quad f(x) = \frac{\binom{n_1}{x} \binom{n_3 - n_1}{n_2 - x}}{\binom{n_3}{n_2}}\]

\[\text{Standard deviation of the uniform distribution:} \quad \sigma = \frac{b-a}{\sqrt{12}}\]

a. P(X>x | X>y)

We note that y<x (here x = 6.5 and y ≈ 2.12), therefore the event X>x is a subset of the event X>y, and the intersection of the two events is X>x itself. Thus, P(X>x | X>y) = P(X>x, X>y)/P(X>y) = P(X>x)/P(X>y)

#We can apply the conditional probability formula
#For X uniform on [a, n], P(X>t) = (n-t)/(n-a); the common factor cancels in the ratio
P_a = round((  ((n-x)/(n-a)) / ((n-y)/(n-a))  ),2)
print("This expression describes the probability that X is larger than x, given that it is larger than y")
## [1] "This expression describes the probability that X is larger than x, given that it is larger than y"
print(paste0("P(X>x | X>y)= ", P_a))
## [1] "P(X>x | X>y)= 0.56"
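Written out with the uniform tail probability P(X>t) = (n-t)/(n-a), and taking a = 1, n = 12, x = 6.5, and y ≈ 2.116 from the code above, the calculation can be traced step by step:

\[P(X>x \mid X>y) = \frac{P(X>x)}{P(X>y)} = \frac{(n-x)/(n-a)}{(n-y)/(n-a)} = \frac{n-x}{n-y} = \frac{12-6.5}{12-2.116} \approx 0.56\]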

b. P(X>x, Y>y)

This expression describes the event that X is larger than x and Y is larger than y. Because X and Y are generated independently, P(X>x, Y>y) = P(X>x) * P(Y>y). With P(X>x) = .5 (x is the median of X) and P(Y>y) = .75 (y is the 1st quartile of Y), this gives P(X>x, Y>y) = .5 * .75 = .375
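The theoretical value .375 can be compared against the simulated data. The following is a minimal sketch, assuming the same seed and parameters as the simulation at the top; `p_joint_hat` and `p_prod_hat` are hypothetical names introduced for this check.

```r
# Regenerate the samples so this block is self-contained
set.seed(999)
a = 1
n = 12
X = runif(10000, min = a, max = n)
Y = rnorm(10000, mean = (n + 1)/2, sd = (n + 1)/2)
x = (a + n)/2
y = (n + 1)/2 + qnorm(.25) * (n + 1)/2

# Proportion of simulated pairs with both X > x and Y > y
p_joint_hat = mean(X > x & Y > y)
# Product of the two simulated marginal proportions
p_prod_hat = mean(X > x) * mean(Y > y)

print(c(joint = p_joint_hat, product = p_prod_hat, theory = 0.375))
```

If the independence assumption is reasonable, the three numbers should agree up to sampling noise.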

c. P(X<x | X>y)

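This expression describes the chance of X being less than x, given that it is greater than y. Since y<x, the event is y<X<x, so the probability is (x-y)/(n-y). It is the complement of the probability in part a.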

P_c = round((x-y)/(n-y),2)
P_c
## [1] 0.44
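Both conditional probabilities can also be estimated directly from the simulated X values by keeping only the draws that satisfy the condition X>y. This is a sketch under the same seed and parameters as above; `X_given`, `p_a_hat`, and `p_c_hat` are names made up for the illustration.

```r
# Regenerate X so this block is self-contained
set.seed(999)
a = 1
n = 12
X = runif(10000, min = a, max = n)
x = (a + n)/2
y = (n + 1)/2 + qnorm(.25) * (n + 1)/2

# Keep only the draws that satisfy the condition X > y
X_given = X[X > y]

# Conditional proportions among those draws
p_a_hat = mean(X_given > x)   # estimate of P(X>x | X>y)
p_c_hat = mean(X_given < x)   # estimate of P(X<x | X>y)

print(round(c(p_a_hat, p_c_hat, p_a_hat + p_c_hat), 2))
```

The two estimates should sum to 1, since given X>y the draw is either above or below x.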

Joint/Marginal

5 points. Investigate whether P(X>x and Y>y)=P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities.

library(dplyr)
library(tidyr)
#Count the observations in each combination of X>x and Y>y
#tibble() replaces the deprecated data_frame()
df = tibble(X,Y) %>%
  mutate(Xtophalf = X>x, YtopHalf = Y>y) %>%
  group_by(Xtophalf, YtopHalf) %>%
  summarise(countx = n()) %>%
  pivot_wider(names_from = YtopHalf, values_from = countx) %>%
  rename(X_Y_Top_Half = 1)
print("As expected, the table of counts (rows: X>x, columns: Y>y) suggests that the variables are independent")
## [1] "As expected, the table of counts (rows: X>x, columns: Y>y) suggests that the variables are independent"
print(df)
## # A tibble: 2 x 3
## # Groups:   X_Y_Top_Half [2]
##   X_Y_Top_Half `FALSE` `TRUE`
##   <lgl>          <int>  <int>
## 1 FALSE           1307   3770
## 2 TRUE            1197   3726
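To evaluate the marginal and joint probabilities explicitly, the counts in the table can be converted to proportions. The sketch below copies the four counts by hand into a matrix; the object names (`counts`, `probs`, `p_X`, `p_Y`, `p_joint`) are assumptions made for this illustration.

```r
# Observed counts copied from the table above: rows are X > x, columns are Y > y
counts = matrix(c(1307, 1197, 3770, 3726), ncol = 2,
                dimnames = list(X_gt_x = c("FALSE", "TRUE"),
                                Y_gt_y = c("FALSE", "TRUE")))

probs = counts / sum(counts)      # joint proportions for all four cells
p_X = rowSums(probs)["TRUE"]      # marginal P(X > x)
p_Y = colSums(probs)["TRUE"]      # marginal P(Y > y)
p_joint = probs["TRUE", "TRUE"]   # joint P(X > x, Y > y)

print(c(joint = p_joint, product = unname(p_X * p_Y)))
```

With these counts the joint proportion is 3726/10000 = 0.3726, while the product of the marginals is 0.4923 * 0.7496 ≈ 0.3690. The two are close but not identical, which is what the formal tests in the next section assess.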

Independence

5 points. Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?

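#1st step is to construct the expected count matrix under independence
#It uses the theoretical marginals P(X>x) = .5 and P(Y>y) = .75 with 10,000 observations
exp_m = matrix(c(1250,1250,3750,3750), ncol = 2)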

\[f(x) = \frac{\binom{2504}{x} \binom{7496}{5077 - x}}{\binom{10000}{5077}}\]

#One-sided p-value: probability, under independence, of a count at least as large as the observed 1307
#dhyper(x, m, n, k): m = 2504 and n = 7496 are the column totals, k = 5077 is the first row total
p_value = sum(dhyper(1307:2504, 2504, 7496, 5077))
print("The null hypothesis is that the two events are independent. A p-value above the 0.05 significance level is too high to reject that hypothesis; in that case we have no reason to believe the variables are associated.")
## [1] "The null hypothesis is that the two events are independent. A p-value above the 0.05 significance level is too high to reject that hypothesis; in that case we have no reason to believe the variables are associated."
print(p_value)
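The question names Fisher's Exact Test and the Chi Square Test, and base R provides both as `fisher.test()` and `chisq.test()`. A minimal sketch, assuming the observed counts from the table in the Joint/Marginal section are entered by hand, might look like this; `counts`, `fisher_res`, and `chisq_res` are names chosen for the illustration.

```r
# Observed counts from the Joint/Marginal table: rows are X > x, columns are Y > y
counts = matrix(c(1307, 1197, 3770, 3726), ncol = 2,
                dimnames = list(X_gt_x = c("FALSE", "TRUE"),
                                Y_gt_y = c("FALSE", "TRUE")))

# Fisher's exact test (two-sided by default)
fisher_res = fisher.test(counts)

# Pearson chi-square test without the continuity correction
chisq_res = chisq.test(counts, correct = FALSE)

print(fisher_res$p.value)
print(chisq_res$p.value)

# Expected counts under independence, computed from the observed margins
print(chisq_res$expected)
```

Fisher's test computes an exact p-value from the hypergeometric distribution with the table margins held fixed, while the chi-square test relies on a large-sample approximation. With every cell count far above 5, the two should give similar p-values here, and the chi-square approximation is generally considered adequate at this sample size.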