Problem 1
Using R, generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6. Then generate a random variable Y that has 10,000 random normal numbers with a mean of mu=theta=(N+1)/2.
set.seed(999)
a = 1
n = 12
X = runif(10000, min = 1, max = n)
Y = rnorm(10000, mean = (n+1)/2 ,sd = (n+1)/2)
5 points. Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.
a. P(X>x | X>y)
b. P(X>x, Y>y)
c. P(X<x | X>y)
#Standard Deviation for X: (b-a)/sqrt(12) for a uniform distribution on [a, b]
X_sd = (n-a)/sqrt(12)
#Standard Deviation for Y
Y_sd = (n+1)/2
Y_mu = (n+1)/2
#The 1st quartile of Y is the mean plus qnorm(.25) standard deviations; qnorm(.25) is negative, so y lies below the mean.
y = Y_mu + qnorm(.25)*Y_sd
#The median of a uniform distribution on [a, n] is the midpoint of the interval.
x = ((n-a)/2)+a
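The values of x and y above come from the theoretical distributions rather than from the simulated data. As a rough sanity check, the sketch below regenerates the sample with the same seed and compares the theoretical values with the sample median and sample 1st quartile. The names `x_theory`, `y_theory`, `x_hat`, and `y_hat` are introduced here for illustration only.

```r
# Re-create the simulated data with the same seed and parameters as above
set.seed(999)
a <- 1
n <- 12
X <- runif(10000, min = a, max = n)
Y <- rnorm(10000, mean = (n + 1) / 2, sd = (n + 1) / 2)

# Theoretical median of X and theoretical 1st quartile of Y
x_theory <- (a + n) / 2
y_theory <- (n + 1) / 2 + qnorm(0.25) * (n + 1) / 2

# Sample-based estimates (hypothetical names, not used elsewhere in the document)
x_hat <- median(X)
y_hat <- unname(quantile(Y, 0.25))

print(c(x_theory = x_theory, x_hat = x_hat))
print(c(y_theory = y_theory, y_hat = y_hat))
```

With 10,000 draws the sample estimates should sit close to the theoretical values, so either choice is expected to give nearly the same probabilities.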
\[\text{Conditional Probability} \\ P(F|E)= \frac{P(F \cap E)}{P(E)} \]
\[\text{Hypergeometric Distribution} \\ f(x) = \frac{{n_1 \choose x} {n_3 - n_1\choose n_2 - x}}{{n_3 \choose n_2}} \]
\[\text{Standard Deviation of Uniform Distribution} \\ \sigma = \frac{b-a}{\sqrt{12}}\]
We note that y<x, so the event X>x is a subset of the event X>y, and the intersection of the two events is X>x itself. Thus, P(X>x | X>y) = P(X>x)/P(X>y).
#We can apply the conditional probability formula; for X uniform on [a, n], P(X>t) = (n-t)/(n-a)
P_a = round(( ((n-x)/(n-a)) / ((n-y)/(n-a)) ),2)
print("This expression describes the probability that X is larger than x, given that it is larger than y")
## [1] "This expression describes the probability that X is larger than x, given that it is larger than y"
print(paste0("P(X>x | X>y)= ", P_a))
## [1] "P(X>x | X>y)= 0.56"
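The closed-form value for part a can also be checked against the simulated sample. The following is a minimal sketch, assuming the same seed and parameters as above; `P_a_theory` and `P_a_sim` are names introduced here for illustration.

```r
# Re-create X with the same seed and parameters as above
set.seed(999)
a <- 1
n <- 12
X <- runif(10000, min = a, max = n)

# Theoretical median of X and theoretical 1st quartile of Y
x <- (a + n) / 2
y <- (n + 1) / 2 + qnorm(0.25) * (n + 1) / 2

# Closed form: P(X > x) / P(X > y), where the factor 1/(n - a) cancels
P_a_theory <- (n - x) / (n - y)

# Empirical estimate: among draws with X > y, the share that also exceed x
P_a_sim <- mean(X[X > y] > x)

print(round(c(theory = P_a_theory, simulated = P_a_sim), 3))
```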
This expression describes the probability that X is larger than x and Y is larger than y. Assuming X and Y are independent yields the result below.
P(X>x) = .5 and P(Y>y) = .75
P(X>x, Y>y) = P(X>x) * P(Y>y) = .375
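Because X and Y are generated by separate calls, the product rule for part b can be compared with the share of simulated pairs that satisfy both conditions. This is only an illustrative sketch under the same seed and parameters; `joint_theory` and `joint_sim` are hypothetical names.

```r
# Re-create both variables with the same seed and parameters as above
set.seed(999)
a <- 1
n <- 12
X <- runif(10000, min = a, max = n)
Y <- rnorm(10000, mean = (n + 1) / 2, sd = (n + 1) / 2)

# Theoretical median of X and theoretical 1st quartile of Y
x <- (a + n) / 2
y <- (n + 1) / 2 + qnorm(0.25) * (n + 1) / 2

# Product of the marginal probabilities under independence
joint_theory <- 0.5 * 0.75

# Share of simulated pairs with X > x and Y > y at the same time
joint_sim <- mean(X > x & Y > y)

print(c(theory = joint_theory, simulated = joint_sim))
```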
This expression describes the probability that X is less than x, given that it is greater than y. Since y<x, this equals P(y<X<x)/P(X>y), and the common factor 1/(n-a) cancels.
P_c = round((x-y)/(n-y),2)
P_c
## [1] 0.44
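As a worked check, the result for part c can be written out in the notation used above, with a and n the endpoints of the uniform distribution. Because X is continuous, the events X<x and X>x are complementary given X>y, so the answers to parts a and c should sum to 1.

\[P(X<x \mid X>y) = \frac{P(y<X<x)}{P(X>y)} = \frac{(x-y)/(n-a)}{(n-y)/(n-a)} = \frac{x-y}{n-y} \approx 0.44\]
\[P(X<x \mid X>y) = 1 - P(X>x \mid X>y) \approx 1 - 0.56 = 0.44\]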
5 points. Investigate whether P(X>x and Y>y)=P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities.
library(dplyr)
library(tidyr)
#Flag each observation by whether X exceeds x (the median) and whether Y exceeds y (the 1st quartile), then count each combination
df = tibble(X,Y) %>%
mutate( Xtophalf = X>x, YtopHalf= Y>y)%>%
group_by(Xtophalf, YtopHalf)%>%
summarise(countx = n())%>%
pivot_wider(names_from = YtopHalf, values_from = countx)%>%
rename(X_Y_Top_Half = 1)
print("As expected the joint probability table suggests that the variables are independent")
## [1] "As expected the joint probability table suggests that the variables are independent"
print(df)
## # A tibble: 2 x 3
## # Groups: X_Y_Top_Half [2]
## X_Y_Top_Half `FALSE` `TRUE`
## <lgl> <int> <int>
## 1 FALSE 1307 3770
## 2 TRUE 1197 3726
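To evaluate the marginal and joint probabilities, the counts in the table can be converted to proportions and compared with the products of the margins. The sketch below types in the four counts printed above; the object names (`counts`, `joint`, `p_X`, `p_Y`, `expected`) are introduced here for illustration.

```r
# Counts copied from the table above: rows are X > x, columns are Y > y
counts <- matrix(c(1307, 3770,
                   1197, 3726),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(X_gt_x = c("FALSE", "TRUE"),
                                 Y_gt_y = c("FALSE", "TRUE")))

# Joint probabilities: each cell divided by the total number of observations
joint <- counts / sum(counts)

# Marginal probabilities for X (rows) and Y (columns)
p_X <- rowSums(joint)
p_Y <- colSums(joint)

# Under independence each joint probability equals the product of its margins
expected <- outer(p_X, p_Y)

print(round(joint, 4))
print(round(expected, 4))
print(round(joint - expected, 4))
```

Small differences between `joint` and `expected` are consistent with independence; a formal test is still needed to judge whether they exceed sampling noise.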
5 points. Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?
#1st step is to construct the expected value matrix
exp_m = matrix(c(1250,1250,3750,3750), ncol = 2)
\[f(x) = \frac{{2504 \choose x}{7496 \choose 5077 - x}}{{10000 \choose 5077}}\]
#One-sided Fisher's exact test computed from the hypergeometric distribution.
#dhyper(x, m, n, k): x = count in the (FALSE, FALSE) cell, m = 2504 observations with Y<=y,
#n = 7496 observations with Y>y, k = 5077 observations with X<=x.
p_value = 0
for (i in 1307:2504){
p_value = p_value + dhyper(i, 2504, 7496, 5077)
}
print("The null hypothesis is that the two variables are independent. We reject it only if the p-value printed below is smaller than the chosen significance level, for example 0.05.")
## [1] "The null hypothesis is that the two variables are independent. We reject it only if the p-value printed below is smaller than the chosen significance level, for example 0.05."
print(p_value)
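Both tests asked for in the question are available in base R as `fisher.test()` and `chisq.test()`. Fisher's exact test computes an exact p-value from the hypergeometric distribution with the table margins held fixed, while the chi-square test relies on a large-sample approximation that is generally considered adequate when all expected counts are large, as they are here. The sketch below applies both to the counts printed in the table; it is an illustration under that assumption, and the object names are hypothetical.

```r
# Counts copied from the joint table: rows are X > x, columns are Y > y
counts <- matrix(c(1307, 3770,
                   1197, 3726),
                 nrow = 2, byrow = TRUE)

# Fisher's exact test (two-sided by default)
fisher_res <- fisher.test(counts)

# Chi-square test of independence (Yates continuity correction is the default for 2x2 tables)
chisq_res <- chisq.test(counts)

print(fisher_res$p.value)
print(chisq_res$p.value)

# Expected counts under independence, used to judge whether the approximation is reasonable
print(chisq_res$expected)
```

With a sample this large the two p-values are expected to be close, so the choice between the tests matters mainly for small tables.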