DATA605 Final Project

Problem 1

Bonnie Cooper


Computational Mathematics: Probability

Libraries Used:

Generating Random Numbers

Using ‘R’, generate a random variable \(X\) that has 10,000 random uniform numbers from 1 to \(N\), where \(N\) can be any number of your choosing greater than or equal to 6. Then generate a random variable \(Y\) that has 10,000 random normal numbers with a mean and standard deviation of \(\mu = \sigma = (N+1)/2\).
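
The code chunks are folded in the original document; the sketch below shows one way the two variables could be generated, assuming \(N = 13\) (consistent with the summary output further down) and a data frame named df (an assumed name):

set.seed(605)                    # any seed; the original seed, if any, is not shown
N <- 13                          # N >= 6; N = 13 is consistent with the summaries below
n <- 10000
X <- runif(n, min = 1, max = N)  # 10,000 uniform draws between 1 and N
Y <- rnorm(n, mean = (N + 1) / 2, sd = (N + 1) / 2)  # mu = sigma = (N+1)/2 = 7
df <- data.frame(X, Y)
head(df)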

##          X         Y
## 1 7.982340  8.217923
## 2 7.408306 16.656213
## 3 2.275263 -5.373129
## 4 4.919211 -3.260230
## 5 4.892818  9.081257
## 6 5.812661  2.911271

Take a moment to evaluate the variables:
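
A sketch of the checks that could produce the summaries below (variable names and rounding are assumptions; only the printed values come from the original output):

print(paste("length X:", length(df$X), " length Y:", length(df$Y)))
summary(df$X)
print(paste("(N+1)/2 =", (N + 1) / 2))
summary(df$Y)
print(paste("sigma(Y) =", round(sd(df$Y), 2)))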

## [1] "length X: 10000  length Y: 10000"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   3.937   7.069   7.022  10.024  13.000
## [1] "(N+1)/2 = 7"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -25.373   2.244   7.132   7.028  11.772  30.648
## [1] "sigma(Y) = 7.02"

Both \(X\) & \(Y\) have the expected length of 10,000, and their summary statistics are consistent with the target distributions: both are centered near \((N+1)/2 = 7\).

Probability

Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.
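
A sketch of how the small-letter x and y could be estimated from the simulated data (the original chunk is not shown; median() and quantile() are the standard base-R calls):

x <- median(df$X)                          # small x: the median of X
y <- unname(quantile(df$Y, probs = 0.25))  # small y: the 1st quartile of Y
cat("The median of X, x =", x, "\n")
cat("The 1st quartile of Y, y =", y, "\n")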

## The median of X, x = 7.06896666111425 
## The 1st quartile of Y, y = 2.2438171496078

a) \(P(X \gt x | X \gt y)\)

\(P(X \gt x | X \gt y) = \frac{P(X \gt x \cap X \gt y)}{P(X \gt y)}\)

The conditional probability, \(P(X \gt x | X \gt y) =\) 0.4173176. In words: given that a draw of \(X\) is already greater than \(y\) (the 1st quartile of \(Y\)), this is the probability that it is also greater than \(x\) (the median of \(X\)).
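
One way this conditional probability could be estimated empirically from the simulated draws (a sketch; the original code is not shown, so the object names are illustrative):

p_a <- sum(df$X > x & df$X > y) / sum(df$X > y)  # P(X > x and X > y) / P(X > y)
p_a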

b) \(P(X \gt x, Y \gt y)\)

The joint probability, \(P(X \gt x, Y \gt y) =\) 0.374. In words: this is the probability that \(X\) exceeds its median and, at the same time, \(Y\) exceeds its 1st quartile.
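
A sketch of the corresponding empirical estimate of the joint probability:

p_b <- mean(df$X > x & df$Y > y)  # proportion of draws with X > x and Y > y
p_b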

c) \(P(X \lt x | X \gt y )\)

\(P(X \lt x | X \lt y) = \frac{P(X \lt x \cap X \lt y)}{P(X \lt y)}\)

The conditional probability, \(P(X \lt x | X \lt y) =\) 1.194605
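
A sketch of an estimator for the formula as written above (illustrative only; the original code is not shown):

# Since y (about 2.24) is smaller than x (about 7.07), every draw with X < y
# also satisfies X < x.
p_c <- sum(df$X < x & df$X < y) / sum(df$X < y)
p_c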

Investigate whether \(P(X \gt x \mbox{ & } Y \gt y) = P(X \gt x) P(Y \gt y)\) by building a table and evaluating the marginal and joint probabilities.

The following code builds a joint probability table with marginal probabilities:
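
(The chunk itself is folded in the original; the sketch below, built with table() and prop.table(), is one way such a table could be assembled.)

counts <- table(df$X > x, df$Y > y)              # 2x2 counts of the four joint events
probs  <- prop.table(counts)                     # convert counts to joint probabilities
joint  <- cbind(probs, Total = rowSums(probs))   # append row totals (marginal of X)
joint  <- rbind(joint, Total = colSums(joint))   # append column totals (marginal of Y)
rownames(joint) <- c("P(X<=x)", "P(X>x)", "Total")
colnames(joint) <- c("P(Y<=y)", "P(Y>y)", "Total")
round(joint, 3)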

##         P(Y<=y) P(Y>y) Total
## P(X<=x)   0.124  0.376   0.5
## P(X>x)    0.126  0.374   0.5
## Total     0.250  0.750   1.0

From these values the two quantities can be compared:
\(P(X \gt x \mbox{ & } Y \gt y) =\) 0.374
\(P(X \gt x) P(Y \gt y) =\) 0.5 \(\cdot\) 0.75 = 0.375

The two values agree to within sampling error, so the statement \(P(X \gt x \mbox{ & } Y \gt y) = P(X \gt x) P(Y \gt y)\) holds approximately for this data; a good indication that \(X\) and \(Y\) are independent.

Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?
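
A sketch of how the contingency table and both tests could be run; the object name margins appears in the output, but its construction here is inferred from the counts shown below:

margins <- rbind(X = c(sum(df$X <= x), sum(df$X > x)),
                 Y = c(sum(df$Y <= y), sum(df$Y > y)))
colnames(margins) <- c("<=", ">")
margins
chisq.test(margins)   # Pearson's chi-squared test; Yates' continuity correction is the default for 2x2 tables
fisher.test(margins)  # Fisher's Exact Test for count data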

##     <=    >
## X 5000 5000
## Y 2500 7500
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  margins
## X-squared = 1332.3, df = 1, p-value < 2.2e-16
## 
##  Fisher's Exact Test for Count Data
## 
## data:  margins
## p-value < 2.2e-16
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  2.824333 3.186757
## sample estimates:
## odds ratio 
##          3

The p-values for both tests are very small (p-value < 2.2e-16), so we reject the null hypothesis and conclude that the two variables, \(X\) & \(Y\), show a statistically significant association. Both the \(\chi ^2\) test and Fisher’s Exact Test assess independence between categorical features. Fisher’s test computes an exact p-value, whereas the \(\chi ^2\) test relies on an approximation whose accuracy improves with sample size. Because the sample size here is very large (10,000 observations per variable), the \(\chi ^2\) test is the more appropriate choice for this application.