Libraries Used:
library( dplyr ) # provides %>%, filter() and mutate(), which the code below relies on
Using ‘R’, generate a random variable \(X\) that has 10,000 random uniform numbers from 1 to \(N\), where \(N\) can be any number of your choosing greater than or equal to 6. Then generate a random variable \(Y\) that has 10,000 random normal numbers with a mean of \(\mu = \sigma = (N+1)/2\).
# Define N, the max value
N <- 13
# each variable will hold n values where:
n <- 10000
#generate random uniform variable X, with n values from range 1 to N
X <- runif( n, min = 1, max = N )
#generate random normal variable Y, with n values and mean = sigma = (N+1)/2
Y <- rnorm( n, mean = (N+1)/2, sd = (N+1)/2 )
data_df <- data.frame( cbind( X, Y ) )
head( data_df )
## X Y
## 1 7.982340 8.217923
## 2 7.408306 16.656213
## 3 2.275263 -5.373129
## 4 4.919211 -3.260230
## 5 4.892818 9.081257
## 6 5.812661 2.911271
Take a moment to evaluate the variables:
## [1] "length X: 10000 length Y: 10000"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 3.937 7.069 7.022 10.024 13.000
## [1] "(N+1)/2 = 7"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -25.373 2.244 7.132 7.028 11.772 30.648
## [1] "sigma(Y) = 7.02"
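Both \(X\) and \(Y\) have the expected length (10,000 values each), and their summary statistics match the targets: the sample means are 7.022 and 7.028 against \((N+1)/2 = 7\), and the sample standard deviation of \(Y\) is 7.02.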
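The checks above are shown without the code that produced them. A minimal sketch of how such output might be generated is given below; the seed and the exact calls are assumptions, not the original code, so the numbers will differ slightly from those printed above.

```r
# Hypothetical reconstruction of the hidden checks; the seed is an assumption
set.seed(123)
N <- 13
n <- 10000
X <- runif(n, min = 1, max = N)
Y <- rnorm(n, mean = (N + 1) / 2, sd = (N + 1) / 2)

# Lengths of both variables
print(paste("length X:", length(X), " length Y:", length(Y)))

# Min, quartiles, median, mean and max of each variable
print(summary(X))
print(summary(Y))

# Target value shared by E[X], E[Y] and sd(Y)
target <- (N + 1) / 2
print(paste("(N+1)/2 =", target))
print(paste("sigma(Y) =", round(sd(Y), 2)))
```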
Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.
#from the problem it is given that:
x <- median( X )
y <- summary( Y )[ 2 ]
res <- paste( 'The median of X, x =', x, '\nThe 1st quartile of Y, y =', y )
cat( res )
## The median of X, x = 7.06896666111425
## The 1st quartile of Y, y = 2.2438171496078
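The 1st quartile can also be requested directly with `quantile()`, which avoids indexing into the `summary()` output by position. This is a sketch under assumptions: the seed is hypothetical and the data are regenerated here only so the block runs on its own.

```r
# Assumed seed; data regenerated so the example is self-contained
set.seed(123)
N <- 13
n <- 10000
X <- runif(n, min = 1, max = N)
Y <- rnorm(n, mean = (N + 1) / 2, sd = (N + 1) / 2)

# x: median of X; y: 1st quartile of Y (unname() drops the "25%" label)
x <- median(X)
y <- unname(quantile(Y, probs = 0.25))

# By construction about half of X lies above x and a quarter of Y lies at or below y
share_above_x <- mean(X > x)
share_below_y <- mean(Y <= y)
cat("x =", x, "\ny =", y, "\n")
```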
\(P(X \gt x | X \gt y) = \frac{P(X \gt x \cap X \gt y)}{P(X \gt y)}\)
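# numerator: both conditions are on X
joint <- data_df %>% filter( X > x, X > y )
jointp <- dim( joint )[ 1 ]/ n
# denominator: P(X > y)
marg <- data_df %>% filter( X > y )
margp <- dim( marg )[ 1 ]/ n
condp <- jointp / margp
# joint probability of X > x and Y > y
jointxy <- dim( data_df %>% filter( X > x, Y > y ) )[ 1 ]/ n
The conditional probability, \(P(X \gt x | X \gt y) =\) 0.5 / 0.8962 ≈ 0.5579. Because \(y \lt x\), every value above \(x\) is also above \(y\), so the numerator reduces to \(P(X \gt x) = 0.5\): given that a draw of \(X\) exceeds the 1st quartile of \(Y\), it exceeds the median of \(X\) about 56% of the time.
The joint probability, \(P(X \gt x, Y \gt y) =\) 0.374: about 37% of the paired draws have \(X\) above its median and \(Y\) above its 1st quartile.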
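For a uniform \(X\) on \([1, N]\), tail probabilities are available in closed form, so the empirical estimate of \(P(X \gt x | X \gt y)\) can be compared with a theoretical value. The sketch below assumes the rounded values of x and y reported earlier and uses `punif()` for the upper tails.

```r
N <- 13
x <- 7.069  # median of X as reported above (rounded)
y <- 2.244  # 1st quartile of Y as reported above (rounded)

# Upper-tail probabilities of a uniform variable on [1, N]
p_gt_x <- punif(x, min = 1, max = N, lower.tail = FALSE)
p_gt_y <- punif(y, min = 1, max = N, lower.tail = FALSE)

# Since y < x, the event {X > x and X > y} is just {X > x}
cond_theory <- p_gt_x / p_gt_y
cat("Theoretical P(X > x | X > y) =", cond_theory, "\n")
```

The ratio simplifies to \((N - x)/(N - y)\), which does not depend on the sample at all except through the two thresholds.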
\(P(X \lt x | X \lt y) = \frac{P(X \lt x \cap X \lt y)}{P(X \lt y)}\)
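# numerator: both conditions are on X
joint <- data_df %>% filter( X < x, X < y )
jointp <- dim( joint )[ 1 ]/ n
# denominator: P(X < y)
marg <- data_df %>% filter( X < y )
margp <- dim( marg )[ 1 ]/ n
condp <- jointp / margp
The conditional probability, \(P(X \lt x | X \lt y) =\) 1. Because \(y \lt x\), every value of \(X\) below \(y\) is also below \(x\), so the numerator equals the denominator: a draw of \(X\) that falls below the 1st quartile of \(Y\) is certain to fall below the median of \(X\).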
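A conditional probability \(P(A | B)\) can be estimated from two logical vectors as `mean(A & B) / mean(B)`. Written this way the numerator is always a subset of the denominator, so the estimate cannot exceed 1. The helper name `cond_prob` and the seed below are assumptions introduced for illustration.

```r
# Hypothetical helper: empirical P(event | given) from logical vectors
cond_prob <- function(event, given) {
  mean(event & given) / mean(given)
}

# Assumed seed; data regenerated so the example is self-contained
set.seed(123)
N <- 13
n <- 10000
X <- runif(n, min = 1, max = N)
Y <- rnorm(n, mean = (N + 1) / 2, sd = (N + 1) / 2)
x <- median(X)
y <- unname(quantile(Y, probs = 0.25))

p_gt <- cond_prob(X > x, X > y)  # P(X > x | X > y)
p_lt <- cond_prob(X < x, X < y)  # P(X < x | X < y)
cat("P(X > x | X > y) =", p_gt, "\nP(X < x | X < y) =", p_lt, "\n")
```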
The following code builds a joint probability table with marginal probabilities:
getjoints <- data_df %>% mutate( jp1 = ( X <= x & Y <= y),
jp2 = ( X <= x & Y > y ),
jp3 = ( X > x & Y <= y ),
jp4 = ( X > x & Y > y ) )
jointps <- colSums( getjoints[,3:6] )/n
jointps <- data.frame( matrix( jointps, ncol=2, byrow=TRUE ) )
colnames( jointps ) <- c( 'P(Y<=y)', 'P(Y>y)' )
rownames( jointps ) <- c( 'P(X<=x)', 'P(X>x)' )
jointps <- jointps %>% mutate( 'Total' = rowSums(.[1:2] ) )
jointps[ 'Total' ,] <- colSums( jointps )
rownames( jointps ) <- c( 'P(X<=x)', 'P(X>x)', 'Total' )
jointps
##         P(Y<=y) P(Y>y) Total
## P(X<=x)   0.124  0.376   0.5
## P(X>x)    0.126  0.374   0.5
## Total     0.250  0.750   1.0
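The same kind of table can be built with base R's `table()`, `prop.table()` and `addmargins()`. This is an alternative sketch, not the code used above; the seed and the label strings are assumptions, and `addmargins()` names the totals "Sum" rather than "Total".

```r
# Assumed seed; data regenerated so the example is self-contained
set.seed(123)
N <- 13
n <- 10000
X <- runif(n, min = 1, max = N)
Y <- rnorm(n, mean = (N + 1) / 2, sd = (N + 1) / 2)
x <- median(X)
y <- unname(quantile(Y, probs = 0.25))

# Label each observation by the side of its threshold it falls on
x_side <- factor(X > x, levels = c(FALSE, TRUE), labels = c("X<=x", "X>x"))
y_side <- factor(Y > y, levels = c(FALSE, TRUE), labels = c("Y<=y", "Y>y"))

# Joint counts -> joint proportions -> proportions with row and column totals
counts <- table(x_side, y_side)
joint_tab <- addmargins(prop.table(counts))
print(round(joint_tab, 3))
```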
From these values comparisons can be made:
\(P(X \gt x \mbox{ & } Y \gt y) =\) 0.374
\(P(X \gt x) P(Y \gt y) =\) 0.5 \(\cdot\) 0.75 = 0.375
The statement \(P(X \gt x \mbox{ & } Y \gt y) = P(X \gt x) P(Y \gt y)\) therefore holds approximately for this data (0.374 vs. 0.375), which is a good indication that \(X\) and \(Y\) are independent.
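The comparison can be extended to all four cells at once: under independence each joint probability equals the product of its row and column marginals. The sketch below uses the joint probabilities printed in the table above; the variable names are assumptions.

```r
# Joint probabilities as printed in the table above
joint <- matrix(c(0.124, 0.376,
                  0.126, 0.374),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("X<=x", "X>x"), c("Y<=y", "Y>y")))

# Marginal probabilities of each row and column
p_x <- rowSums(joint)
p_y <- colSums(joint)

# Joint probabilities expected if X and Y were independent
expected <- outer(p_x, p_y)

# Largest absolute gap between observed and expected cells
max_gap <- max(abs(joint - expected))
print(expected)
cat("Largest gap:", max_gap, "\n")
```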
getmargins <- data_df %>% mutate( m1 = ( X <= x ),
m2 = ( Y <= y ),
m3 = ( X > x ),
m4 = ( Y > y ) )
margins <- colSums( getmargins[,3:6] )
margins <- matrix( margins, 2, 2, dimnames = list( c( 'X', 'Y' ), c( '<=', '>' ) ) )
margins
##     <=    >
## X 5000 5000
## Y 2500 7500
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: margins
## X-squared = 1332.3, df = 1, p-value < 2.2e-16
##
## Fisher's Exact Test for Count Data
##
## data: margins
## p-value < 2.2e-16
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 2.824333 3.186757
## sample estimates:
## odds ratio
## 3
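The p-values for both tests are very small (p-value < 2.2e-16), so the null hypothesis is rejected for the margins table. Note, however, that this table holds the marginal counts of \(X\) and \(Y\) (a 50/50 split for \(X\) and a 25/75 split for \(Y\)) rather than their joint counts, so the result shows only that the two variables are split at different quantiles; it does not show that \(X\) and \(Y\) are associated. A test of independence would instead use the 2×2 table of joint counts, whose near agreement with the product of the marginals (0.374 vs. 0.375) points toward independence. Both the \(\chi ^2\) and Fisher’s Exact tests are statistical tests of independence between data features. Fisher’s test yields an exact p-value, whereas \(\chi ^2\) relies on a large-sample approximation. Because this is a very large sample and the \(\chi ^2\) approximation improves with sample size, \(\chi ^2\) is well suited to this application.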
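Tests of independence are normally applied to a contingency table of joint counts. A minimal sketch follows, assuming the counts implied by the joint probability table above (for example 0.124 × 10,000 = 1,240); the object names are hypothetical and this is not the call that produced the output shown earlier.

```r
# Joint counts implied by the joint probability table (n = 10,000)
joint_counts <- matrix(c(1240, 3760,
                         1260, 3740),
                       nrow = 2, byrow = TRUE,
                       dimnames = list(c("X<=x", "X>x"), c("Y<=y", "Y>y")))

# Chi-squared test (Yates' continuity correction is the default for 2x2 tables)
chi_res <- chisq.test(joint_counts)

# Fisher's exact test on the same table
fisher_res <- fisher.test(joint_counts)

cat("Chi-squared p-value:", chi_res$p.value, "\n")
cat("Fisher p-value:     ", fisher_res$p.value, "\n")
```

With these counts neither p-value falls below 0.05, so the tests give no evidence against independence of \(X\) and \(Y\).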