https://www.youtube.com/watch?v=AkDOaz5iAWs&feature=youtu.be
set.seed(123)
n = 100
X <- runif(10000, min = 6, max = n)
mean = (n + 1 )/2
sd = (n + 1 )/2
Y <- rnorm(10000, sd= sd , mean= mean)
summary(Y)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -143.69 16.90 50.27 50.60 85.66 244.81
summary(X)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.006 29.772 52.489 52.770 75.879 99.994
# 'x' is median of X variable
x <- median(X)
x
## [1] 52.48936
# 'y' is 1st quartile of Y variable
y <- quantile(Y)[2]
y
## 25%
## 16.89941
df <- as.data.frame(X)
df$Y <- Y
##P(X>x | X>y) is denoted as P_a
#P(A | B ) = P(A and B )/P(B)
Prob_A_and_B <- NROW(df[df$X > x & df$X >y,])/NROW(df)
Prob_B <- NROW(df[df$X >y,])/NROW(df)
P_a <- Prob_A_and_B/Prob_B
P_a
## [1] 0.5638886
## P(X>x, Y>y)
P_b <- NROW(df[df$X > x & df$Y >y,])/NROW(df)
P_b
## [1] 0.3756
## P(X<x | X>y) is denoted as P_c
#P(C | B ) = P(C and B )/PB
Prob_C_and_B = NROW(df[df$X < x & df$X >y,])/NROW(df)
P_c <- Prob_C_and_B/Prob_B
P_c
## [1] 0.4361114
Below is the code to generate the marginal and joint probabilites from randomly generated X, Y data.
cell_0_0 <- NROW(df[df$X > x & df$Y > y,])
cell_0_1 <- NROW(df[df$X < x & df$Y > y,])
cell_0_2 <- NROW(df[df$X == x & df$Y > y,])
cell_1_0 <- NROW(df[df$X > x & df$Y < y,])
cell_1_1 <- NROW(df[df$X < x & df$Y < y,])
cell_1_2 <- NROW(df[df$X == x & df$Y < y,])
cell_2_0 <- NROW(df[df$X > x & df$Y == y,])
cell_2_1 <- NROW(df[df$X < x & df$Y == y,])
cell_2_2 <- NROW(df[df$X == x & df$Y == y,])
t <- c(cell_0_0, cell_0_1, cell_0_2)
t <- rbind(t,c(cell_1_0, cell_1_1, cell_1_2))
t <- rbind(t,c(cell_2_0, cell_2_1, cell_2_2))
t <- cbind(t, t[,1] + t[,2] + t[,3])
t <- rbind(t, t[1,] + t[2,] + t[3,])
colnames(t) <- c('X>x', 'X<x', 'X=x', 'Total')
rownames(t) <- c('Y>y', 'Y<y', 'Y=y', 'Total')
t
## X>x X<x X=x Total
## Y>y 3756 3744 0 7500
## Y<y 1244 1256 0 2500
## Y=y 0 0 0 0
## Total 5000 5000 0 10000
# P(X>x and Y>y)=P(X>x)P(Y>y)
Now, building a anonymous function to determine the probabilities and apply that fn to all elem in the dataframe.
func = function(x) x/10000
p <- apply(t, MARGIN = c(1,2), FUN = func)
p
## X>x X<x X=x Total
## Y>y 0.3756 0.3744 0 0.75
## Y<y 0.1244 0.1256 0 0.25
## Y=y 0.0000 0.0000 0 0.00
## Total 0.5000 0.5000 0 1.00
# P(X>x and Y>y)=P(X>x)P(Y>y) is denoted as P(A and B) = P(A) * P(B)
According to the above contigency probabily table, we calculate the probability of P(A and B), P(A) and P(B), prooved that P(A and B) is equal to the P(A) * P(B)
P_A_B <- 0.3756
P_A <- 0.5000
P_B <- 0.75
P_AB <- P_A * P_B
P_AB
## [1] 0.375
P_A_B
## [1] 0.3756
Lets feed the contigency table into chisq test api. The contigency table only contains the joint proportions. You have 2 hypothesis. 1. Null hypothesis -> X, Y are independent 2. Alternate Hypothesis -> X, Y are dependent.
mat <- matrix(c(3756, 3744, 1244, 1256), 2, 2, byrow=T)
chi <- chisq.test(mat, correct = TRUE)
chi
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: mat
## X-squared = 0.064533, df = 1, p-value = 0.7995
Since the p-value is greater than 0.05, we cant reject null hypothesis that means X, Y are independent. now, lets Fischer test to find the p-value.
fis <- fisher.test(mat, conf.int = T , conf.level = 0.95)
fis
##
## Fisher's Exact Test for Count Data
##
## data: mat
## p-value = 0.7995
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.9242273 1.1100187
## sample estimates:
## odds ratio
## 1.012883
In Fischer test, p-value comes as 0.7995 which is same as Chi Sqaure P-value. Some of the Chi Square and Fischer test are given below.