Data 605 Final Project

Youtube Video Link

https://www.youtube.com/watch?v=AkDOaz5iAWs&feature=youtu.be

set.seed(123)
n = 100
X <- runif(10000, min = 6, max = n)

mean = (n + 1 )/2
sd = (n + 1 )/2

Y <- rnorm(10000, sd= sd , mean= mean)

summary(Y)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -143.69   16.90   50.27   50.60   85.66  244.81

summary(X)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   6.006  29.772  52.489  52.770  75.879  99.994

# 'x' is median of X variable
x <- median(X)
 x

## [1] 52.48936

# 'y' is 1st quartile of Y variable
y <- quantile(Y)[2]
y

##      25% 
## 16.89941

a. P(X>x | X>y) b. P(X>x, Y>y) c. P(Xy)

df <- as.data.frame(X)
df$Y <- Y 

##P(X>x | X>y) is denoted as P_a

#P(A | B ) = P(A and B )/P(B)

Prob_A_and_B <- NROW(df[df$X > x & df$X >y,])/NROW(df)
Prob_B <- NROW(df[df$X >y,])/NROW(df)
P_a <- Prob_A_and_B/Prob_B
P_a

## [1] 0.5638886

## P(X>x, Y>y) 

P_b <- NROW(df[df$X > x & df$Y >y,])/NROW(df)
P_b

## [1] 0.3756

##  P(X<x | X>y) is denoted as P_c

#P(C | B ) = P(C and B )/PB
Prob_C_and_B = NROW(df[df$X < x & df$X >y,])/NROW(df)
P_c <- Prob_C_and_B/Prob_B
P_c

## [1] 0.4361114

Qn 2: Investigate whether P(X>x and Y>y)=P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities.

Below is the code to generate the marginal and joint probabilites from randomly generated X, Y data.

cell_0_0 <- NROW(df[df$X > x & df$Y > y,])
cell_0_1 <- NROW(df[df$X < x & df$Y > y,])
cell_0_2 <- NROW(df[df$X == x & df$Y > y,])

cell_1_0 <- NROW(df[df$X > x & df$Y < y,])
cell_1_1 <- NROW(df[df$X < x & df$Y < y,])
cell_1_2 <- NROW(df[df$X == x & df$Y < y,])

cell_2_0 <- NROW(df[df$X > x & df$Y == y,])
cell_2_1 <- NROW(df[df$X < x & df$Y == y,])
cell_2_2 <- NROW(df[df$X == x & df$Y == y,])

t <- c(cell_0_0, cell_0_1, cell_0_2)
t <- rbind(t,c(cell_1_0, cell_1_1, cell_1_2))
t <- rbind(t,c(cell_2_0, cell_2_1, cell_2_2))
t <- cbind(t, t[,1] + t[,2] + t[,3])
t <- rbind(t, t[1,] + t[2,] + t[3,])

colnames(t) <- c('X>x', 'X<x', 'X=x', 'Total')
rownames(t) <- c('Y>y', 'Y<y', 'Y=y', 'Total')

t

##        X>x  X<x X=x Total
## Y>y   3756 3744   0  7500
## Y<y   1244 1256   0  2500
## Y=y      0    0   0     0
## Total 5000 5000   0 10000

# P(X>x and Y>y)=P(X>x)P(Y>y)

Now, building a anonymous function to determine the probabilities and apply that fn to all elem in the dataframe.

func = function(x) x/10000
p <- apply(t, MARGIN = c(1,2), FUN = func)
p

##          X>x    X<x X=x Total
## Y>y   0.3756 0.3744   0  0.75
## Y<y   0.1244 0.1256   0  0.25
## Y=y   0.0000 0.0000   0  0.00
## Total 0.5000 0.5000   0  1.00

# P(X>x and Y>y)=P(X>x)P(Y>y) is denoted as P(A and B) = P(A) * P(B)

According to the above contigency probabily table, we calculate the probability of P(A and B), P(A) and P(B), prooved that P(A and B) is equal to the P(A) * P(B)

P_A_B <- 0.3756
P_A <- 0.5000
P_B <- 0.75

P_AB <- P_A * P_B 
P_AB

## [1] 0.375

P_A_B

## [1] 0.3756

Qn 3 Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?

Lets feed the contigency table into chisq test api. The contigency table only contains the joint proportions. You have 2 hypothesis. 1. Null hypothesis -> X, Y are independent 2. Alternate Hypothesis -> X, Y are dependent.

mat <- matrix(c(3756, 3744, 1244,  1256), 2, 2, byrow=T) 
chi <- chisq.test(mat, correct = TRUE)

chi

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  mat
## X-squared = 0.064533, df = 1, p-value = 0.7995

Since the p-value is greater than 0.05, we cant reject null hypothesis that means X, Y are independent. now, lets Fischer test to find the p-value.

fis <- fisher.test(mat, conf.int = T , conf.level = 0.95)
fis

## 
##  Fisher's Exact Test for Count Data
## 
## data:  mat
## p-value = 0.7995
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.9242273 1.1100187
## sample estimates:
## odds ratio 
##   1.012883

In Fischer test, p-value comes as 0.7995 which is same as Chi Sqaure P-value. Some of the Chi Square and Fischer test are given below.

Fischer test is most suited for small samples or proportions. Chi Square is suited for large samples.
Fischer test is highly computational. This is the best hyposthesis testing and gives correct p-value in proportion is small or large.

Data 605 Final Project - Part 1

Youtube Video Link

Qn1 : Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.

a. P(X>x | X>y) b. P(X>x, Y>y) c. P(Xy)

Qn 2: Investigate whether P(X>x and Y>y)=P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities.

Qn 3 Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?