RECORDING OF VIDEO: https://us06web.zoom.us/rec/share/VgvJ0oX8t_K5Et7Za6CMqExktRHFVX9qmBz_hvR5nfcqXESWQ9TAjonRpdDQgiMu.J7AGWIXFqeuQWt4A Passcode: fIq%pE?9

Problem 1

Using R, set a random seed equal to 1234 (i.e., set.seed(1234)). Generate a random variable X that has 10,000 continuous random uniform values between 5 and 15. Then generate a random variable Y that has 10,000 random normal values with a mean of 10 and a standard deviation of 2.89.

set.seed(1234)

X <- runif(n=10000, min = 5, max = 15)
Y <- rnorm(n=10000, mean=10, sd =2.89)
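The standard deviation of 2.89 lines up with the uniform variable: a uniform distribution on [a, b] has mean (a+b)/2 and standard deviation (b-a)/sqrt(12), which for [5, 15] gives a mean of 10 and a standard deviation of about 2.887. As a quick sanity check (a sketch, not part of the original assignment), the sample moments can be compared with these theoretical values. The names unif_mean and unif_sd are introduced here for illustration only.

```r
set.seed(1234)
X <- runif(n = 10000, min = 5, max = 15)
Y <- rnorm(n = 10000, mean = 10, sd = 2.89)

# Theoretical moments of a Uniform(5, 15) variable
unif_mean <- (5 + 15) / 2          # 10
unif_sd   <- (15 - 5) / sqrt(12)   # about 2.887

# Sample moments should be close to the theoretical ones
c(mean_X = mean(X), sd_X = sd(X), mean_Y = mean(Y), sd_Y = sd(Y))
c(unif_mean = unif_mean, unif_sd = unif_sd)
```

So X and Y are built to share roughly the same center and spread while differing in shape.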

Histogram

Use the histogram to verify the distribution of the X and Y.

Histogram of X

The histogram is roughly flat: values between 5 and 15 occur with about the same frequency, as expected for a uniform distribution.

hist(X)

Histogram of Y

The histogram is bell-shaped and centered near 10, consistent with a normal distribution for the Y values.

hist(Y)
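One way to make the visual check more concrete is to draw each histogram on the density scale and overlay the theoretical density curve. This is only a sketch using base graphics (hist with freq = FALSE, then curve with add = TRUE); the plot titles are illustrative choices.

```r
set.seed(1234)
X <- runif(n = 10000, min = 5, max = 15)
Y <- rnorm(n = 10000, mean = 10, sd = 2.89)

# Density-scaled histogram of X with the Uniform(5, 15) density on top
hist(X, freq = FALSE, main = "Histogram of X with uniform density")
curve(dunif(x, min = 5, max = 15), add = TRUE, lwd = 2)

# Density-scaled histogram of Y with the Normal(10, 2.89) density on top
hist(Y, freq = FALSE, main = "Histogram of Y with normal density")
curve(dnorm(x, mean = 10, sd = 2.89), add = TRUE, lwd = 2)
```

If the samples follow the intended distributions, the bars should track the overlaid curves closely.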

Part 1

Probability

Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the median of the Y variable. Interpret the meaning of all probabilities. a. P(X>x | X>y) b. P(X>x & Y>y) c. P(X<x | X>y)

First, we assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the median of the Y variable.

x <- median(X)
y <- median(Y)
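Both distributions are symmetric about 10, so both theoretical medians equal 10, and the sample medians x and y should be close to 10 and to each other. A hedged check using the quantile functions qunif and qnorm:

```r
set.seed(1234)
X <- runif(n = 10000, min = 5, max = 15)
Y <- rnorm(n = 10000, mean = 10, sd = 2.89)
x <- median(X)
y <- median(Y)

# Theoretical medians from the quantile functions (both equal 10)
qunif(0.5, min = 5, max = 15)
qnorm(0.5, mean = 10, sd = 2.89)

# Sample medians and their difference
c(x = x, y = y, difference = y - x)
```

Because x and y are so close, which of the two is larger largely determines the conditional probabilities that compare X against both thresholds.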

P(X>x | X>y)

\(P(X>x | X>y)=\frac{P(X>x \cap X>y)}{P(X>y)}\)

p_a <- mean(X>x & X>y)/mean(X>y)
p_a

## [1] 1
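A conditional probability can also be estimated by restricting the sample to the cases where the condition holds and taking the proportion there. The sketch below should agree with the ratio form (p_a_subset is an illustrative name, not from the original). A value of 1 means that every sampled X above y is also above x, i.e. no sampled X value falls in the interval (y, x].

```r
set.seed(1234)
X <- runif(n = 10000, min = 5, max = 15)
Y <- rnorm(n = 10000, mean = 10, sd = 2.89)
x <- median(X)
y <- median(Y)

# Ratio form: P(X>x and X>y) / P(X>y)
p_a <- mean(X > x & X > y) / mean(X > y)

# Subset form: among the X values above y, the share that is also above x
p_a_subset <- mean(X[X > y] > x)

all.equal(p_a, p_a_subset)
```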

P(X>x & Y>y)

P(X>x & Y>y) = \(P(X>x \cap Y>y)\)

p_b <- mean(X>x & Y>y)
p_b

## [1] 0.2507

P(X<x | X>y)

\(P(X<x | X>y)=\frac{P(X<x \cap X>y)}{P(X>y)}\)

p_c <- mean(X<x & X>y)/mean(X>y)
p_c

## [1] 0
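Given X>y, the events X>x, X<x and X=x are mutually exclusive and exhaustive, so the conditional probabilities in (a) and (c) must sum to 1 minus the conditional probability of a tie. A minimal sketch of that check follows (p_tie is an illustrative name). For a continuous variable with an even sample size, no sampled value is expected to equal the median exactly, so p_tie should be 0.

```r
set.seed(1234)
X <- runif(n = 10000, min = 5, max = 15)
Y <- rnorm(n = 10000, mean = 10, sd = 2.89)
x <- median(X)
y <- median(Y)

p_a   <- mean(X > x & X > y) / mean(X > y)
p_c   <- mean(X < x & X > y) / mean(X > y)
# Conditional probability that X equals x exactly, given X>y
p_tie <- mean(X == x & X > y) / mean(X > y)

# The three conditional probabilities partition the event X>y
c(p_a = p_a, p_c = p_c, p_tie = p_tie, total = p_a + p_c + p_tie)
```

Read together, (a) and (c) say that once X is known to exceed y, it is also known to exceed x, so the chance of it falling below x is the complement.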

Part 2

Investigate whether P(X>x & Y>y)=P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities.

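From the table, P(X>x & Y>y) = 0.2507. Since P(X>x)=0.5 and P(Y>y)=0.5, \(P(X>x)P(Y>y)=0.5 \cdot 0.5 = 0.25\). The joint probability differs from the product of the marginals by only 0.0007, so the two are approximately equal, which is what independence predicts.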

# you can manually build the table
#matrix <- matrix(c(sum(X<=x & Y<=y),
#            sum(X>x & Y<=y),
#            sum(X<=x & Y>y),
#            sum(X>x & Y>y)), nrow=2)
#matrix


# rows index Y>y and columns index X>x, matching the labels assigned below
table <- table(Y>y, X>x)
rownames(table) <- c("Y<=y","Y>y")
colnames(table) <- c("X<=x","X>x")
table <- cbind(table, Total = rowSums(table))
table <- rbind(table, Total = colSums(table))
table <- table/10000
table

##         X<=x    X>x Total
## Y<=y  0.2507 0.2493   0.5
## Y>y   0.2493 0.2507   0.5
## Total 0.5000 0.5000   1.0
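Under independence, every cell of the joint table equals the product of its row and column marginals. The sketch below builds that expected table with outer() and compares it with the observed joint proportions. The names joint, row_marg, col_marg and expected are illustrative, and prop.table() stands in for the division by 10000 used above.

```r
set.seed(1234)
X <- runif(n = 10000, min = 5, max = 15)
Y <- rnorm(n = 10000, mean = 10, sd = 2.89)
x <- median(X)
y <- median(Y)

# Observed joint proportions: rows are Y>y, columns are X>x
joint <- prop.table(table(Y > y, X > x))

# Marginal proportions and the joint table implied by independence
row_marg <- rowSums(joint)
col_marg <- colSums(joint)
expected <- outer(row_marg, col_marg)

# Largest absolute gap between observed and expected cells
max(abs(joint - expected))
```

A gap that is small relative to the cell values supports treating the two events as independent; the formal tests in Part 3 quantify this.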

Part 3

Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate? Are you surprised at the results? Why or why not?

\(H_{0}\): the events X>x and Y>y are independent

\(H_{1}\): the events X>x and Y>y are NOT independent

Chi Square Test

The p-value is 0.7949. Since the p-value is greater than 0.05, we fail to reject the null hypothesis: the data provide no evidence that the events X>x and Y>y are dependent.

Fisher’s Exact Test computes an exact p-value, whereas the chi square test relies on a large-sample approximation, so Fisher’s test is generally preferred when the sample size is small or some expected cell counts are low. The chi square approximation becomes more accurate as the sample size increases. In this case, the chi square test is appropriate because our sample size is 10,000. Both tests have the null hypothesis that the variables are independent. The results are not surprising: X and Y were generated independently of each other, so the events X>x and Y>y should be independent even though the two variables follow different distributions.

chisq.test(table(X>x, Y>y))

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table(X > x, Y > y)
## X-squared = 0.0676, df = 1, p-value = 0.7949
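The reported statistic can be reproduced by hand from the counts implied by the Part 2 table (proportions 0.2507 and 0.2493 of 10,000 observations, i.e. 2507 and 2493). The sketch below hard-codes those counts rather than regenerating the data; observed, expected, yates and p_value are illustrative names.

```r
# Counts implied by the joint proportions in the Part 2 table
observed <- matrix(c(2507, 2493,
                     2493, 2507), nrow = 2, byrow = TRUE)

# Expected counts under independence: (row total * column total) / n
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson statistic with Yates' continuity correction (subtract 0.5 from |O - E|)
yates <- sum((abs(observed - expected) - 0.5)^2 / expected)

# Upper-tail p-value from the chi-square distribution with 1 degree of freedom
p_value <- pchisq(yates, df = 1, lower.tail = FALSE)

c(X_squared = yates, p_value = p_value)
```

Each cell contributes (7 - 0.5)^2 / 2500 = 0.0169, and the four cells sum to 0.0676, matching the X-squared value in the output above.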

Fisher’s Exact Test

fisher.test(table(X>x, Y>y))

## 
##  Fisher's Exact Test for Count Data
## 
## data:  table(X > x, Y > y)
## p-value = 0.7949
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.9342763 1.0946016
## sample estimates:
## odds ratio 
##   1.011264
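The odds ratio near 1 is another way to read the result: the odds of Y>y are almost the same whether or not X>x. As a rough cross-check, the sketch below computes the simple cross-product ratio, again using the counts implied by the Part 2 table (2507 and 2493). Note that fisher.test reports a conditional maximum likelihood estimate, so the cross-product ratio (sample_or, an illustrative name) should be close to the reported value but need not match it exactly.

```r
# Counts implied by the joint proportions in the Part 2 table
observed <- matrix(c(2507, 2493,
                     2493, 2507), nrow = 2, byrow = TRUE)

# Sample (cross-product) odds ratio: (a * d) / (b * c)
sample_or <- (observed[1, 1] * observed[2, 2]) /
             (observed[1, 2] * observed[2, 1])
sample_or

# Conditional MLE of the odds ratio, as reported by fisher.test
fisher.test(observed)$estimate
```

The 95 percent confidence interval in the output above contains 1, which agrees with the failure to reject independence.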

Source

https://www.datascienceblog.net/post/statistical_test/contingency_table_tests/

DATA 605 Final Part 1

Susanna Wong

2023-12-05