Pick one of the quantitative independent variables (\(X_i\)) from the data set below, and define that variable as X. Also, pick one of the dependent variables (\(Y_i\)) below, and define that as Y.
X1 <- c(9.3, 4.1, 22.4, 9.1, 15.8, 7.1, 15.9, 6.9, 16.0, 6.7,
8.2, 16.0, 6.4, 11.8, 3.5, 21.7, 12.2, 9.3, 8.0, 6.2)
Y1 <- c(20.3, 19.1, 19.3, 20.9, 22.0, 23.5, 13.8, 18.8, 20.9, 18.6,
22.3, 17.6, 20.8, 28.7, 15.2, 20.9, 18.4, 10.3, 26.3, 28.1)
df <- data.frame(X1, Y1)
plot(X1, Y1)
Calculate, at a minimum, the probabilities a through c below. Assume the lowercase “\(x\)” is estimated as the 3rd quartile of the \(X\) variable, and the lowercase “\(y\)” is estimated as the 1st quartile of the \(Y\) variable. Interpret the meaning of all probabilities.
(Hint: P(X > 3rd quartile of the x values | Y > 1st quartile of the y values).)
summary(X1)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.50    6.85    9.20   10.83   15.82   22.40
summary(Y1)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.30   18.55   20.55   20.29   22.07   28.70
x <- summary(X1)[5]
x
## 3rd Qu.
## 15.825
y <- summary(Y1)[2]
y
## 1st Qu.
## 18.55
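As a quick cross-check (a sketch, not part of the required output), quantile() returns the same cut points directly; its default type-7 algorithm is the one summary() uses:
quantile(X1, 0.75)
##    75% 
## 15.825
quantile(Y1, 0.25)
##   25% 
## 18.55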
Y_gt_y <- df[which(df$Y1>y),]
X_gt_x <- Y_gt_y[which(Y_gt_y$X1>x),]
a <- nrow(X_gt_x)/nrow(Y_gt_y)
a
## [1] 0.2
a is the conditional probability \(P(X > x \mid Y > y)\): given that an observation's \(Y\) value exceeds the 1st quartile of \(Y\), there is a 20% chance (3 of the 15 such observations) that its \(X\) value also exceeds the 3rd quartile of \(X\).
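The same conditional probability can be computed without building intermediate subsets, by counting with logical vectors directly (a sketch):
sum(df$X1 > x & df$Y1 > y) / sum(df$Y1 > y)
## [1] 0.2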
X_and_Y <- df[which(df$Y1>y & df$X1>x),]
X_and_Y
##      X1   Y1
## 3  22.4 19.3
## 9  16.0 20.9
## 16 21.7 20.9
b <- nrow(X_and_Y)/nrow(df)
b
## [1] 0.15
b is the joint probability \(P(X > x, Y > y)\): 15% of all 20 observations (3 of them) have \(X\) above its 3rd quartile and \(Y\) above its 1st quartile simultaneously.
X_lt_x <- Y_gt_y[which(Y_gt_y$X1<x),]
c <- nrow(X_lt_x)/nrow(Y_gt_y)
c
## [1] 0.8
c is the conditional probability \(P(X < x \mid Y > y)\): given \(Y\) above its 1st quartile, there is an 80% chance (12 of 15) that \(X\) falls below its 3rd quartile. Since no \(X\) value equals the quartile exactly, this is the complement of a: \(1 - 0.2 = 0.8\).
| | X <= 3rd quartile | X > 3rd quartile | Total |
|---|---|---|---|
| Y <= 1st quartile | 3 | 2 | 5 |
| Y > 1st quartile | 12 | 3 | 15 |
| Total | 15 | 5 | 20 |
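This table can be reproduced programmatically; a sketch using table() on the two logical conditions, with addmargins() supplying the row and column totals:
addmargins(table(Y = df$Y1 > y, X = df$X1 > x))
##         X
## Y       FALSE TRUE Sum
##   FALSE     3    2   5
##   TRUE     12    3  15
##   Sum      15    5  20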
Does splitting the data in this fashion make the resulting events independent?
Let \(A\) be the new variable counting those observations below the 3rd quartile for \(X\), and let \(B\) be the new variable counting those observations above the 1st quartile for \(Y\).
Does \(P(AB)=P(A)P(B)\)? Check mathematically, and then evaluate by running a chi-square test for association.
# Add the indicator columns A and B to the data frame
df$A <- df$X1 < x
df$B <- df$Y1 > y
# Calculate A = count of observations below the 3rd quartile for X
A <- sum(df$A)
A
## [1] 15
# Calculate B = count of observations above the 1st quartile for Y
B <- sum(df$B)
B
## [1] 15
P_AB <- sum(df$A & df$B)/nrow(df)
P_AB
## [1] 0.6
P_A <- A/nrow(df)
P_B <- B/nrow(df)
P_A * P_B
## [1] 0.5625
P_AB == P_A * P_B
## [1] FALSE
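Numerically, \(P(AB) = 0.6\) while \(P(A)P(B) = 0.5625\), so the independence identity fails for the empirical proportions. Exact == comparison of doubles is fragile in general; a tolerance-based check (a sketch) reaches the same conclusion here:
isTRUE(all.equal(P_AB, P_A * P_B))
## [1] FALSE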
In order to perform a chi-square test we first need to build a contingency table.
contingency_table <- table(df[, c("A", "B")])
contingency_table
##         B
## A       FALSE TRUE
##   FALSE     2    3
##   TRUE      3   12
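Dividing the table by the total number of observations gives the joint proportions; the TRUE/TRUE cell matches the \(P(AB) = 0.6\) computed above (a sketch):
prop.table(contingency_table)
##         B
## A       FALSE TRUE
##   FALSE  0.10 0.15
##   TRUE   0.15 0.60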
A chi-square test examines whether rows and columns of a contingency table have a statistically significant association.
Null Hypothesis \(H_0\): A and B are independent.
Alternative Hypothesis \(H_A\): A and B are not independent. There is a relationship between them.
Using a significance level of 0.05: if the chi-square p-value is less than 0.05, results at least this extreme would be unlikely if the null hypothesis were true, and we reject the null hypothesis.
If the p-value is greater than 0.05, there is a good chance we could get these results even if the null were true. In that case we fail to reject the null hypothesis; the data are consistent with A and B being independent, though this does not prove independence.
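Equivalently, the decision rule can be phrased with the critical value of the \(\chi^2\) distribution with 1 degree of freedom (a quick sketch):
qchisq(0.95, df = 1) # reject H0 when the test statistic exceeds this value
## [1] 3.841459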
# Use chisq.test; correct = FALSE disables Yates' continuity correction.
# (The p argument applies only to goodness-of-fit tests on a vector and is
# ignored for a contingency table, so it is dropped here.)
chisq <- chisq.test(contingency_table, correct = FALSE)
chisq
##
## Pearson's Chi-squared test
##
## data: contingency_table
## X-squared = 0.8, df = 1, p-value = 0.3711
# summary() on a table runs the same chi-square test of independence
summary(contingency_table)
## Number of cases in table: 20
## Number of factors: 2
## Test for independence of all factors:
## Chisq = 0.8, df = 1, p-value = 0.3711
## Chi-squared approximation may be incorrect
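The warning appears because two of the expected cell counts (shown below) are less than 5, which makes the large-sample chi-square approximation unreliable. As a cross-check (a sketch, not required by the assignment), Fisher's exact test avoids that approximation entirely:
fisher.test(contingency_table)$p.value
## [1] 0.559791
The exact p-value is also well above 0.05, so the conclusion is unchanged: we fail to reject independence.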
# Observed counts
chisq$observed
##         B
## A       FALSE TRUE
##   FALSE     2    3
##   TRUE      3   12
# Expected counts under the null hypothesis:
round(chisq$expected,2)
##         B
## A       FALSE  TRUE
##   FALSE  1.25  3.75
##   TRUE   3.75 11.25
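As a final sanity check (a sketch), the statistic can be rebuilt by hand from the observed and expected counts, and the p-value recovered from the upper tail of the \(\chi^2_1\) distribution:
sum((chisq$observed - chisq$expected)^2 / chisq$expected)
## [1] 0.8
pchisq(0.8, df = 1, lower.tail = FALSE) # ~0.3711, matching chisq.test above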