Pick one of the quantitative independent variables (\(X_i\)) from the data set below, and define that variable as X. Also, pick one of the dependent variables (\(Y_i\)) below, and define that as Y.
X1 <- c(9.3, 4.1, 22.4, 9.1, 15.8, 7.1, 15.9, 6.9, 16.0, 6.7,
8.2, 16.0, 6.4, 11.8, 3.5, 21.7, 12.2, 9.3, 8.0, 6.2)
Y1 <- c(20.3, 19.1, 19.3, 20.9, 22.0, 23.5, 13.8, 18.8, 20.9, 18.6,
22.3, 17.6, 20.8, 28.7, 15.2, 20.9, 18.4, 10.3, 26.3, 28.1)
df <- data.frame(X1, Y1)
plot(X1, Y1)
Calculate, at a minimum, the probabilities a through c below. Assume the lowercase “\(x\)” is estimated as the 3rd quartile of the \(X\) variable, and the lowercase “\(y\)” is estimated as the 1st quartile of the \(Y\) variable. Interpret the meaning of all probabilities.
(Hint: P(X > 3rd quartile of the x values | Y > 1st quartile of the y values).)
summary(X1)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.50    6.85    9.20   10.83   15.82   22.40
summary(Y1)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.30   18.55   20.55   20.29   22.07   28.70
x <- summary(X1)[5]
x
## 3rd Qu.
## 15.825
y <- summary(Y1)[2]
y
## 1st Qu.
## 18.55
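As a quick cross-check (a sketch, not part of the required output), quantile() returns the same cut points directly; its default type-7 algorithm is the one summary() uses:
quantile(X1, 0.75)
##    75% 
## 15.825
quantile(Y1, 0.25)
##   25% 
## 18.55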
Y_gt_y <- df[which(df$Y1>y),]
X_gt_x <- Y_gt_y[which(Y_gt_y$X1>x),]
a <- nrow(X_gt_x)/nrow(Y_gt_y)
a
## [1] 0.2
a is the conditional probability \(P(X > x \mid Y > y)\): given that an observation's \(Y\) value exceeds the 1st quartile of \(Y\), there is a 20% chance (3 of the 15 such observations) that its \(X\) value also exceeds the 3rd quartile of \(X\).
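The same conditional probability can be computed without building intermediate subsets, by counting with logical vectors directly (a sketch):
sum(df$X1 > x & df$Y1 > y) / sum(df$Y1 > y)
## [1] 0.2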
X_and_Y <- df[which(df$Y1>y & df$X1>x),]
X_and_Y
##      X1   Y1
## 3  22.4 19.3
## 9  16.0 20.9
## 16 21.7 20.9
b <- nrow(X_and_Y)/nrow(df)
b
## [1] 0.15
b is the joint probability \(P(X > x, Y > y)\): 15% of all 20 observations (3 of them) have \(X\) above its 3rd quartile and \(Y\) above its 1st quartile simultaneously.
X_lt_x <- Y_gt_y[which(Y_gt_y$X1<x),]
c <- nrow(X_lt_x)/nrow(Y_gt_y)
c
## [1] 0.8
c is the conditional probability \(P(X < x \mid Y > y)\): given \(Y\) above its 1st quartile, there is an 80% chance (12 of 15) that \(X\) falls below its 3rd quartile. Since no \(X\) value equals the quartile exactly, this is the complement of a: \(1 - 0.2 = 0.8\).
| | X <= 3rd quartile | X > 3rd quartile | Total |
|---|---|---|---|
| Y <= 1st quartile | 3 | 2 | 5 |
| Y > 1st quartile | 12 | 3 | 15 |
| Total | 15 | 5 | 20 |
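This table can be reproduced programmatically; a sketch using table() on the two logical conditions, with addmargins() supplying the row and column totals:
addmargins(table(Y = df$Y1 > y, X = df$X1 > x))
##         X
## Y       FALSE TRUE Sum
##   FALSE     3    2   5
##   TRUE     12    3  15
##   Sum      15    5  20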
Does splitting the data in this fashion make the resulting events independent?
Let \(A\) be the new variable counting those observations below the 3rd quartile for \(X\), and let \(B\) be the new variable counting those observations above the 1st quartile for \(Y\).
Does \(P(AB)=P(A)P(B)\)? Check mathematically, and then evaluate by running a chi-square test for association.
# Add the indicator columns A and B to the data frame
df$A <- df$X1 < x
df$B <- df$Y1 > y
# Calculate A = count of observations below the 3rd quartile for X
A <- sum(df$A)
A
## [1] 15
# Calculate B = count of observations above the 1st quartile for Y
B <- sum(df$B)
B
## [1] 15
P_AB <- sum(df$A & df$B)/nrow(df)
P_AB
## [1] 0.6
P_A <- A/nrow(df)
P_B <- B/nrow(df)
P_A * P_B
## [1] 0.5625
P_AB == P_A * P_B
## [1] FALSE
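Numerically, \(P(AB) = 0.6\) while \(P(A)P(B) = 0.5625\), so the independence identity fails for the empirical proportions. Exact == comparison of doubles is fragile in general; a tolerance-based check (a sketch) reaches the same conclusion here:
isTRUE(all.equal(P_AB, P_A * P_B))
## [1] FALSE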
In order to perform a chi-square test we first need to build a contingency table.
contingency_table <- table(df[, c("A", "B")])
contingency_table
##         B
## A       FALSE TRUE
##   FALSE     2    3
##   TRUE      3   12
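Dividing the table by the total number of observations gives the joint proportions; the TRUE/TRUE cell matches the \(P(AB) = 0.6\) computed above (a sketch):
prop.table(contingency_table)
##         B
## A       FALSE TRUE
##   FALSE  0.10 0.15
##   TRUE   0.15 0.60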
A chi-square test examines whether rows and columns of a contingency table have a statistically significant association.
Null Hypothesis \(H_0\): A and B are independent.
Alternative Hypothesis \(H_A\): A and B are not independent. There is a relationship between them.
Using a significance level of 0.05: if the chi-square p-value is less than 0.05, results at least this extreme would be unlikely if the null hypothesis were true, and we reject the null hypothesis.
If the p-value is greater than 0.05, there is a good chance we could get these results even if the null were true. In that case we fail to reject the null hypothesis; the data are consistent with A and B being independent, though this does not prove independence.
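Equivalently, the decision rule can be phrased with the critical value of the \(\chi^2\) distribution with 1 degree of freedom (a quick sketch):
qchisq(0.95, df = 1) # reject H0 when the test statistic exceeds this value
## [1] 3.841459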
# Use chisq.test; correct = FALSE disables Yates' continuity correction.
# (The p argument applies only to goodness-of-fit tests on a vector and is
# ignored for a contingency table, so it is dropped here.)
chisq <- chisq.test(contingency_table, correct = FALSE)
chisq
##
## Pearson's Chi-squared test
##
## data: contingency_table
## X-squared = 0.8, df = 1, p-value = 0.3711
# summary() on a table runs the same chi-square test of independence
summary(contingency_table)
## Number of cases in table: 20
## Number of factors: 2
## Test for independence of all factors:
## Chisq = 0.8, df = 1, p-value = 0.3711
## Chi-squared approximation may be incorrect
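The warning appears because two of the expected cell counts (shown below) are less than 5, which makes the large-sample chi-square approximation unreliable. As a cross-check (a sketch, not required by the assignment), Fisher's exact test avoids that approximation entirely:
fisher.test(contingency_table)$p.value
## [1] 0.559791
The exact p-value is also well above 0.05, so the conclusion is unchanged: we fail to reject independence.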
# Observed counts
chisq$observed
##         B
## A       FALSE TRUE
##   FALSE     2    3
##   TRUE      3   12
# Expected counts under the null hypothesis:
round(chisq$expected,2)
##         B
## A       FALSE  TRUE
##   FALSE  1.25  3.75
##   TRUE   3.75 11.25
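As a final sanity check (a sketch), the statistic can be rebuilt by hand from the observed and expected counts, and the p-value recovered from the upper tail of the \(\chi^2_1\) distribution:
sum((chisq$observed - chisq$expected)^2 / chisq$expected)
## [1] 0.8
pchisq(0.8, df = 1, lower.tail = FALSE) # ~0.3711, matching chisq.test above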