Variables

Pick one of the quantitative independent variables (Xi) from the data set below, and define that variable as X. Also, pick one of the dependent variables (Yi) below, and define that as Y.

I selected X4 as X and Y4 as Y.

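library(dplyr)  # provides %>%, select() and rename()
library(knitr)  # provides kable(), used for the quartile table

df <- read.csv("./data/data.csv", stringsAsFactors = FALSE, header = TRUE) %>%
  dplyr::select(X4, Y4) %>%
  dplyr::rename(X = X4, Y = Y4)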

X <- df$X
Y <- df$Y

Probability

Calculate as a minimum the below probabilities a through c.
Assume the small letter “x” is estimated as the 3rd quartile of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable.
Interpret the meaning of all probabilities.

  • Prep for probability
summary(df)
##        X                Y        
##  Min.   :-1.000   Min.   :11.40  
##  1st Qu.: 4.350   1st Qu.:19.43  
##  Median : 8.500   Median :21.30  
##  Mean   : 8.595   Mean   :21.10  
##  3rd Qu.:12.525   3rd Qu.:23.70  
##  Max.   :19.900   Max.   :26.90
# Assign quartile values to variables
x <- quantile(X, probs = 0.75) # x.q3
y <- quantile(Y, probs = 0.25) # y.q1

total <- nrow(df)

#get P(Y>y)
Yy<- df[df$Y > y,]
pY <- round(nrow(Yy) / total, 4)

#get P(X>x)
Xx <- df[df$X > x, ]
pX <- round(nrow(Xx) / total, 4)
  a. P(X>x | Y>y)
# get P(X>x | Y>y): among the observations with Y>y, the share that also have X>x
p1 <- round(nrow(Yy[Yy$X > x,]) / nrow(Yy), 4)
print(paste0("P(X>x | Y>y) = ", p1))
## [1] "P(X>x | Y>y) = 0.2"
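  b. P(X>x, Y>y)
# get P(X>x, Y>y): count the rows where both events occur
# (multiplying pX * pY would only be valid if the two events were independent)
p2 <- round(nrow(df[df$X > x & df$Y > y, ]) / total, 4)
print(paste0("P(X>x, Y>y) = ", p2))
## [1] "P(X>x, Y>y) = 0.15"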
  c. P(X<x | Y>y)
# get P(X<x | Y>y): among the observations with Y>y, the share with X at or below x
p3 <- round(nrow(df[X <= x & Y > y, ]) / nrow(Yy), 4)
print(paste0("P(X<x | Y>y) = ", p3))
## [1] "P(X<x | Y>y) = 0.8"
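As a cross-check on the three probabilities, the sketch below rebuilds them from the four cell counts reported in the quartile table (3, 12, 2 and 3 observations, 20 in total). The variable names are illustrative and are not part of the original analysis. It also computes the product P(X>x)P(Y>y), which would equal the joint probability only if the two events were independent.

```r
# Cell counts taken from the quartile table; variable names are illustrative only
n_xlo_ylo <- 3   # X <= x, Y <= y
n_xlo_yhi <- 12  # X <= x, Y >  y
n_xhi_ylo <- 2   # X >  x, Y <= y
n_xhi_yhi <- 3   # X >  x, Y >  y
n_total <- n_xlo_ylo + n_xlo_yhi + n_xhi_ylo + n_xhi_yhi

p_y_hi <- (n_xlo_yhi + n_xhi_yhi) / n_total  # P(Y>y)
p_x_hi <- (n_xhi_ylo + n_xhi_yhi) / n_total  # P(X>x)

p_joint   <- n_xhi_yhi / n_total               # P(X>x, Y>y)
p_cond_hi <- p_joint / p_y_hi                  # P(X>x | Y>y) = joint / P(Y>y)
p_cond_lo <- (n_xlo_yhi / n_total) / p_y_hi    # P(X<=x | Y>y)
p_product <- p_x_hi * p_y_hi                   # equals p_joint only under independence

print(c(joint = p_joint, cond_hi = p_cond_hi,
        cond_lo = p_cond_lo, product = p_product))
```

The two conditional probabilities sum to 1 because they split the same conditioning event Y>y. The product (0.1875) differs from the joint probability (0.15), so it cannot stand in for P(X>x, Y>y) here.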

Quartile Table (Count)

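# Cell counts: rows split X at its 3rd quartile (x), columns split Y at its 1st quartile (y)
c1<-nrow(df[X<=x & Y<=y, ])
c2<-nrow(df[X<=x & Y>y, ])
c3<-c1+c2
c4<-nrow(df[X>x & Y<=y, ])
c5<-nrow(df[X>x & Y>y, ])
c6<-c4+c5
c7<-c1+c4
c8<-c2+c5
c9<-c3+c6

count.table<-matrix(c(c1,c2,c3,
                      c4,c5,c6,
                      c7,c8,c9), ncol=3, nrow=3, byrow=TRUE)

# Rows are the X categories and columns are the Y categories, matching the byrow fill above
rownames(count.table) <- c("X <= 3rd quartile","X > 3rd quartile","Total")
colnames(count.table) <- c("Y <= 1st quartile","Y > 1st quartile","Total")
count.table<-as.table(count.table)

kable(count.table)
|                   | Y <= 1st quartile | Y > 1st quartile | Total |
|:------------------|------------------:|-----------------:|------:|
| X <= 3rd quartile |                 3 |               12 |    15 |
| X > 3rd quartile  |                 2 |                3 |     5 |
| Total             |                 5 |               15 |    20 |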
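The nine hand-computed counts can also be produced by cross-tabulating two indicator variables. The sketch below is one way to do that with base R's table(), addmargins() and prop.table(). The data are simulated stand-ins (the original data.csv is not reproduced here), and the names x_group, y_group and counts are hypothetical, so the printed numbers will not match the table above.

```r
# Hypothetical stand-in data; the original data.csv is not reproduced here
set.seed(42)
X <- round(runif(20, min = -1, max = 20), 1)
Y <- round(rnorm(20, mean = 21, sd = 3), 1)

x <- quantile(X, probs = 0.75)  # 3rd quartile of X
y <- quantile(Y, probs = 0.25)  # 1st quartile of Y

# Label each observation by the side of the cut point it falls on
x_group <- factor(X > x, levels = c(FALSE, TRUE), labels = c("X <= Q3", "X > Q3"))
y_group <- factor(Y > y, levels = c(FALSE, TRUE), labels = c("Y <= Q1", "Y > Q1"))

# Cross-tabulate: rows are the X groups, columns are the Y groups
counts <- table(x_group, y_group)

# Append row and column totals (labelled "Sum" by addmargins)
counts_with_totals <- addmargins(counts)
print(counts_with_totals)

# Column-wise proportions: each column gives P(X group | Y group)
cond_on_y <- prop.table(counts, margin = 2)
print(cond_on_y)
```

Naming the dimensions in one place keeps the row and column labels tied to the variables they describe, which avoids mislabelling the axes when the table is assembled by hand.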

Independence

Does splitting the training data in this fashion make them independent?

Let A be the new variable counting those observations above the 1st quartile for X, and let B be the new variable counting those observations above the 1st quartile for Y.

Does P(AB)=P(A)P(B)?

Check mathematically, and then evaluate by running a Chi Square test for association.

x.q1 <- quantile(X, probs = 0.25) # x.q1 = 4.35
y.q1 <- quantile(Y, probs = 0.25) # y.q1 = 19.425

A<-subset(df, df$X>x.q1)
B<-subset(df, df$Y>y.q1)
# P(AB)
p.ab <- nrow(subset(df, df$X>x.q1 & df$Y>y.q1)) / total
# P(A)P(B)
pa <- nrow(A) / total
pb <- nrow(B) / total
pa.pb <- pa*pb
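# Compare with a tolerance instead of ==, because both sides are floating-point values
isTRUE(all.equal(p.ab, pa.pb))
## [1] FALSE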

Splitting the data into X above/below the 1st quartile and Y above/below the 1st quartile does not make \(A\) and \(B\) independent: the check above shows that P(AB) differs from P(A)P(B). Subsetting the observations only records which events occurred; it does not change whether the occurrence of one event affects the probability of the other.
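The mathematical check can also be reasoned through without the raw data. The sketch below assumes that 15 of the 20 observations lie above each 1st quartile. Those counts are inferred rather than printed in the original: the quartile table reports 15 observations with Y above its 1st quartile, and a quartile cut on 20 values without ties leaves 15 above it. Under that assumption, P(A)P(B) cannot equal any value that P(AB) can take.

```r
n   <- 20   # total observations, from the quartile table
n_a <- 15   # observations with X above its 1st quartile (inferred, see the note above)
n_b <- 15   # observations with Y above its 1st quartile (inferred, see the note above)

p_a_p_b <- (n_a / n) * (n_b / n)   # P(A)P(B)

# P(AB) is a count k out of n, so these are the only values it can take
attainable <- (0:n) / n

# Exact independence would need k = n * P(A)P(B) to be a whole number
k_needed <- n * p_a_p_b

print(p_a_p_b)    # 0.5625
print(k_needed)   # 11.25
print(any(abs(attainable - p_a_p_b) < 1e-9))   # FALSE
```

Exact equality is therefore a very strict criterion in a sample this small. A test of association asks the more useful question of whether the departure from P(A)P(B) is larger than chance alone would explain.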

chisq.test(A, B)
## 
##  Pearson's Chi-squared test
## 
## data:  A
## X-squared = 21.076, df = 14, p-value = 0.0997

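A chi-square test for independence compares two categorical variables in a contingency table to see whether they are related. The chi-squared statistic is a single number that measures how far the observed counts are from the counts we would expect if there were no relationship at all in the population. In a more general sense, it tests whether the distributions of categorical variables differ from one another. Chi-squared is meant for categorized data, but here it was given numeric values: A and B are data frames of raw X and Y values, so chisq.test(A, B) converts A to a matrix, ignores B, and treats the rows of A as the rows of a contingency table. That is why the output reports "data: A" and df = 14, which is (rows - 1)(columns - 1) for a 15 x 2 table. The result should therefore not be read as a test of association between \(A\) and \(B\); that would require the 2 x 2 table of the indicators X > x.q1 and Y > y.q1.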
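To make the comparison of observed and expected counts concrete, the sketch below computes the Pearson statistic by hand for the 2 x 2 counts reported in the quartile table (X split at its 3rd quartile, Y split at its 1st quartile). This is not the A and B split used in this section, whose cell counts are not printed; it only illustrates the mechanics. The variable names are illustrative.

```r
# Observed counts from the quartile table
# rows: X <= Q3, X > Q3; columns: Y <= Q1, Y > Q1
observed <- matrix(c(3, 12,
                     2,  3), nrow = 2, byrow = TRUE)

# Expected counts under independence: (row total * column total) / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson statistic: sum over cells of (observed - expected)^2 / expected
chi_sq <- sum((observed - expected)^2 / expected)

# Degrees of freedom for an r x c table: (r - 1) * (c - 1)
dof <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Upper-tail probability of the chi-squared distribution
p_value <- pchisq(chi_sq, df = dof, lower.tail = FALSE)

print(expected)
print(chi_sq)   # 0.8
print(p_value)
```

Three of the four expected counts are below 5, so the chi-squared approximation is rough for a table this small and the p-value should be treated with caution.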

For the test as run, the p-value (0.0997) is greater than 0.05, so we fail to reject the null hypothesis of independence. Failing to reject is not proof of independence: it means only that there is not enough evidence to conclude that the variables are associated.
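A test of association between the two events would normally be run on the 2 x 2 table of indicators rather than on the subsetted data frames. The sketch below shows one way to set that up. It uses simulated stand-in data because the original file is not available here, so its p-values say nothing about the real data set. In this sketch A and B are logical vectors, not the data frame subsets defined earlier, and Fisher's exact test is included as an assumption-labelled alternative for small samples.

```r
# Hypothetical stand-in data; replace with the X and Y columns of the real data set
set.seed(1)
X <- round(runif(20, min = -1, max = 20), 1)
Y <- round(rnorm(20, mean = 21, sd = 3), 1)

x.q1 <- quantile(X, probs = 0.25)
y.q1 <- quantile(Y, probs = 0.25)

# Indicator variables: TRUE when the observation is above the 1st quartile
A <- X > x.q1
B <- Y > y.q1

# 2 x 2 contingency table of the two indicators
tab <- table(A, B)
print(tab)

# Pearson chi-squared test on the table; with 20 observations R may warn
# that the approximation is unreliable, so the warning is suppressed here
chi <- suppressWarnings(chisq.test(tab))

# Fisher's exact test does not rely on the large-sample approximation
fis <- fisher.test(tab)

print(chi$p.value)
print(fis$p.value)
```

With a 2 x 2 table the chi-squared test has 1 degree of freedom, which is a quick sanity check that the test was run on the intended table.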