Pick one of the quantitative independent variables (Xi) from the data set below, and define that variable as X. Also, pick one of the dependent variables (Yi) below, and define that as Y.
I selected X4 and Y4.
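library(dplyr)  # provides %>%, select() and rename()
library(knitr)  # provides kable(), used for the count table
# Read the data, keep the chosen columns and rename them to X and Y
df <- read.csv("./data/data.csv", stringsAsFactors = FALSE, header = TRUE) %>%
  dplyr::select(X4, Y4) %>%
  dplyr::rename(X = X4, Y = Y4)
X <- df$X
Y <- df$Y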
Calculate, as a minimum, the probabilities a through c below.
Assume the small letter “x” is estimated as the 3rd quartile of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable.
Interpret the meaning of all probabilities.
summary(df)
## X Y
## Min. :-1.000 Min. :11.40
## 1st Qu.: 4.350 1st Qu.:19.43
## Median : 8.500 Median :21.30
## Mean : 8.595 Mean :21.10
## 3rd Qu.:12.525 3rd Qu.:23.70
## Max. :19.900 Max. :26.90
# Assign quartile values to variables
x <- quantile(X, probs = 0.75) # x.q3
y <- quantile(Y, probs = 0.25) # y.q1
total <- nrow(df)
#get P(Y>y)
Yy<- df[df$Y > y,]
pY <- round(nrow(Yy) / total, 4)
#get P(X>x)
Xx <- df[df$X > x, ]
pX <- round(nrow(Xx) / total, 4)
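P(X>x | Y>y)
#get P(X>x | Y>y): among the rows with Y>y, the share that also have X>x
p1 <- round(nrow(Yy[Yy$X > x,]) / nrow(Yy), 4)
print(paste0("P(X>x | Y>y) = ", p1))
## [1] "P(X>x | Y>y) = 0.2"
Among the 15 observations with Y above its 1st quartile, 3 (20%) also have X above its 3rd quartile.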
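P(X>x, Y>y)
#get P(X>x, Y>y): share of all rows with both X>x and Y>y
p2 <- round(nrow(df[df$X > x & df$Y > y, ]) / total, 4)
print(paste0("P(X>x, Y>y) = ", p2))
## [1] "P(X>x, Y>y) = 0.15"
Of all 20 observations, 3 (15%) have both X above its 3rd quartile and Y above its 1st quartile. The product pX * pY = 0.1875 would equal this joint probability only if the two events were independent.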
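P(X<x | Y>y)
#get P(X<x | Y>y): among the rows with Y>y, the share with X at or below x
p3<-round(nrow(df[X<=x & Y>y,])/nrow(Yy), 4)
print(paste0("P(X<x | Y>y) = ", p3))
## [1] "P(X<x | Y>y) = 0.8"
Among the 15 observations with Y above its 1st quartile, 12 (80%) have X at or below its 3rd quartile.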
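The same three probabilities can also be read off logical vectors, because `mean()` of a logical vector is the proportion of `TRUE` values. The sketch below uses simulated data (the `sim` data frame is hypothetical and stands in for the real file), so its numbers will not match the results above.

```r
# Hypothetical data standing in for X4 and Y4 (an assumption, not the real file)
set.seed(1)
sim <- data.frame(X = runif(20, 0, 20), Y = rnorm(20, 21, 3))

x <- quantile(sim$X, probs = 0.75)  # 3rd quartile of X
y <- quantile(sim$Y, probs = 0.25)  # 1st quartile of Y

above_x <- sim$X > x  # TRUE where X exceeds its 3rd quartile
above_y <- sim$Y > y  # TRUE where Y exceeds its 1st quartile

p_joint   <- mean(above_x & above_y)   # P(X>x, Y>y): both events, over all rows
p_cond_gt <- mean(above_x[above_y])    # P(X>x | Y>y): restrict to rows with Y>y
p_cond_le <- mean(!above_x[above_y])   # P(X<=x | Y>y): complement within Y>y
```

The conditional probability divides by the number of rows with Y>y, while the joint probability divides by the total number of rows; that denominator is the only difference between the two.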
# Counts for the table: rows split X at its 3rd quartile, columns split Y at its 1st quartile
c1<-nrow(df[X<=x & Y<=y, ])
c2<-nrow(df[X <=x & Y>y, ])
c3<-c1+c2
c4<-nrow(df[X >x & Y<=y, ])
c5<-nrow(df[X >x & Y>y, ])
c6<-c4+c5
c7<-c1+c4
c8<-c2+c5
c9<-c3+c6
count.table<-matrix(c(c1,c2,c3,
c4,c5,c6,
c7,c8,c9), ncol=3, nrow=3, byrow=TRUE)
colnames(count.table) <- c("Y <= 1st quartile","Y > 1st quartile","Total")
rownames(count.table) <- c("X <= 3rd quartile","X > 3rd quartile","Total")
count.table<-as.table(count.table)
kable(count.table)
| | Y <= 1st quartile | Y > 1st quartile | Total |
|---|---|---|---|
| X <= 3rd quartile | 3 | 12 | 15 |
| X > 3rd quartile | 2 | 3 | 5 |
| Total | 5 | 15 | 20 |
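A table like the one above can also be built without counting each cell by hand. The following is a minimal sketch using base R's `table()` and `addmargins()`, again on simulated data (`sim`, `x_group` and `y_group` are hypothetical names, not part of the original analysis).

```r
# Hypothetical data standing in for X4 and Y4 (an assumption, not the real file)
set.seed(1)
sim <- data.frame(X = runif(20, 0, 20), Y = rnorm(20, 21, 3))

x <- quantile(sim$X, probs = 0.75)  # 3rd quartile of X
y <- quantile(sim$Y, probs = 0.25)  # 1st quartile of Y

# Label each observation by which side of the cut-off it falls on
x_group <- factor(sim$X > x, levels = c(FALSE, TRUE),
                  labels = c("X <= Q3", "X > Q3"))
y_group <- factor(sim$Y > y, levels = c(FALSE, TRUE),
                  labels = c("Y <= Q1", "Y > Q1"))

counts <- table(x_group, y_group)        # 2 x 2 cross-tabulation of counts
counts_with_totals <- addmargins(counts) # appends "Sum" row and column
print(counts_with_totals)
```

Naming the factor levels explicitly keeps the row and column labels tied to the variable they describe, which makes it harder to swap the X and Y labels by accident.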
Does splitting the training data in this fashion make them independent?
Let A be the new variable counting those observations above the 1st quartile for X, and let B be the new variable counting those observations above the 1st quartile for Y.
Does P(AB)=P(A)P(B)?
Check mathematically, and then evaluate by running a Chi Square test for association.
x.q1 <- quantile(X, probs = 0.25) # x.q1 = 4.35
y.q1 <- quantile(Y, probs = 0.25) # y.q1 = 19.425
A<-subset(df, df$X>x.q1)
B<-subset(df, df$Y>y.q1)
# P(AB)
p.ab <- nrow(subset(df, df$X>x.q1 & df$Y>y.q1)) / total
# P(A)P(B)
pa <- nrow(A) / total
pb <- nrow(B) / total
pa.pb <- pa*pb
isTRUE(all.equal(p.ab, pa.pb)) # tolerant comparison; == on doubles can fail from rounding alone
## [1] FALSE
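The mathematical check amounts to comparing the observed joint proportion with the product of the two marginal proportions. The sketch below works through that arithmetic with hypothetical counts (`n_a`, `n_b` and `n_ab` are made up for illustration and are not taken from the data), and uses `all.equal()` so that floating-point rounding does not produce a spurious mismatch.

```r
# Hypothetical counts, chosen only to illustrate the comparison
n    <- 20  # total observations
n_a  <- 15  # observations in A
n_b  <- 15  # observations in B
n_ab <- 12  # observations in both A and B

p_a  <- n_a / n    # P(A)  = 0.75
p_b  <- n_b / n    # P(B)  = 0.75
p_ab <- n_ab / n   # P(AB) = 0.6

# Independence requires P(AB) = P(A)P(B); here 0.6 vs 0.5625
independent <- isTRUE(all.equal(p_ab, p_a * p_b))

# Why a tolerance helps: exact equality can fail for equal quantities
exact_equal    <- (0.1 * 3 == 0.3)                # FALSE due to rounding
tolerant_equal <- isTRUE(all.equal(0.1 * 3, 0.3)) # TRUE
```

With 20 observations, a joint proportion is always a multiple of 1/20, so when both marginals are 0.75 the product 0.5625 can never be matched exactly in the sample.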
Splitting the data into X above/below its 1st quartile and Y above/below its 1st quartile does not make \(A\) and \(B\) independent. Subsetting observations at a threshold only defines the events; whether they are independent depends on whether one event occurring changes the probability of the other, that is, on whether \(P(AB) = P(A)P(B)\). Here the two quantities differ.
chisq.test(A, B)
##
## Pearson's Chi-squared test
##
## data: A
## X-squared = 21.076, df = 14, p-value = 0.0997
A chi-square test for independence compares two variables in a contingency table to see if they are related. The chi-squared statistic is a single number that measures how far the observed counts are from the counts expected if there were no relationship at all in the population. In a more general sense, it tests whether the distributions of categorical variables differ from one another. The test is meant for categorical counts, but here it was given the numeric subsets directly. Because `A` is a data frame, `chisq.test` converts it to a matrix, treats it as a 15 × 2 contingency table of raw values, and ignores `B`, which is why the output reports `data: A` and df = 14 = (15 − 1)(2 − 1).
The p-value (0.0997) is greater than 0.05, so we fail to reject the null hypothesis that the variables are independent. There is not enough evidence to conclude that they are associated.
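A test of association between A and B would normally be run on the 2 × 2 table of counts for the two events rather than on the raw values. The sketch below shows one way to set that up, assuming simulated data (`sim` is hypothetical), so no result from it should be read as a finding about X4 and Y4. With only 20 observations some expected counts fall below 5, so Fisher's exact test is included as a possible alternative.

```r
# Hypothetical data standing in for X4 and Y4 (an assumption, not the real file)
set.seed(1)
sim <- data.frame(X = runif(20, 0, 20), Y = rnorm(20, 21, 3))

# Indicator for each event: observation lies above the 1st quartile
a <- sim$X > quantile(sim$X, probs = 0.25)
b <- sim$Y > quantile(sim$Y, probs = 0.25)

tab <- table(A = a, B = b)  # 2 x 2 contingency table of counts

# Pearson chi-square test (Yates-corrected by default for 2 x 2 tables);
# small expected counts trigger a warning, suppressed here for brevity
chi_result <- suppressWarnings(chisq.test(tab))

# Fisher's exact test does not rely on the large-sample approximation
fisher_result <- fisher.test(tab)

print(chi_result$parameter)   # degrees of freedom: (2 - 1) * (2 - 1) = 1
print(chi_result$p.value)
print(fisher_result$p.value)
```

In both tests the null hypothesis is that A and B are independent, so a large p-value means the data do not show evidence of association; it does not prove independence.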