Step 1 - Exploratory data analysis

Read in data from the text file.

This text file is simply a copy-paste of the data in the assignment. It is also available here:

https://raw.githubusercontent.com/heathergeiger/Data605_final/master/data605_problem1_variables.txt

data <- read.table("data605_problem1_variables.txt",header=TRUE)

data
##      Y1   Y2   Y3   Y4   X1   X2   X3   X4
## 1  20.3 20.8 28.4 20.2  9.3  7.4  9.5  9.3
## 2  19.1 14.6 21.5 18.6  4.1  6.4  3.7 12.4
## 3  19.3 18.0 20.8 22.6 22.4  8.5 11.7 19.9
## 4  20.9  7.3 22.2 11.4  9.1  9.5  7.4  6.9
## 5  22.0 19.4 21.6 23.6 15.8 11.8  5.3 -1.0
## 6  23.5 13.5 21.8 24.0  7.1  8.8  7.4 10.6
## 7  13.8 14.7 25.2 26.0 15.9  8.4  7.4  6.4
## 8  18.8 15.3 22.5 26.8  6.9  5.1  8.6 10.6
## 9  20.9 12.6 21.1 19.7 16.0 11.4  9.1  1.2
## 10 18.6 13.0 21.7 22.7  6.7 15.1 11.4  7.7
## 11 22.3 13.1 21.4 16.8  8.2 12.6  8.4 15.5
## 12 17.6 10.3 20.8 20.2 16.0  8.0  7.3  6.9
## 13 20.8 14.9 23.0 21.7  6.4 10.3 11.3 13.7
## 14 28.7 14.8 17.4 20.9 11.8 10.4  4.4  3.7
## 15 15.2 16.2 21.3 26.9  3.5  9.5  9.3  4.4
## 16 20.9 15.7 15.1 16.3 21.7  9.5 10.9 11.5
## 17 18.4 16.3 17.8 19.9 12.2 15.1 10.9  4.2
## 18 10.3 11.5 26.4 15.5  9.3  6.6  7.7 13.9
## 19 26.3 12.2 21.6 26.5  8.0 15.4  7.7 12.9
## 20 28.1 11.8 22.5 21.7  6.2  8.2 11.5  1.2

With only 20 rows, it’s a bit hard to tell whether the data are normally distributed. But we can use density plots and stripcharts to get a sense of the distributions.

data_as_vector = c()

par(mfrow=c(2,4))

for(i in 1:ncol(data))
{
#Accumulate all values into one vector for the stripchart below.
data_as_vector = c(data_as_vector,data[,i])
#Compute each kernel density once, then plot it as a line.
variable_density = density(data[,i])
plot(variable_density$x,variable_density$y,type="l",xlab="Values",ylab="Density of observations",main=colnames(data)[i])
}

par(mfrow=c(1,1))
stripchart(data_as_vector ~ rep(colnames(data),each=nrow(data)),method="jitter",pch=21,xlab="")

The variables all look roughly normally distributed.
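
For a more formal check (not required for this problem), a Shapiro-Wilk normality test could be run on each column; a minimal sketch using base R’s shapiro.test:

#Shapiro-Wilk p-value per column. High p-values are consistent with normality,
#though with only n=20 the test has limited power.
sapply(data,function(x) shapiro.test(x)$p.value)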

We also see that the Y variables generally tend to be a bit larger than the X variables.

Let’s choose X4 and Y4 as our variables for further analysis.

Plot a scatterplot to check for correlation.

plot(data$X4,data$Y4,xlab="X4",ylab="Y4")

These two variables do not appear strongly correlated.
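
We could also quantify this with the sample correlation coefficient:

#Pearson correlation between X4 and Y4.
cor(data$X4,data$Y4)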

Step 2 - Calculating quartiles

First, let’s review how to calculate quartiles. This website (http://web.mnstate.edu/peil/MDEV102/U4/S36/S363.html) gives a nice overview.

We first order the observations by the variable of interest, then divide them into the lower half (observations 1-10) and upper half (observations 11-20).

Then, the first quartile is the median of the lower half, while the third quartile is the median of the upper half.

With 20 observations, this gives the first quartile as the mean of the 5th and 6th ordered values, and the third quartile as the mean of the 15th and 16th.

We can also obtain quartiles calculated this way by running R’s quantile function with argument type=2.

first_quartile_X4 = quantile(data$X4,type=2)[2]
third_quartile_Y4 = quantile(data$Y4,type=2)[4]

#Save this for a later question that requires it.
first_quartile_Y4 = quantile(data$Y4,type=2)[2]
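
As a sanity check, we can compute the first quartile of X4 by hand from the definition above; it should match the value from quantile with type=2.

#Median of the lower half = mean of the 5th and 6th smallest of 20 values.
sorted_X4 = sort(data$X4)
mean(sorted_X4[5:6])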

Step 3 - Calculating probabilities

The final step is to calculate the probabilities.

We can do this by counting the exact number of observations meeting each condition.

Question A

Here we read the “|” in P(X>x | Y>y) as “or”, so the event is the union of X>x and Y>y, which splits into three mutually exclusive sub-events:

P(X>x | Y>y) = P(X>x & Y>y) + P(X>x & Y <= y) + P(X <= x & Y>y)

length(which(data$X4 > first_quartile_X4 & data$Y4 > third_quartile_Y4))
## [1] 5
length(which(data$X4 > first_quartile_X4 & data$Y4 <= third_quartile_Y4))
## [1] 10
length(which(data$X4 <= first_quartile_X4 & data$Y4 > third_quartile_Y4))
## [1] 0

Compare to the individual probabilities for X>x and Y>y alone, which are 15/20 and 5/20.

So, if X4 and Y4 are not correlated, and thus the events X>x and Y>y are independent, we would expect the first of the three sub-probabilities to be (15/20) x (5/20) = 3.75/20, the second to be (15/20) x (15/20) = 11.25/20, and the third to be (5/20) x (5/20) = 1.25/20.

The actual proportions we see (5/20, 10/20, and 0/20) are pretty close to these, consistent with X>x and Y>y being independent.

P(X>x | Y>y) is then the sum of the probabilities of the three mutually exclusive events we calculated: (5 + 10 + 0)/20 = 15/20 = 0.75.
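
As a check, the same probability can be computed directly using R’s | (or) operator:

#Proportion of observations where X4 > its 1st quartile or Y4 > its 3rd quartile.
mean(data$X4 > first_quartile_X4 | data$Y4 > third_quartile_Y4)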

Question B

P(X>x, Y>y)

We calculated this as part of question A. p = 5/20 = 0.25.
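
Equivalently, computed directly in R:

#Proportion of observations where X4 > its 1st quartile and Y4 > its 3rd quartile.
mean(data$X4 > first_quartile_X4 & data$Y4 > third_quartile_Y4)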

Question C

Again reading the “|” as “or”, the event splits into three mutually exclusive sub-events:

P(X<x | Y>y) = P(X<x & Y>y) + P(X<x & Y<=y) + P(X>=x & Y>y)

Check if all values in X4 and Y4 are unique.

length(unique(data$X4))
## [1] 17
length(unique(data$Y4))
## [1] 18
first_quartile_X4
## 25% 
## 4.3
data$X4[duplicated(data$X4) == TRUE]
## [1] 10.6  6.9  1.2
third_quartile_Y4
##  75% 
## 23.8
data$Y4[duplicated(data$Y4) == TRUE]
## [1] 20.2 21.7

No, they are not all unique. However, the duplicated values are not equal to the quartile values we are looking at.

Since there are no values in X4 exactly equal to the first quartile and no values in Y4 exactly equal to the third quartile, the <= or >= operators are really the same as just < or > here.
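
We can confirm this directly:

#Both should return FALSE, meaning no value is exactly equal to the quartile.
any(data$X4 == first_quartile_X4)
any(data$Y4 == third_quartile_Y4)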

So, we actually already calculated two of the sub-probabilities here, and the third can be inferred by process of elimination.

  1. P(X<x & Y>y) = P(X <= x & Y > y) = 0/20
  2. P(X<x & Y <= y) = P(X<x & Y<y) = (20 - 5 - 10 - 0)/20 = 5/20 (vs. 3.75/20 expected by chance, so pretty close)
  3. P(X >= x & Y > y) = P(X > x & Y > y) = 5/20

So P(X<x | Y>y) = (0 + 5 + 5)/20 = 10/20 = 0.5.
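
Again, we can verify this directly in R:

#Proportion of observations where X4 < its 1st quartile or Y4 > its 3rd quartile.
mean(data$X4 < first_quartile_X4 | data$Y4 > third_quartile_Y4)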

Step 4 - Counts table

Make a table of counts where we fill in the number of observations meeting the criteria X <= x vs. X > x and Y <= y vs. Y > y.

counts_table = data.frame(y.lt.or.equal.3d = c(5,10,15),y.gt.3d = c(0,5,5),total = c(5,15,20))

rownames(counts_table) = c("X <= x","X > x","Total")
colnames(counts_table) = c("Y <= y","Y > y","Total")

counts_table
##        Y <= y Y > y Total
## X <= x      5     0     5
## X > x      10     5    15
## Total      15     5    20
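
The same table can also be built programmatically rather than by hand; a quick sketch using base R’s table and addmargins:

#Label each observation by whether it falls at/below or above the relevant quartile.
condition_X4 = ifelse(data$X4 <= first_quartile_X4,"X <= x","X > x")
condition_Y4 = ifelse(data$Y4 <= third_quartile_Y4,"Y <= y","Y > y")
#Cross-tabulate and add row/column totals.
addmargins(table(condition_X4,condition_Y4))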

Extra step - check independence of X > 1st quartile, Y > 1st quartile

We calculate a counts table similar to the one from the previous step, except now we look at the 1st quartile of Y instead of the 3rd quartile.

First, check that no values in Y are exactly equal to the first quartile value.

length(which(data$Y4 == first_quartile_Y4))
## [1] 0
length(which(data$Y4 > first_quartile_Y4))
## [1] 15
length(which(data$Y4 < first_quartile_Y4))
## [1] 5

No values exactly equal, so that’s good.

Now, fill in the counts table. We don’t need the total row/column this time.

length(which(data$X4 < first_quartile_X4 & data$Y4 < first_quartile_Y4))
## [1] 0
length(which(data$X4 > first_quartile_X4 & data$Y4 < first_quartile_Y4))
## [1] 5
length(which(data$X4 < first_quartile_X4 & data$Y4 > first_quartile_Y4))
## [1] 5
length(which(data$X4 > first_quartile_X4 & data$Y4 > first_quartile_Y4))
## [1] 10
counts_table_pt2 <- data.frame(y.lt.1d = c(0,5),y.gt.1d = c(5,10))

rownames(counts_table_pt2) = c("X < 1st quartile","X > 1st quartile")
colnames(counts_table_pt2) = c("Y < 1st quartile","Y > 1st quartile")

counts_table_pt2
##                  Y < 1st quartile Y > 1st quartile
## X < 1st quartile                0                5
## X > 1st quartile                5               10

Compare this to the table of counts we would expect under independence, where each cell is 20 times the product of the marginal probabilities (e.g. 20 x (5/20) x (5/20) = 1.25). Note that these fractional values are not actually possible, as we can’t have fractional observations.

calculated_probability_counts <- data.frame(y.lt.1d = c(1.25,3.75),y.gt.1d = c(3.75,11.25))

rownames(calculated_probability_counts) = c("X < 1st quartile","X > 1st quartile")
colnames(calculated_probability_counts) = c("Y < 1st quartile","Y > 1st quartile")

calculated_probability_counts
##                  Y < 1st quartile Y > 1st quartile
## X < 1st quartile             1.25             3.75
## X > 1st quartile             3.75            11.25
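
These expected counts can also be computed as the outer product of the marginal counts, divided by the total number of observations:

#Marginal counts are 5 and 15 for both X4 and Y4 relative to their 1st quartiles.
outer(c(5,15),c(5,15))/20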

The observed counts don’t exactly match these expected counts, even after rounding, but they are pretty close!

Run a chi-squared test.

chisq.test(counts_table_pt2)
## Warning in chisq.test(counts_table_pt2): Chi-squared approximation may be
## incorrect
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  counts_table_pt2
## X-squared = 0.8, df = 1, p-value = 0.3711
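
Note the warning about the chi-squared approximation: it reflects the small expected counts here (three of the four cells have expected counts below 5). If we wanted an exact alternative, Fisher’s exact test could be run on the same table:

#Fisher's exact test avoids the large-sample chi-squared approximation.
fisher.test(as.matrix(counts_table_pt2))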

Now, let’s interpret the chi-squared results.

Think about having two different categorical variables, each with some number of levels (two each in this example).

If the two variables are not associated, the proportion of times you would expect to see each combination of levels would be roughly equal to the product of the probabilities of the individual levels.

I say “roughly equal” because with random sampling, you would expect some variation from the exact probabilities.

The chi-squared test compares the observed counts for each combination of levels to the counts that would be expected by chance. If the p-value is low, we say there is a low probability that the combinations we see are due to chance, and thus a high chance that the variables are associated.

Here, the p-value is high (0.37), suggesting instead that the two variables are not associated.

This matches what we predicted in the earlier sections.