I was asked a question at the end of class about how to determine whether two variables are independent. There are two components to the answer. In class, I only emphasized the first component, so I’m sending this out to clarify the second component.

The first component is the definition: Two variables are independent when the distribution of one does not depend on the other. In practice, we can check this by using the conditional distribution. If the probabilities of one variable remain fixed, regardless of which value of the other variable we condition on, then the two variables are independent. Otherwise, they are not.

The second component involves sampling: We do not often have access to the probabilities that generate a variable. We have access only to data obtained through sampling. This means that there is some room for error. The observed conditional frequencies do not have to be exactly equal for the variables to be independent: they need only be roughly equal. We can quantify what it means to be roughly equal, but, here, we’ll use a less rigorous, graphical approach.
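To see why exact equality is too much to ask, here is a minimal simulation sketch. It assumes a made-up set of eye color probabilities (the values in `PROBS` are hypothetical, not taken from any real population) and draws two groups of 50 from that same distribution, so the two variables are independent by construction. The observed frequencies in the two groups will usually still differ a little.

```python
import random
from collections import Counter

COLORS = ["blue", "brown", "green"]
# Hypothetical population probabilities, shared by both groups,
# so eye color is independent of group by construction.
PROBS = [0.35, 0.40, 0.25]

def sample_counts(rng, k=50):
    """Draw k eye colors from the shared distribution and count them."""
    draws = rng.choices(COLORS, weights=PROBS, k=k)
    counts = Counter(draws)
    return {color: counts.get(color, 0) for color in COLORS}

rng = random.Random(0)  # fixed seed so the run is repeatable
female = sample_counts(rng)
male = sample_counts(rng)

# Compare the observed conditional frequencies of the two groups.
for color in COLORS:
    print(f"{color:>5}: female {female[color] / 50:.2f}, male {male[color] / 50:.2f}")
```

Changing the seed gives a different pair of samples, which is a quick way to get a feel for how much two independent groups can differ by chance alone.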

An example is below.

Consider the following experiment. We randomly polled 50 females and 50 males, asking each person their eye color. We recorded the data in the table below. We’re interested in knowing if eye color is independent of gender; that is, we’re interested in knowing if the distribution of eye color is different for each gender.

##         eye_color
## gender   blue brown green
##   female    9    25    16
##   male     28    14     8

Looking at just this table, it may be hard to tell exactly what the relationship is between the variables eye color and gender, so we instead plot the data below.

It seems that, among the people we asked, many more males than females had blue eyes. Also, more of the female participants than the male participants had brown or green eyes. The differences seem extreme enough that we can conclude that one’s eye color depends on one’s gender; that is, we can conclude that gender and eye color are dependent.
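The same comparison can be made numerically by turning each row of the first table into conditional proportions. The sketch below is only an illustration: the counts are copied from the first table, and the helper name `conditional_proportions` is my own.

```python
# Counts copied from the first table (rows: gender, columns: eye color).
counts = {
    "female": {"blue": 9, "brown": 25, "green": 16},
    "male": {"blue": 28, "brown": 14, "green": 8},
}

def conditional_proportions(table):
    """Divide each row by its total to get P(eye color | gender)."""
    result = {}
    for gender, row in table.items():
        total = sum(row.values())
        result[gender] = {color: n / total for color, n in row.items()}
    return result

props = conditional_proportions(counts)
for gender, row in props.items():
    print(gender, {color: round(p, 2) for color, p in row.items()})
```

For this sample, 18% of the females have blue eyes compared with 56% of the males, which is the large gap described above.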

Let’s consider another sample. Forget about the sample we observed above. The experiment remains the same, though.

##         eye_color
## gender   blue brown green
##   female   18    14    18
##   male     18    11    21

Now, we can immediately see that the disparity between males and females in the count of blue-eyed people does not exist in this data. Indeed, there seems to be no great disparity between genders on the other eye colors either. Let’s confirm our impression with a bar graph.

The heights of the bars seem to be roughly the same. It does not seem like the eye color changes significantly depending on the gender; that is, it appears that the variables eye color and gender are independent.
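As a rough numerical companion to the two graphs, the sketch below computes, for each eye color, the gap between the female and male conditional proportions in both samples. The counts are copied from the two tables; the helper name `proportion_gaps` is my own, and the gap is only an informal summary, not a formal test of independence.

```python
# Counts copied from the two tables (rows: gender, columns: eye color).
first = {
    "female": {"blue": 9, "brown": 25, "green": 16},
    "male": {"blue": 28, "brown": 14, "green": 8},
}
second = {
    "female": {"blue": 18, "brown": 14, "green": 18},
    "male": {"blue": 18, "brown": 11, "green": 21},
}

def proportion_gaps(table):
    """Absolute female-vs-male difference in P(eye color | gender), per color."""
    female, male = table["female"], table["male"]
    f_total = sum(female.values())
    m_total = sum(male.values())
    return {
        color: abs(female[color] / f_total - male[color] / m_total)
        for color in female
    }

gaps_first = proportion_gaps(first)
gaps_second = proportion_gaps(second)

print(f"largest gap, first sample:  {max(gaps_first.values()):.2f}")   # 0.38
print(f"largest gap, second sample: {max(gaps_second.values()):.2f}")  # 0.06
```

The largest gap drops from 0.38 in the first sample to 0.06 in the second, which matches what the bar heights suggest.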