Do men and women watch the Superbowl for different reasons (commercials, half time show, football)?
Does political party membership (Democrat, Republican, Independent) affect opinion on the death penalty?
Is customer age (<18, 18-35, >35) important in deciding to subscribe to an online music service?

Each of these questions asks about the relationship between two categorical variables. In this lesson, we will consider graphical and numerical summaries of such a relationship.

In this lesson, we will use the pulse dataset. Eight variables were collected on 91 college students. Each student was asked to flip a coin. Those whose coin came up heads were asked to run in place for a minute or so. All students measured their own pulse rate both before and after the running activity took place. We will investigate whether there is a relationship between whether a student smoked or not, Smoke, and the student’s gender, Sex.

Here is some of the data:

  PuBefore PuAfter Ran Smoke    Sex Height Weight ActLev
1       54      56  no   yes   male     69    145      2
2       54      50  no    no   male     69    160      2
3       58      70 yes    no   male     72    145      2
4       58      58  no    no   male     66    135      3
5       58      56  no    no female     67    125      2
6       60      76 yes    no   male     71    170      3

Example 1: Identify the observational units, the variables and the types of variables for the following examples from above.

Do men and women watch the Superbowl for different reasons (commercials, half time show, football)?
Does political party membership (Democrat, Republican, Independent) affect opinion on the death penalty?
Is customer age (<18, 18-35, >35) important in deciding to subscribe to an online music service?
Is there a relationship between whether a student smokes and the student’s gender?

Answers:

The observational units are Superbowl viewers. The variables are gender (categorical) and reason for watching (categorical).
The observational units are registered voters. The variables are political party (categorical) and opinion (categorical).
The observational units are customers. The variables are age (categorical) and decision to subscribe (categorical). Note that age is usually quantitative, but once it is grouped into categories it is treated as categorical.
The observational units are college students. The variables are smoking status (categorical) and gender (categorical).

Contingency table

A contingency table gives us the counts in each possible combination of the two variables.

In Lesson 1.2, we made a frequency table with the table command in R as a way of summarizing one categorical variable. When we have two categorical variables, we will once again use the table command in R, this time to create a contingency table.

Notice that we can use the table command with different inputs (one categorical variable or two categorical variables) and get different outputs (just like the boxplot command from Lesson 2.1).

The first variable listed in the table command will make up the rows of the contingency table and the second variable listed will make up the columns. This will be important later on when we make a stacked barchart.

The R commands below create a contingency table of Smoking Status and Sex with Smoking Status as the rows. The contingency table is stored in an object called t1.

> t1 <- table(pulse$Smoke,pulse$Sex)
> t1

     
      female male
  no      27   37
  yes      8   19

We can see, for example, that there are 27 students who are nonsmoking females and there are 19 male smokers.
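
If the pulse data file is not at hand, the same table can be rebuilt from the four counts printed above. This is only a sketch: the object name t1_manual and the dimension labels Smoke and Sex are choices made here for illustration, not part of the original data. It also uses addmargins, which appends the row and column totals, and t, which swaps the rows and columns (the same layout we would get by listing Sex first in the table command).

```r
# Rebuild the contingency table from the counts printed above.
# matrix() fills column by column: the female column first, then male.
t1_manual <- as.table(matrix(c(27, 8, 37, 19), nrow = 2,
                             dimnames = list(Smoke = c("no", "yes"),
                                             Sex = c("female", "male"))))
t1_manual

# Append the row and column totals (labelled "Sum").
addmargins(t1_manual)

# Swap rows and columns so that Sex makes up the rows.
t(t1_manual)
```

The later commands in this lesson should behave the same way on t1_manual as they do on t1, since both hold the same counts.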

Marginal distributions

The marginal distribution is a frequency table of either the row variable or the column variable in the contingency table. It is essentially looking at each categorical variable separately. The R command margin.table gives the marginal distribution from the contingency table. The margin = 1 argument asks for the marginal distribution of the row variable and the margin = 2 argument asks for the marginal distribution of the column variable.

> margin.table(t1,margin=1)


 no yes 
 64  27

> margin.table(t1,margin=2)


female   male 
    35     56
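
The marginal counts are simply the row and column totals of the contingency table, so they can also be found with rowSums and colSums. The sketch below assumes the table is rebuilt by hand from the counts above (t1_manual is a name chosen here, not part of the original lesson) and then converts the marginal counts to percentages of all students.

```r
# Contingency table rebuilt from the counts shown earlier.
t1_manual <- as.table(matrix(c(27, 8, 37, 19), nrow = 2,
                             dimnames = list(Smoke = c("no", "yes"),
                                             Sex = c("female", "male"))))

# Marginal counts: the same numbers margin.table returns.
smoke_counts <- rowSums(t1_manual)   # no = 64, yes = 27
sex_counts <- colSums(t1_manual)     # female = 35, male = 56

# Marginal distributions as percentages of all students.
round(100 * smoke_counts / sum(t1_manual), 1)
round(100 * sex_counts / sum(t1_manual), 1)
```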

Conditional distributions

The conditional distribution shows the distribution of one variable given one value of the other variable. It involves comparing percentages instead of counts. However, there are several types of percentages that we can compute: overall percentages, column percentages and row percentages.

The overall percentage divides each value in the contingency table by the total number of observations (in this case 91). This can be found using the prop.table command in R.

> prop.table(t1)

     
          female       male
  no  0.29670330 0.40659341
  yes 0.08791209 0.20879121

For example, 29.7% of the students were females who don’t smoke and 20.9% of the students were males who do smoke. Notice that all the entries add up to 1. We say “of the students” because we are dividing everything by all 91 students.
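
To see where the overall percentages come from, we can do the division ourselves. The sketch below assumes the table is rebuilt by hand from the counts shown earlier (t1_manual and overall are names chosen here for illustration) and checks the result against prop.table.

```r
# Contingency table rebuilt from the counts shown earlier.
t1_manual <- as.table(matrix(c(27, 8, 37, 19), nrow = 2,
                             dimnames = list(Smoke = c("no", "yes"),
                                             Sex = c("female", "male"))))

# Divide every cell by the total number of students.
overall <- t1_manual / sum(t1_manual)

# Show the result as percentages rounded to one decimal place.
round(100 * overall, 1)

# The hand calculation agrees with prop.table, and the cells sum to 1.
all.equal(as.vector(overall), as.vector(prop.table(t1_manual)))
sum(overall)
```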

The row percentage divides each value in the contingency table by its row total (i.e., the marginal distribution of smoking status from the previous section). This can be found using the prop.table command with the margin = 1 argument in R.

> prop.table(t1,margin=1)

     
         female      male
  no  0.4218750 0.5781250
  yes 0.2962963 0.7037037

For example, 42.2% of the nonsmokers are female and 70.4% of the smokers are male. Each row sums to 1. We say “of the nonsmokers” because we divided the nonsmoker row by the total number of nonsmokers and “of the smokers” because we divided the smoker row by the total number of smokers. This is called the conditional distribution of gender given smoking status. Each row shows us the whole distribution of gender (female/male) for each smoking status separately.
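
The row percentages can also be computed by hand, which makes the "divide each row by its row total" step explicit. This is a sketch under the assumption that the table is rebuilt from the counts shown earlier; t1_manual, row_totals and row_pct are names chosen here. It uses the base R function sweep to divide each row by its own total.

```r
# Contingency table rebuilt from the counts shown earlier.
t1_manual <- as.table(matrix(c(27, 8, 37, 19), nrow = 2,
                             dimnames = list(Smoke = c("no", "yes"),
                                             Sex = c("female", "male"))))

# Row totals: the number of nonsmokers and the number of smokers.
row_totals <- rowSums(t1_manual)

# Divide each row (MARGIN = 1) by its own row total.
row_pct <- sweep(t1_manual, 1, row_totals, "/")
round(row_pct, 3)

# Each row is a complete distribution of gender, so each row sums to 1.
rowSums(row_pct)
```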

The column percentage divides each value in the contingency table by its column total (i.e., the marginal distribution of gender from the previous section). This can be found using the prop.table command with the margin = 2 argument in R.

> prop.table(t1,margin=2)

     
         female      male
  no  0.7714286 0.6607143
  yes 0.2285714 0.3392857

For example, 77.1% of the females do not smoke and 66.1% of the males do not smoke. Each column sums to 1. We say “of the females” because we divided the female column by the total number of females and “of the males” because we divided the male column by the total number of males. This is called the conditional distribution of smoking status given gender. Each column shows us the whole distribution of smoking status (yes/no) for each gender separately.
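
Since the comparison of interest is the smoking rate within each gender, it can help to pull the "yes" row out of the column percentages. The sketch below assumes the table is rebuilt by hand from the counts shown earlier; t1_manual, col_pct and smoking_rate are names chosen here for illustration.

```r
# Contingency table rebuilt from the counts shown earlier.
t1_manual <- as.table(matrix(c(27, 8, 37, 19), nrow = 2,
                             dimnames = list(Smoke = c("no", "yes"),
                                             Sex = c("female", "male"))))

# Conditional distribution of smoking status given gender.
col_pct <- prop.table(t1_manual, margin = 2)

# Each column is a complete distribution, so each column sums to 1.
colSums(col_pct)

# Smoking rate within each gender: the "yes" row of the column percentages.
smoking_rate <- col_pct["yes", ]
round(100 * smoking_rate, 1)
```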

The conditional distribution of smoking status given gender allows us to compare the rate of smoking in the two genders and is the key to describing the relationship between the two categorical variables. We can see that there is a slightly higher rate of smoking for men (34%) compared to women (23%). To see this even more clearly, we can make a stacked barchart of this conditional distribution.

Stacked barchart

To make a stacked barchart, use the barplot command on the conditional distribution of smoking status given gender in the previous section. This only works on column percentages since the barplot command makes each bar out of each column.

> barplot(prop.table(t1,margin=2),main="Smoking Status by Gender")

To add a legend and some color to the barplot:

> barplot(prop.table(t1,margin=2),legend=TRUE,col=c("green","blue"),main = "Smoking Status by Gender")

The stacked barchart shows that the distribution of smoking status is different for men and women, i.e., the blue and green parts are different sizes for the two genders. This difference in the two distributions indicates that there may be a relationship between gender and smoking status.

If the green and blue areas were identical for each gender then the distributions would be the same indicating no relationship between gender and smoking status.
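
To picture what "no relationship" would look like, we can make up a table in which both genders have exactly the same smoking distribution. The sketch below is purely hypothetical: it keeps the row and column totals of the pulse table but spreads the counts so that the column percentages match, which gives fractional counts that could not occur in real data. The names t1_manual, no_relationship and col_pct_null are chosen here for illustration.

```r
# Contingency table rebuilt from the counts shown earlier.
t1_manual <- as.table(matrix(c(27, 8, 37, 19), nrow = 2,
                             dimnames = list(Smoke = c("no", "yes"),
                                             Sex = c("female", "male"))))
n <- sum(t1_manual)

# Hypothetical counts with the same totals but no relationship:
# each cell is (row total * column total) / n.
no_relationship <- outer(rowSums(t1_manual), colSums(t1_manual)) / n

# The column percentages are now identical for females and males.
col_pct_null <- prop.table(no_relationship, margin = 2)
round(col_pct_null, 3)

# The stacked bars have green and blue parts of the same size.
barplot(col_pct_null, legend.text = TRUE, col = c("green", "blue"),
        main = "Hypothetical: No Relationship")
```

Comparing this plot with the barchart of the actual data shows how far the observed distributions are from the identical bars we would see with no relationship.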

We will return to this type of problem when we study hypothesis testing, which will allow us to draw more definitive conclusions about the relationship between these two variables.

Lesson 2.2
