Each one of the questions asks about the relationship between two categorical variables. In this lesson, we will consider graphical and numerical summaries for such a relationship.
In this lesson, we will use the pulse dataset. Eight variables were collected on 91 college students. Each student was asked to flip a coin. Those whose coin came up heads were asked to run in place for a minute or so. All students measured their own pulse rate both before and after the running activity took place. We will investigate whether there is a relationship between a student’s smoking status, Smoke, and the student’s gender, Sex.
Here is some of the data:
  PuBefore PuAfter Ran Smoke    Sex Height Weight ActLev
1       54      56  no   yes   male     69    145      2
2       54      50  no    no   male     69    160      2
3       58      70 yes    no   male     72    145      2
4       58      58  no    no   male     66    135      3
5       58      56  no    no female     67    125      2
6       60      76 yes    no   male     71    170      3
Example 1: Identify the observational units, the variables and the types of variables for each of the examples above.
A contingency table gives us the counts in each possible combination of the two variables.
In Lesson 1.2, we made a frequency table with the table command in R as a way of summarizing one categorical variable. When we have two categorical variables, we will once again use the table command in R to create a contingency table.
Notice that we can use the table command with different inputs (one categorical variable or two categorical variables) and get different outputs (just like the boxplot command from Lesson 2.1).
The first variable listed in the table command will make up the rows of the contingency table and the second variable listed will make up the columns. This will be important later on when we make a stacked barchart.
The R commands below create a contingency table of smoking status and sex, with smoking status as the rows. The contingency table is stored in an object called t1.
> t1 <- table(pulse$Smoke,pulse$Sex)
> t1
    female male
no      27   37
yes      8   19
We can see, for example, that there are 27 students who are nonsmoking females and there are 19 male smokers.
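If the pulse data frame is not at hand, the same contingency table can be rebuilt from the four counts shown above. The following is only a sketch: the object name t1_manual and the dimension labels Smoke and Sex are names chosen for this illustration, not output from the pulse data itself. The addmargins command appends the row and column totals to the table.

```r
# Rebuild the 2 x 2 contingency table from the counts printed above.
# t1_manual is a hypothetical name used only for this illustration.
# matrix() fills column by column: female (27, 8) first, then male (37, 19).
t1_manual <- as.table(matrix(c(27, 8, 37, 19), nrow = 2,
                             dimnames = list(Smoke = c("no", "yes"),
                                             Sex = c("female", "male"))))
t1_manual

# Append the row totals, the column totals and the grand total.
addmargins(t1_manual)
```

The grand total in the bottom right corner should match the 91 students in the study.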
The marginal distribution is a frequency table of either the row or column variable in the contingency table. It is essentially looking at each categorical variable separately. The R command margin.table gives the marginal distribution from the contingency table. The margin = 1 argument asks for the marginal distribution of the row variable and the margin = 2 argument asks for the marginal distribution of the column variable.
> margin.table(t1,margin=1)
 no yes
 64  27
> margin.table(t1,margin=2)
female   male
    35     56
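Since a marginal distribution is just the set of row totals or column totals, it can be checked with rowSums and colSums. The sketch below rebuilds the table from the counts shown earlier (an assumption made so that the code runs without the pulse data frame) and compares the two approaches.

```r
# Contingency table rebuilt from the counts in the lesson, because the
# pulse data frame itself is not included here.
t1 <- as.table(matrix(c(27, 8, 37, 19), nrow = 2,
                      dimnames = list(Smoke = c("no", "yes"),
                                      Sex = c("female", "male"))))

# Marginal distribution of the row variable (smoking status).
row_totals <- margin.table(t1, margin = 1)
# Marginal distribution of the column variable (sex).
col_totals <- margin.table(t1, margin = 2)

# The same totals computed directly as row sums and column sums.
rowSums(t1)
colSums(t1)
```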
The conditional distribution shows the distribution of one variable given one value of the other variable. It involves comparing percentages instead of counts. However, there are several types of percentages that we can compute: overall percentages, column percentages and row percentages.
The overall percentage is where each value in the contingency table is divided by the total number of observations (in this case 91). This can be found using the prop.table command in R:
> prop.table(t1)
        female       male
no  0.29670330 0.40659341
yes 0.08791209 0.20879121
For example, 29.7% of the students were females who don’t smoke and 20.9% of the students were males who do smoke. Notice that the four entries sum to 1. We say “of the students” because we are dividing everything by all 91 students.
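The arithmetic behind the overall percentages can be done by hand: divide each count by the total number of students. The sketch below assumes the table is rebuilt from the counts shown earlier; the names n and overall are chosen for this illustration.

```r
# Contingency table rebuilt from the counts in the lesson.
t1 <- as.table(matrix(c(27, 8, 37, 19), nrow = 2,
                      dimnames = list(Smoke = c("no", "yes"),
                                      Sex = c("female", "male"))))

n <- sum(t1)        # total number of students
overall <- t1 / n   # each count divided by the grand total

# Overall percentages, rounded to one decimal place.
round(100 * overall, 1)

# The overall proportions add up to 1.
sum(overall)
```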
The row percentage is where each value in the contingency table is divided by its row total (i.e. the marginal distribution of smoking status from the previous section). This can be found using the prop.table command with the margin = 1 argument in R:
> prop.table(t1,margin=1)
       female      male
no  0.4218750 0.5781250
yes 0.2962963 0.7037037
For example, 42.2% of the nonsmokers are female and 70.4% of the smokers are male. Each row sums to 1. We say “of the nonsmokers” because we divided the nonsmoker row by the total number of nonsmokers and “of the smokers” because we divided the smoker row by the total number of smokers. This is called the conditional distribution of gender given smoking status. Each row shows us the whole distribution of gender (female/male) for each smoking status separately.
The column percentage is where each value in the contingency table is divided by its column total (i.e. the marginal distribution of gender from the previous section). This can be found using the prop.table command with the margin = 2 argument in R:
> prop.table(t1,margin=2)
       female      male
no  0.7714286 0.6607143
yes 0.2285714 0.3392857
For example, 77.1% of the females do not smoke and 66.1% of the males do not smoke. Each column sums to 1. We say “of the females” because we divided the female column by the total number of females and “of the males” because we divided the male column by the total number of males. This is called the conditional distribution of smoking status given gender. Each column shows us the whole distribution of smoking status (yes/no) for each gender separately.
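To see what the margin argument is doing, the row and column percentages can be computed by hand with the sweep command, which divides each row (or column) by its own total. This is a sketch under the assumption that the table is rebuilt from the counts shown earlier; row_pct and col_pct are names chosen for this illustration.

```r
# Contingency table rebuilt from the counts in the lesson.
t1 <- as.table(matrix(c(27, 8, 37, 19), nrow = 2,
                      dimnames = list(Smoke = c("no", "yes"),
                                      Sex = c("female", "male"))))

# Row percentages: divide every row by its row total
# (conditional distribution of gender given smoking status).
row_pct <- sweep(t1, 1, rowSums(t1), "/")

# Column percentages: divide every column by its column total
# (conditional distribution of smoking status given gender).
col_pct <- sweep(t1, 2, colSums(t1), "/")

rowSums(row_pct)   # each row of row_pct sums to 1
colSums(col_pct)   # each column of col_pct sums to 1

# Smoking rate within each gender, read from the "yes" row of col_pct.
round(100 * col_pct["yes", ], 1)
```

The "yes" row of the column percentages holds the proportion of smokers within each gender, and the "no" row holds the proportion of nonsmokers.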
The conditional distribution of smoking status given gender allows us to compare the rate of smoking in the two genders and is the key to describing the relationship between the two categorical variables. We can see that there is a slightly higher rate of smoking for men (34%) compared to women (23%). To see this even more clearly, we can make a stacked barchart of this conditional distribution.
To make a stacked barchart, use the barplot command on the conditional distribution of smoking status given gender in the previous section. This only works on column percentages since the barplot command makes each bar out of each column.
> barplot(prop.table(t1,margin=2),main="Smoking Status by Gender")
To add a legend and some color to the barplot:
> barplot(prop.table(t1,margin=2),legend=TRUE,col=c("green","blue"),main = "Smoking Status by Gender")
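Because barplot builds one bar from each column, the row percentages cannot be passed in directly. If we wanted bars for each smoking status instead, one possible approach is to transpose the row percentages with the t command so that each smoking status becomes a column. The sketch below assumes the table is rebuilt from the counts shown earlier; gender_given_smoke, the colors and the plot title are choices made for this illustration.

```r
# Contingency table rebuilt from the counts in the lesson.
t1 <- as.table(matrix(c(27, 8, 37, 19), nrow = 2,
                      dimnames = list(Smoke = c("no", "yes"),
                                      Sex = c("female", "male"))))

# Row percentages, transposed so that each smoking status is a column.
# Every column of the transposed table now sums to 1.
gender_given_smoke <- t(prop.table(t1, margin = 1))
colSums(gender_given_smoke)

# One stacked bar per smoking status, split by gender.
barplot(gender_given_smoke, legend.text = TRUE, col = c("green", "blue"),
        main = "Gender by Smoking Status")
```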
The stacked barchart shows that the distribution of smoking status is different for men and women, i.e. the blue and green parts are different sizes for the two genders. This difference between the two distributions indicates that there may be a relationship between gender and smoking status.
If the green and blue areas were identical for each gender then the distributions would be the same indicating no relationship between gender and smoking status.
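To picture the "no relationship" case, we can construct a hypothetical table that has the same row and column totals as the pulse data but identical conditional distributions. The sketch below does this by spreading each row total across the columns in proportion to the column totals. The resulting counts are made up for illustration (they are not whole numbers and are not observed data), and the name no_relationship is chosen for this example.

```r
# Contingency table rebuilt from the counts in the lesson.
t1 <- as.table(matrix(c(27, 8, 37, 19), nrow = 2,
                      dimnames = list(Smoke = c("no", "yes"),
                                      Sex = c("female", "male"))))

# Hypothetical counts with the same margins as t1 but no relationship:
# (row total * column total) / grand total for every cell.
no_relationship <- outer(rowSums(t1), colSums(t1)) / sum(t1)
no_relationship

# Column percentages are now identical for females and males.
prop.table(no_relationship, margin = 2)

# The stacked bars for the two genders are therefore the same height split.
barplot(prop.table(no_relationship, margin = 2), legend.text = TRUE,
        col = c("green", "blue"), main = "No Relationship (hypothetical)")
```

Comparing this plot with the one for the actual data shows how far the observed distributions are from the case of no relationship.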
We will return to this type of problem when we study hypothesis testing, which will allow us to make more definitive conclusions about the relationship between these two variables.