Each one of the questions asks about the relationship between a categorical variable and a quantitative variable. In this lesson, we will consider graphical and numerical summaries for such a relationship.
The basic idea is to take the summaries that were appropriate for one quantitative variable and apply them separately to each category of interest.
The mtcars dataset contains data from the 1974 Motor Trend US magazine. Two of the variables measured are mpg (miles/gallon) and am (Transmission 0 = automatic, 1 = manual). Is there a relationship between miles/gallon and transmission type? Here is some of the data:
                   mpg am
Mazda RX4         21.0  1
Mazda RX4 Wag     21.0  1
Datsun 710        22.8  1
Hornet 4 Drive    21.4  0
Hornet Sportabout 18.7  0
Valiant           18.1  0
Example 1: Identify the observational units, the variables, and the types of variables for each of the examples above.
We already saw the boxplot function. In that lesson, we used the boxplot function with one quantitative variable, which gave us one boxplot. We are now going to use the same boxplot function with two variables, which will result in multiple boxplots. This happens a lot in R: the same function with different types of input will result in different types of output.
To make side-by-side boxplots of miles/gallon for each type of transmission, we start with the quantitative variable, follow it with the tilde symbol ~, and end with the categorical variable. It’s very important to put the quantitative variable first and the categorical variable second.
> boxplot(mtcars$mpg~mtcars$am)
We can make this plot prettier by adding some titles and labels. Also, we are going to change the 0 and 1 labels to A (for automatic) and M (for manual).
> boxplot(mtcars$mpg~mtcars$am,horizontal=TRUE,xlab="MPG",main="Miles/Gallon by Transmission",names=c("A","M"))
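As a variation on the command above, here is a minimal sketch that stores the A/M labels in the data instead of in the plotting call. The names mt and trans are made up for this illustration and are not part of the lesson; factor() and the data argument of boxplot() are standard R.

```r
# Work on a copy so the built-in mtcars data is left unchanged
# ('mt' and 'trans' are illustrative names, not part of the lesson)
mt <- mtcars

# Recode 0/1 as a labelled categorical variable: 0 -> "A", 1 -> "M"
mt$trans <- factor(mt$am, levels = c(0, 1), labels = c("A", "M"))

# Quantitative variable first, then ~, then the categorical variable;
# data = mt lets us write column names without the mt$ prefix
boxplot(mpg ~ trans, data = mt, horizontal = TRUE,
        xlab = "MPG", main = "Miles/Gallon by Transmission")
```

Because the labels now live in the trans column, any later summary by transmission type would also show A and M rather than 0 and 1.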
What do the boxplots show you about the relationship between miles/gallon and transmission type? It looks like cars with manual transmissions get better gas mileage. The median gas mileage for manual transmission cars is close to the maximum gas mileage for automatic transmission cars, so about half of the manual transmission cars do better than almost all of the automatic transmission cars.
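The reading of the boxplots can be checked numerically. The following is a minimal sketch, not part of the original lesson: it uses tapply, which applies a function to a quantitative variable separately within each category, and the variable names med, mx and prop_below are chosen only for this illustration.

```r
# Median and maximum mpg within each transmission type
# (0 = automatic, 1 = manual)
med <- tapply(mtcars$mpg, mtcars$am, median)
mx  <- tapply(mtcars$mpg, mtcars$am, max)
med
mx

# Proportion of automatic cars whose mpg is below the manual median
auto_mpg   <- mtcars$mpg[mtcars$am == 0]
prop_below <- mean(auto_mpg < med["1"])
prop_below
```

With the standard mtcars data, the manual median is 22.8 and the automatic maximum is 24.4, and 17 of the 19 automatic cars fall below the manual median.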
Later in the semester we will return to this example and formally test whether automatic cars get lower gas mileage than manual cars.
The boxplots look fairly symmetric so we are going to find the mean and standard deviation of miles/gallon for each transmission type using the tapply function. The tapply function needs three inputs: a quantitative variable, a categorical variable, and a function to be applied to the quantitative variable for each category.
> tapply(mtcars$mpg,mtcars$am,mean)
       0        1 
17.14737 24.39231 
> tapply(mtcars$mpg,mtcars$am,sd)
       0        1 
3.833966 6.166504 
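The two tapply calls can also be combined into a single table. This is a minimal sketch under the assumption that a sample size, mean and standard deviation per group are what is wanted; group_summary is a hypothetical helper written for this illustration, while split() and sapply() are standard R functions.

```r
# Hypothetical helper: sample size, mean and standard deviation of one group
group_summary <- function(x) {
  c(n = length(x), mean = mean(x), sd = sd(x))
}

# split() divides mpg into one vector per transmission type;
# sapply() applies the helper to each and binds the results into a matrix
# with one column per category (0 = automatic, 1 = manual)
result <- sapply(split(mtcars$mpg, mtcars$am), group_summary)
round(result, 2)
```

The mean and sd rows should agree with the separate tapply results above, and the n row shows how many cars are in each group.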