For this exercise, I will look at the Motor Trend Car Road Tests (mtcars) dataset. The data includes 32 observations of 11 variables, many of which are continuous (mpg, displacement, 1/4 mile time) and some of which are categorical (transmission, number of forward gears, and number of cylinders).

head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

Here we see that all of the data is stored as numeric. Since the number of gears (gear), transmission type (am), number of cylinders (cyl), and number of carburetors (carb) each take on only a small number of distinct values, we can convert these columns to factors so that each numerical value is treated as a category (e.g. 4-cylinder cars vs. 6-cylinder cars).

mtcars$am <- as.factor(mtcars$am)
mtcars$cyl <- as.factor(mtcars$cyl)
mtcars$gear <- as.factor(mtcars$gear)
mtcars$carb <- as.factor(mtcars$carb)
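
As a quick check on that choice, one could count the distinct values in each column and then convert the low-cardinality columns in a single step. This is only a sketch: the names `cars`, `n_distinct`, and `factor_cols` are introduced here for illustration, and it works on a fresh copy from `datasets::mtcars` so the data frame modified above is left alone.

```r
# Work on a fresh copy so the conversion above is not affected
cars <- datasets::mtcars

# Number of distinct values in each column
n_distinct <- sapply(cars, function(x) length(unique(x)))
print(n_distinct)

# Convert the chosen columns to factors in one step (same columns as above)
factor_cols <- c("am", "cyl", "gear", "carb")
cars[factor_cols] <- lapply(cars[factor_cols], factor)

# Confirm the new column types
str(cars[factor_cols])
```

Note that vs also takes only two values; it is left numeric here to match the conversion above.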

We can look at the summary statistics for each variable using R’s built-in summary function. For categorical variables, summary returns a count for each level of the factor. For transmission, we see 19 cars coded “0” (automatic transmission) and 13 coded “1” (manual transmission).

summary(mtcars)
##       mpg        cyl         disp             hp             drat      
##  Min.   :10.40   4:11   Min.   : 71.1   Min.   : 52.0   Min.   :2.760  
##  1st Qu.:15.43   6: 7   1st Qu.:120.8   1st Qu.: 96.5   1st Qu.:3.080  
##  Median :19.20   8:14   Median :196.3   Median :123.0   Median :3.695  
##  Mean   :20.09          Mean   :230.7   Mean   :146.7   Mean   :3.597  
##  3rd Qu.:22.80          3rd Qu.:326.0   3rd Qu.:180.0   3rd Qu.:3.920  
##  Max.   :33.90          Max.   :472.0   Max.   :335.0   Max.   :4.930  
##        wt             qsec             vs         am     gear   carb  
##  Min.   :1.513   Min.   :14.50   Min.   :0.0000   0:19   3:15   1: 7  
##  1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000   1:13   4:12   2:10  
##  Median :3.325   Median :17.71   Median :0.0000          5: 5   3: 3  
##  Mean   :3.217   Mean   :17.85   Mean   :0.4375                 4:10  
##  3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000                 6: 1  
##  Max.   :5.424   Max.   :22.90   Max.   :1.0000                 8: 1

If we just wanted to find the frequency of a categorical variable, we could run a table on just that column:

table(mtcars$gear)
## 
##  3  4  5 
## 15 12  5

To see how the number of gears varies by transmission type, we can run a table on two categorical variables. The result shows that, within our dataset, cars with automatic transmissions (row 0) tend to have fewer gears: none has five, and 15 of the 19 have three.

table(mtcars$am, mtcars$gear)
##    
##      3  4  5
##   0 15  4  0
##   1  0  8  5
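
Raw counts can be hard to compare when the groups differ in size (19 automatic cars versus 13 manual). One option, sketched here, is to convert each row to proportions with `prop.table`. The labels "automatic" and "manual" and the names `cars` and `gear_by_am` are assumptions added for readability; the 0/1 coding follows the description above.

```r
# Fresh copy of the data, with am and gear still numeric
cars <- datasets::mtcars

# Cross-tabulate transmission (rows) against number of gears (columns),
# labelling the 0/1 transmission codes for readability
gear_by_am <- table(
  transmission = factor(cars$am, levels = c(0, 1),
                        labels = c("automatic", "manual")),
  gears = cars$gear
)

# Row proportions: share of each transmission type with 3, 4, or 5 gears
round(prop.table(gear_by_am, margin = 1), 2)
```

With `margin = 1` each row sums to 1, so the entries read as the share of automatic (or manual) cars that have a given number of gears.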

A boxplot gives a quick view of how a continuous variable such as mpg is distributed:

boxplot(mtcars$mpg, main = "MPG distribution in MTCARS data", ylab = "MPG")

The boxplot visualizes the statistics reported by the summary function above. The median is about 19 mpg, the minimum about 10, and the maximum about 34; the middle fifty percent of cars, bounded by the bottom and top of the box, fall between roughly 15 and 23 mpg.
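
The values a boxplot draws can also be retrieved directly, which is a handy way to check a reading of the plot. A minimal sketch, using `fivenum` and `boxplot.stats` from base R (the variable names `mpg` and `bp` are just for illustration):

```r
mpg <- datasets::mtcars$mpg

# Tukey's five-number summary: minimum, lower hinge, median, upper hinge, maximum
fivenum(mpg)

# The statistics boxplot() uses: whisker ends, hinges, and median in $stats,
# and any points drawn beyond the whiskers in $out
bp <- boxplot.stats(mpg)
bp$stats
bp$out
```

The hinges that form the box are computed slightly differently from the quartiles reported by summary, so the lower edge of the box may not match the 1st Qu. value exactly.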

The next scatterplot shows the relationship between two numeric variables, mpg and hp. When we think about automotive design, our intuition might tell us that fuel efficiency falls as horsepower rises. The following graph shows that this is a reasonable assumption.

plot(mpg ~ hp, data = mtcars, main = "HP vs. MPG", xlab = "Horsepower", ylab = "Miles/Gallon")
abline(lm(mpg ~ hp, data = mtcars), col = "red") # Add the regression line
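
To put numbers on the trend in the plot, one could look at the correlation and the fitted coefficients. This is a sketch only; the object names `cars`, `r`, and `fit` are introduced here for illustration.

```r
cars <- datasets::mtcars

# Pearson correlation between mpg and hp (negative if mpg falls as hp rises)
r <- cor(cars$mpg, cars$hp)
print(r)

# Fit the same simple linear regression that abline() draws on the scatterplot
fit <- lm(mpg ~ hp, data = cars)

# Intercept and slope; the slope is the estimated change in mpg per unit of hp
print(coef(fit))
```

A negative correlation and a negative slope would both agree with the downward-sloping red line; `summary(fit)` gives standard errors and R-squared if more detail is needed.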