For this exercise, I will look at the Motor Trend Car Road Tests (mtcars) dataset. The data includes 32 observations of 11 variables, many of which are continuous (mpg, displacement, 1/4 mile time) and some of which are categorical (transmission type, number of forward gears, and number of cylinders).
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
Here we see that every column is stored as numeric. Since we know that the number of gears (gear), transmission type (am), number of cylinders (cyl) and number of carburetors (carb) take on only a small number of distinct values, we can convert these columns to factors so that each numerical value is treated as a category (e.g. 4-cylinder cars vs. 6-cylinder cars).
mtcars$am <- as.factor(mtcars$am)
mtcars$cyl <- as.factor(mtcars$cyl)
mtcars$gear <- as.factor(mtcars$gear)
mtcars$carb <- as.factor(mtcars$carb)
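The same conversion can be written more compactly by looping over the column names. The following is only a sketch of one way to do it: the names `cars` and `factor_cols` are introduced here for illustration, and the code works on a copy so that it does not interfere with the `mtcars` object used in the rest of this exercise. The transmission labels assume the coding given in the dataset documentation (0 = automatic, 1 = manual).

```r
# Work on a copy (hypothetical name `cars`) so mtcars itself is left as it is.
cars <- datasets::mtcars

# The four columns we want to treat as categorical.
factor_cols <- c("am", "cyl", "gear", "carb")

# lapply() applies factor() to each selected column; assigning back
# into cars[factor_cols] replaces the numeric columns with factors.
cars[factor_cols] <- lapply(cars[factor_cols], factor)

# Optional: replace the "0"/"1" levels of am with readable labels
# (assumes 0 = automatic and 1 = manual).
levels(cars$am) <- c("automatic", "manual")

str(cars[factor_cols])
table(cars$am)
```

Labelling the levels is a matter of taste: it makes tables and plots easier to read, at the cost of no longer matching the raw 0/1 coding.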
We can look at the summary statistics for each of the variables using R’s built-in summary function. For the categorical variables, the summary function will return a count for each level of the factors. For transmission, we see that there are 19 for “0” (automatic transmission) and 13 for “1” (manual transmission).
summary(mtcars)
##       mpg        cyl         disp             hp             drat      
##  Min.   :10.40   4:11   Min.   : 71.1   Min.   : 52.0   Min.   :2.760  
##  1st Qu.:15.43   6: 7   1st Qu.:120.8   1st Qu.: 96.5   1st Qu.:3.080  
##  Median :19.20   8:14   Median :196.3   Median :123.0   Median :3.695  
##  Mean   :20.09          Mean   :230.7   Mean   :146.7   Mean   :3.597  
##  3rd Qu.:22.80          3rd Qu.:326.0   3rd Qu.:180.0   3rd Qu.:3.920  
##  Max.   :33.90          Max.   :472.0   Max.   :335.0   Max.   :4.930  
##        wt             qsec             vs         am     gear   carb  
##  Min.   :1.513   Min.   :14.50   Min.   :0.0000   0:19   3:15   1: 7  
##  1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000   1:13   4:12   2:10  
##  Median :3.325   Median :17.71   Median :0.0000          5: 5   3: 3  
##  Mean   :3.217   Mean   :17.85   Mean   :0.4375                 4:10  
##  3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000                 6: 1  
##  Max.   :5.424   Max.   :22.90   Max.   :1.0000                 8: 1
If we just wanted to find the frequency of a categorical variable, we could run a table on just that column:
table(mtcars$gear)
## 
##  3  4  5 
## 15 12  5
If we want to see how the number of gears varies by transmission type, we can run a table on two categorical variables. In the output, rows are transmission type (0 = automatic, 1 = manual) and columns are the number of forward gears. This shows that, within our dataset, cars with automatic transmissions tend to have fewer gears.
table(mtcars$am, mtcars$gear)
##    
##      3  4  5
##   0 15  4  0
##   1  0  8  5
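Raw counts can be hard to compare when the two groups differ in size (19 automatic cars vs. 13 manual cars). One way to make the comparison easier, sketched here, is to convert each row to proportions with prop.table. The names `gear_by_am` and `row_share` are chosen for this example only.

```r
# Cross-tabulate transmission type (rows) against forward gears (columns).
gear_by_am <- table(am = mtcars$am, gear = mtcars$gear)

# margin = 1 divides each cell by its row total, so every row sums to 1.
row_share <- prop.table(gear_by_am, margin = 1)

# Round for display only; row_share keeps full precision.
round(row_share, 2)
```

Read across a row to see how cars of one transmission type are distributed over 3, 4 and 5 gears. For example, 15 of the 19 automatic cars (about 79%) have three gears, while none of the manual cars do.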
boxplot(mtcars$mpg,main="MPG distribution in MTCARS data", ylab="MPG")
The boxplot displays the same mpg statistics reported by the summary function above. The median is about 19 mpg, the minimum is about 10 and the maximum is about 34. The middle fifty percent of the cars, bounded by the bottom and top edges of the box, fall between roughly 15 and 23 mpg.
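To check the numbers read off the plot, we can ask R for the statistics a boxplot is drawn from. This is a small sketch using fivenum and boxplot.stats; the names `mpg_five` and `mpg_stats` are introduced here for illustration. Note that the box edges are Tukey's hinges, which can differ slightly from the quartiles printed by summary.

```r
# Tukey five-number summary: minimum, lower hinge, median, upper hinge, maximum.
mpg_five <- fivenum(mtcars$mpg)
names(mpg_five) <- c("min", "lower_hinge", "median", "upper_hinge", "max")
mpg_five

# boxplot.stats() returns the values used to draw the box and whiskers ($stats)
# and any points lying beyond the whiskers ($out).
mpg_stats <- boxplot.stats(mtcars$mpg)
mpg_stats$stats
mpg_stats$out
```

If `mpg_stats$out` is empty, no car is flagged as an outlier and the whiskers extend all the way to the minimum and maximum.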
The next scatterplot shows the relationship between two numeric variables, mpg and hp. When we think about automotive design, our intuition might tell us that fuel efficiency decreases as horsepower increases. The following graph suggests that this is a reasonable assumption.
plot(mpg ~ hp, data = mtcars, main = "HP vs. MPG", xlab = "Horsepower", ylab = "Miles/Gallon")
abline(lm(mpg ~ hp, data = mtcars), col = "red") # Add the regression line
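The plot gives a visual impression of the trend. To put a number on it, we could look at the fitted slope and the correlation between the two variables. The following is a minimal sketch; `fit` and `r` are names chosen here for illustration, and a straight line is only one possible way to describe the relationship.

```r
# Fit mpg as a linear function of horsepower (the same model drawn by abline).
fit <- lm(mpg ~ hp, data = mtcars)

# The hp coefficient is the estimated change in mpg per additional horsepower.
coef(fit)

# Pearson correlation; a negative value means mpg tends to fall as hp rises.
r <- cor(mtcars$mpg, mtcars$hp)
r
```

A negative slope and a negative correlation both agree with the downward-sloping red line in the scatterplot.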