Packages used: datasets, ggplot2, dplyr
library(dplyr)
library(ggplot2)
data(mtcars)
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
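As a rough complement to the str() output, counting the distinct values in each column can hint at which numeric columns behave like categories. This is only a heuristic sketch; the name distinct_counts is introduced here for illustration, and it reads from datasets::mtcars so the working copy is left untouched.

```r
# Count distinct values per column; a small count suggests a categorical variable.
distinct_counts <- sapply(datasets::mtcars, function(column) length(unique(column)))

# Show the columns from fewest to most distinct values.
sort(distinct_counts)
```

Columns with only a handful of distinct values are candidates for conversion to factors, although the final call depends on what each variable means.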
Fuel economy (mpg), displacement (disp), horsepower (hp), rear axle ratio (drat), weight (wt), and quarter mile time (qsec) are all appropriately stored as numeric.
Number of cylinders (cyl), engine shape (vs), transmission type (am), number of forward gears (gear), and number of carburetors (carb) are categorical variables currently stored as numeric. We can convert them to the more appropriate factor class.
mtcars$cyl <- as.factor(mtcars$cyl)
mtcars$vs <- as.factor(mtcars$vs)
mtcars$am <- as.factor(mtcars$am)
mtcars$gear <- as.factor(mtcars$gear)
mtcars$carb <- as.factor(mtcars$carb)
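The five assignments above could also be collapsed into one step. This is a minimal sketch, assuming dplyr 1.0.0 or later (for across()); the names categorical and cars_factored are illustrative, and the result goes into a separate copy so the steps that follow are unaffected.

```r
library(dplyr)

# Columns that hold categories but are stored as numbers.
categorical <- c("cyl", "vs", "am", "gear", "carb")

# Convert all of them to factors at once, leaving the other columns as they are.
cars_factored <- datasets::mtcars %>%
  mutate(across(all_of(categorical), as.factor))

# Check the resulting class of each converted column.
sapply(cars_factored[categorical], class)
```

Listing the column names once makes it harder to miss a variable than repeating the assignment by hand.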
I’ll give the transmission type descriptive text labels, since no other variable in this dataset is a text-labelled category. In mtcars, am is coded 0 for automatic and 1 for manual.
mtcars$am <- factor(mtcars$am, levels = c(0, 1), labels = c("automatic", "manual"))
The rest of the categorical variables can be stored as ordered factors.
mtcars$cyl <- factor(mtcars$cyl, ordered = TRUE)
mtcars$vs <- factor(mtcars$vs, ordered = TRUE)
mtcars$gear <- factor(mtcars$gear, ordered = TRUE)
mtcars$carb <- factor(mtcars$carb, ordered = TRUE)
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : Ord.factor w/ 3 levels "4"<"6"<"8": 2 2 1 2 3 2 3 1 1 2 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : Ord.factor w/ 2 levels "0"<"1": 1 1 2 2 1 2 1 2 2 2 ...
## $ am : Factor w/ 2 levels "automatic","manual": 2 2 2 1 1 1 1 1 1 1 ...
## $ gear: Ord.factor w/ 3 levels "3"<"4"<"5": 2 2 2 1 1 1 1 2 2 2 ...
## $ carb: Ord.factor w/ 6 levels "1"<"2"<"3"<"4"<..: 4 4 1 1 2 1 4 2 2 4 ...
The summary function reports the minimum, first quartile (25th percentile), median, mean, third quartile (75th percentile), and maximum of each numeric vector. Using the select function from the dplyr package, we can limit the data frame to the numeric columns of mtcars.
cars.numeric <- select(mtcars, mpg, disp, hp, drat, wt, qsec)
summary(cars.numeric)
##       mpg             disp             hp             drat
##  Min.   :10.40   Min.   : 71.1   Min.   : 52.0   Min.   :2.760
##  1st Qu.:15.43   1st Qu.:120.8   1st Qu.: 96.5   1st Qu.:3.080
##  Median :19.20   Median :196.3   Median :123.0   Median :3.695
##  Mean   :20.09   Mean   :230.7   Mean   :146.7   Mean   :3.597
##  3rd Qu.:22.80   3rd Qu.:326.0   3rd Qu.:180.0   3rd Qu.:3.920
##  Max.   :33.90   Max.   :472.0   Max.   :335.0   Max.   :4.930
##        wt             qsec
##  Min.   :1.513   Min.   :14.50
##  1st Qu.:2.581   1st Qu.:16.89
##  Median :3.325   Median :17.71
##  Mean   :3.217   Mean   :17.85
##  3rd Qu.:3.610   3rd Qu.:18.90
##  Max.   :5.424   Max.   :22.90
Finding the frequency of a categorical variable: transmission type (am).
table(mtcars$am)
##
## automatic    manual
##        19        13
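Counts are often easier to compare as proportions. The sketch below works on the raw 0/1 coding of am taken straight from datasets::mtcars, so it does not depend on any relabelling; am_counts and am_share are illustrative names.

```r
# Frequency of each raw am code in the untouched dataset.
am_counts <- table(datasets::mtcars$am)

# Convert the counts to shares of the 32 cars.
am_share <- prop.table(am_counts)

round(am_share, 3)
```

prop.table() divides each count by the table total, so the shares sum to 1.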
Finding the frequency of one categorical variable based on another: carburetor by cylinders.
table(mtcars$carb, mtcars$cyl, dnn=c("No. of Carburetors", "No. of Cylinders"))
##                   No. of Cylinders
## No. of Carburetors 4 6 8
##                  1 5 2 0
##                  2 6 0 4
##                  3 0 0 3
##                  4 0 4 6
##                  6 0 1 0
##                  8 0 0 1
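Row and column totals make a two-way table like this easier to read. As a sketch, base R's addmargins() can append them; carb_by_cyl and with_totals are illustrative names, and the table is rebuilt from datasets::mtcars so the example runs on its own.

```r
cars <- datasets::mtcars

# Same cross-tabulation as above, built from the untouched dataset.
carb_by_cyl <- table(cars$carb, cars$cyl,
                     dnn = c("No. of Carburetors", "No. of Cylinders"))

# Append a "Sum" row and a "Sum" column holding the totals.
with_totals <- addmargins(carb_by_cyl)
with_totals
```

The bottom-right cell of with_totals is the total number of cars, which is a quick check that no rows were dropped.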
Create a graph for a single numeric variable. I decided to make a boxplot of fuel economy (mpg) grouped by number of cylinders (cyl). As you might expect, eight-cylinder engines have the lowest fuel economy, whereas four-cylinder engines have the highest.
boxplot(mpg ~ cyl, data = mtcars, main = "Fuel Economy (MPG) Boxplot", ylab = "Miles Per Gallon", xlab = "Number of Cylinders")
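To put numbers behind the boxplot, a grouped summary can be computed with dplyr. This is a sketch working on the raw datasets::mtcars (where cyl is still numeric); mpg_by_cyl and its column names are illustrative.

```r
library(dplyr)

# Per-cylinder-group count, median, and mean of fuel economy.
mpg_by_cyl <- datasets::mtcars %>%
  group_by(cyl) %>%
  summarise(
    n = n(),
    median_mpg = median(mpg),
    mean_mpg = mean(mpg)
  )

mpg_by_cyl
```

The medians in this table correspond to the thick horizontal lines inside each box.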
Create a scatterplot of two numeric variables. I created a scatterplot showing how quarter-mile time generally decreases with increasing horsepower. The grey band is the 95% confidence interval around the fitted linear regression line.
ggplot(data = mtcars, aes(x = hp, y = qsec)) +
  geom_point(shape = 17, color = "blue", size = 2) +
  geom_smooth(method = "lm", color = "red", linetype = 2) +
  labs(title = "Quarter-mile Time vs. Horsepower in the Motor Trend Car Dataset", x = "Horsepower", y = "Quarter Mile Time (sec)")
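The line drawn by geom_smooth(method = "lm") corresponds to an ordinary least squares fit of qsec on hp, so the same relationship can be inspected numerically with lm(). A sketch, with fit, new_hp, and band as illustrative names and three arbitrary horsepower values chosen only for demonstration:

```r
# Ordinary least squares fit of quarter-mile time on horsepower.
fit <- lm(qsec ~ hp, data = datasets::mtcars)

# Intercept and slope, with 95% confidence intervals for both.
coef(fit)
confint(fit, level = 0.95)

# Fitted values and the 95% confidence band at a few horsepower values.
new_hp <- data.frame(hp = c(100, 200, 300))
band <- predict(fit, newdata = new_hp, interval = "confidence", level = 0.95)
band
```

A negative hp coefficient matches the downward trend in the plot, and the lwr and upr columns of band trace the edges of the grey region at those horsepower values.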