1. Choose and load any R dataset (except for diamonds!) that has at least two numeric variables and at least two categorical variables. Identify which variables in your dataset are numeric and which are categorical (factors).
2. Generate summary-level descriptive statistics: show the mean, median, 25th and 75th percentiles (first and third quartiles), min, and max for each of the applicable variables in your dataset.
3. Determine the frequency for one of the categorical variables.
4. Determine the frequency for one of the categorical variables, by a different categorical variable.
5. Create a graph for a single numeric variable.
6. Create a scatterplot of two numeric variables.

Packages used: datasets, ggplot2, dplyr

library(dplyr)    # provides select()
library(ggplot2)  # provides ggplot() and the geom_* layers
data(mtcars)
str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

Fuel economy (mpg), displacement (disp), horsepower (hp), rear axle ratio (drat), weight (wt), and quarter mile time (qsec) are all appropriately stored as numeric.

Number of cylinders (cyl), engine shape (vs: 0 = V-shaped, 1 = straight), transmission type (am), number of forward gears (gear), and number of carburetors (carb) are categorical variables currently stored as numeric. We can change them to the more appropriate class, factor.

mtcars$cyl <- as.factor(mtcars$cyl)
mtcars$vs <- as.factor(mtcars$vs)
mtcars$am <- as.factor(mtcars$am)
mtcars$gear <- as.factor(mtcars$gear)
mtcars$carb <- as.factor(mtcars$carb)
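
The five assignments above repeat the same operation, so they could also be written as a single step. A minimal sketch in base R, working on a copy so the mtcars object used in the rest of this walkthrough is untouched; the names cars_demo and cat_cols are illustrative and not part of the original analysis:

```r
# Work on a copy of the built-in data set (cars_demo is a hypothetical name).
cars_demo <- datasets::mtcars

# Columns treated as categorical in this walkthrough.
cat_cols <- c("cyl", "vs", "am", "gear", "carb")

# Apply as.factor() to each listed column and assign the results back.
cars_demo[cat_cols] <- lapply(cars_demo[cat_cols], as.factor)

# Check the result: each listed column should now be a factor.
sapply(cars_demo[cat_cols], is.factor)
```

Keeping the column names in one vector makes it easy to see which variables are being treated as categorical.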

I’ll give the transmission type text labels, since this dataset has no categorical variable with text values. In mtcars, am is coded 0 for automatic and 1 for manual.

mtcars$am <- factor(mtcars$am, levels = c(0, 1), labels = c("automatic", "manual"))

The rest of the categorical vectors can be stored as ordered factors. (Engine shape, vs, has no natural order; it is ordered here only for consistency with the others.)

mtcars$cyl <- factor(mtcars$cyl, ordered = TRUE)
mtcars$vs <- factor(mtcars$vs, ordered = TRUE)
mtcars$gear <- factor(mtcars$gear, ordered = TRUE)
mtcars$carb <- factor(mtcars$carb, ordered = TRUE)
str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : Ord.factor w/ 3 levels "4"<"6"<"8": 2 2 1 2 3 2 3 1 1 2 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : Ord.factor w/ 2 levels "0"<"1": 1 1 2 2 1 2 1 2 2 2 ...
##  $ am  : Factor w/ 2 levels "automatic","manual": 2 2 2 1 1 1 1 1 1 1 ...
##  $ gear: Ord.factor w/ 3 levels "3"<"4"<"5": 2 2 2 1 1 1 1 2 2 2 ...
##  $ carb: Ord.factor w/ 6 levels "1"<"2"<"3"<"4"<..: 4 4 1 1 2 1 4 2 2 4 ...

The summary() function reports the minimum, first quartile (25th percentile), median, mean, third quartile (75th percentile), and maximum of each numeric vector. Using select() from the dplyr package, we can restrict the mtcars data frame to its numeric columns.

cars.numeric <- select(mtcars, mpg, disp, hp, drat, wt, qsec)
summary(cars.numeric)
##       mpg             disp             hp             drat      
##  Min.   :10.40   Min.   : 71.1   Min.   : 52.0   Min.   :2.760  
##  1st Qu.:15.43   1st Qu.:120.8   1st Qu.: 96.5   1st Qu.:3.080  
##  Median :19.20   Median :196.3   Median :123.0   Median :3.695  
##  Mean   :20.09   Mean   :230.7   Mean   :146.7   Mean   :3.597  
##  3rd Qu.:22.80   3rd Qu.:326.0   3rd Qu.:180.0   3rd Qu.:3.920  
##  Max.   :33.90   Max.   :472.0   Max.   :335.0   Max.   :4.930  
##        wt             qsec      
##  Min.   :1.513   Min.   :14.50  
##  1st Qu.:2.581   1st Qu.:16.89  
##  Median :3.325   Median :17.71  
##  Mean   :3.217   Mean   :17.85  
##  3rd Qu.:3.610   3rd Qu.:18.90  
##  Max.   :5.424   Max.   :22.90

Finding the frequency of a categorical variable: transmission type (am).

table(mtcars$am)
## 
## automatic    manual 
##        19        13
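
Raw counts are often easier to interpret alongside proportions. A minimal sketch using base R's prop.table(), assuming the coding given in the mtcars documentation (0 = automatic, 1 = manual); am_demo, am_counts, and am_share are illustrative names, not part of the original analysis:

```r
# Rebuild the labelled transmission factor on its own (am_demo is a hypothetical name).
am_demo <- factor(datasets::mtcars$am, levels = c(0, 1),
                  labels = c("automatic", "manual"))

am_counts <- table(am_demo)         # absolute frequencies
am_share  <- prop.table(am_counts)  # relative frequencies, summing to 1

round(100 * am_share, 1)            # percentages, one decimal place
```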

Finding the frequency of one categorical variable by another: number of carburetors by number of cylinders.

table(mtcars$carb, mtcars$cyl, dnn=c("No. of Carburetors", "No. of Cylinders"))
##                   No. of Cylinders
## No. of Carburetors 4 6 8
##                  1 5 2 0
##                  2 6 0 4
##                  3 0 0 3
##                  4 0 4 6
##                  6 0 1 0
##                  8 0 0 1
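
A two-way table like the one above can be extended with totals and proportions. The sketch below is one way to do that in base R with addmargins() and prop.table(); it rebuilds the table from a copy of the data, and the names cars_demo and carb_by_cyl are illustrative:

```r
# Rebuild the two-way table from a copy of the data (names are hypothetical).
cars_demo <- datasets::mtcars
carb_by_cyl <- table(cars_demo$carb, cars_demo$cyl,
                     dnn = c("No. of Carburetors", "No. of Cylinders"))

# Append row and column totals to the counts.
addmargins(carb_by_cyl)

# Column proportions: within each cylinder count, the share of each carburetor count.
round(prop.table(carb_by_cyl, margin = 2), 2)
```

Using margin = 1 instead would give row proportions, that is, the split across cylinder counts within each carburetor count.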

Create a graph for a single numeric variable. I decided to make a boxplot of fuel economy (mpg) grouped by number of cylinders (cyl). As you might expect, eight-cylinder engines have the lowest fuel economy (median 15.2 mpg), whereas four-cylinder engines have the highest (median 26.0 mpg).

boxplot(mpg ~ cyl, data = mtcars, main = "Fuel Economy (MPG) Boxplot", ylab = "Miles Per Gallon", xlab = "Number of Cylinders")
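
Because the boxplot groups mpg by cyl, it involves a second variable. For a view of mpg entirely on its own, a histogram is one option. This is a minimal base R sketch; cars_demo and mpg_hist are illustrative names, and the choice of about 10 bins is an assumption rather than anything from the original analysis:

```r
# Work on a copy of the built-in data set (cars_demo is a hypothetical name).
cars_demo <- datasets::mtcars

# Histogram of mpg alone; breaks = 10 is only a suggestion that R may adjust.
mpg_hist <- hist(cars_demo$mpg, breaks = 10,
                 main = "Distribution of Fuel Economy",
                 xlab = "Miles Per Gallon", col = "grey80")

# hist() invisibly returns the bin edges and counts, which is handy for checking.
mpg_hist$counts
```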

Create a scatterplot of two numeric variables. I created a scatterplot showing how quarter-mile time generally decreases with increasing horsepower. The grey band is the 95% confidence interval around the fitted linear regression line.

ggplot(data = mtcars, aes(x = hp, y = qsec)) +
  geom_point(shape = 17, color = "blue", size = 2) +
  geom_smooth(method = "lm", color = "red", linetype = 2) +
  labs(title = "Quarter-mile Time vs. Horsepower in the Motor Trend Car Dataset",
       x = "Horsepower", y = "Quarter-mile Time (sec)")
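
The trend line drawn by geom_smooth(method = "lm") is an ordinary least squares fit, so the relationship can also be checked numerically. The sketch below fits the same model with lm() and computes a 95% confidence interval for the fitted mean; cars_demo, fit, and new_hp are illustrative names, and the horsepower values 100 and 200 are arbitrary examples:

```r
# Work on a copy of the built-in data set (cars_demo is a hypothetical name).
cars_demo <- datasets::mtcars

# Ordinary least squares fit of quarter-mile time on horsepower.
fit <- lm(qsec ~ hp, data = cars_demo)
coef(fit)  # intercept and slope; a negative slope means faster times at higher hp

# Pearson correlation between the two variables.
cor(cars_demo$hp, cars_demo$qsec)

# 95% confidence interval for the fitted mean at two illustrative horsepower values.
new_hp <- data.frame(hp = c(100, 200))
predict(fit, newdata = new_hp, interval = "confidence", level = 0.95)
```

The lwr and upr columns returned by predict() correspond to the lower and upper edges of the grey band at those horsepower values.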