Here are some basic code examples for Exploratory Data Analysis (EDA).

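We’ll explore some of the capabilities of R using the cars dataset from the openintro package. (Newer releases of openintro may provide this dataset under the name cars93, with snake_case variable names such as mpg_city; the code below assumes the older cars naming.)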

library(openintro)
data(cars)

Displaying Raw Data

The function names() shows the variable names of the dataset, and the function head() prints its first n rows.

names(cars)
head(cars, n = 3)

The function str() displays the structure of the data frame. It shows the number of observations and variables, lists the variables and their types, and shows the first few elements of each variable.
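As a small illustration, str() can be tried on any data frame. The sketch below uses a hypothetical toy_cars data frame with made-up values (it is not the openintro data), so it runs without any extra packages.

```r
# Hypothetical stand-in for the cars data; all values are made up
toy_cars <- data.frame(
  type    = c("small", "midsize", "large", "small"),
  price   = c(12.5, 20.1, 28.3, 9.8),
  mpgCity = c(38, 22, 18, 41)
)

# Compact overview: number of observations and variables,
# each variable's type, and its first few values
str(toy_cars)
```

Calling str(cars) gives the same kind of overview for the real dataset.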

The functions dim(), nrow(), and ncol() show the dimensions, number of rows, and number of columns of the data frame, respectively.

dim(cars)
nrow(cars)
ncol(cars)

To access one of the variables, use $.

cars$type

Subsetting and Filtering Observations

We can subset data frames based on the values of observations. Here we keep the cars whose mpgCity is above 35 and report only the variables mpgCity, driveTrain, and passengers, using filter() and select() from the dplyr package. The piping operator %>% chains the two steps together.

library(dplyr)
cars %>% filter(mpgCity > 35) %>% select(mpgCity, driveTrain, passengers)
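For comparison, the same kind of row and column selection can be done in base R with subset(). This is only a sketch on a hypothetical toy_cars data frame with made-up values, not the openintro data.

```r
# Hypothetical stand-in for the cars data; all values are made up
toy_cars <- data.frame(
  type       = c("small", "midsize", "large", "small", "midsize", "large"),
  price      = c(12.5, 20.1, 28.3, 9.8, 17.6, 24.0),
  mpgCity    = c(38, 22, 18, 41, 25, 19),
  driveTrain = c("front", "front", "rear", "front", "4WD", "rear"),
  passengers = c(4, 5, 6, 5, 5, 6)
)

# Keep rows with mpgCity above 35, then keep only three columns
efficient <- subset(toy_cars,
                    mpgCity > 35,
                    select = c(mpgCity, driveTrain, passengers))
print(efficient)
```

The dplyr version reads left to right as a sequence of steps, which tends to be easier to follow once more steps are chained.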

Summarizing Data

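Let’s summarize the data using some basic summary statistics. The formula versions below, such as mean(~ price | type, data = cars), compute the statistic separately for each type; they come from the mosaic package, so load it first.

# the formula interface (~ price | type) requires mosaic
library(mosaic)

# mean
mean(cars$price)
mean(~ price | type, data = cars)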

# standard deviation
sd(cars$price)
sd(~ price | type, data = cars)

# median
median(cars$price)
median(~ price | type, data = cars)

# IQR (need to load mosaic package first)
iqr(cars$price)  
iqr(~ price | type, data = cars)

# many summary statistics at once (need to load mosaic package first)
favstats(cars$price)
favstats(~ price, data = cars)
favstats(~ price | type, data = cars)
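To see what the grouped summaries compute, here is a rough base R equivalent using aggregate() and tapply(). It is a sketch on a hypothetical toy_cars data frame with made-up values, not the openintro data.

```r
# Hypothetical stand-in for the cars data; all values are made up
toy_cars <- data.frame(
  type  = c("small", "midsize", "large", "small", "midsize", "large"),
  price = c(12.5, 20.1, 28.3, 9.8, 17.6, 24.0)
)

# Mean price within each type (one row per group)
mean_by_type <- aggregate(price ~ type, data = toy_cars, FUN = mean)
print(mean_by_type)

# Median price within each type, returned as a named array
median_by_type <- tapply(toy_cars$price, toy_cars$type, median)
print(median_by_type)
```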

Visualizing Data

We can also visualize the data using histograms, boxplots, and scatterplots (among many other options).

Histograms

Histograms are used to visualize a single quantitative variable. Use the histogram() function from the mosaic package.

library(mosaic)
histogram(cars$price)

A glance at the documentation (type ?histogram in the console) reveals a number of additional arguments. You should give each plot a title (main = "...") and remember to label your axes (xlab = "...", ylab = "..."). The breaks argument suggests to R how many bars to use for the histogram. By default, R will attempt to find a reasonable value for this argument.
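A labeled histogram might look like the sketch below. It assumes the lattice package is installed (histogram() is a lattice function, and lattice is loaded along with mosaic) and uses a hypothetical vector of made-up prices rather than the openintro data.

```r
library(lattice)  # provides histogram(); also loaded when mosaic is loaded

# Hypothetical prices; all values are made up
toy_price <- c(12.5, 20.1, 28.3, 9.8, 17.6, 24.0, 15.2, 31.4)

# Title, axis labels, and a suggested number of bars
p <- histogram(~ toy_price,
               breaks = 5,
               main = "Distribution of price",
               xlab = "Price",
               ylab = "Percent of total")
print(p)
```

The value passed to breaks is only a suggestion, so the number of bars actually drawn may differ slightly.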

Boxplots

Boxplots are used to visualize the relationship between a categorical and a numerical variable. In the cars dataset, the variable type takes the values small, midsize, and large.

unique(cars$type)
bwplot(price ~ type, data = cars)

We can also use a boxplot to visualize a single numerical variable, but this is less common.

bwplot(cars$price, xlab = "Price")

Bargraphs

Bargraphs can be made using the function bargraph() from the mosaic package.

bargraph( ~ passengers, data = cars)

Functions from the mosaic package generally allow us to split a plot by group. Here we split by the type variable.

bargraph( ~ passengers | type, data = cars, auto.key = TRUE)
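By default, the bars in these plots show how many observations fall in each category. The sketch below computes those counts directly with table() and xtabs() in base R, on a hypothetical toy_cars data frame with made-up values.

```r
# Hypothetical stand-in for the cars data; all values are made up
toy_cars <- data.frame(
  type       = c("small", "midsize", "large", "small", "midsize", "large"),
  passengers = c(4, 5, 6, 5, 5, 6)
)

# Number of cars for each passenger capacity
passenger_counts <- table(toy_cars$passengers)
print(passenger_counts)

# The same counts split by type (rows: passengers, columns: type)
counts_by_type <- xtabs(~ passengers + type, data = toy_cars)
print(counts_by_type)
```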

Scatterplots

Finally, to visualize the relationship between two numerical variables we use a scatterplot. This is done using the plot function and the ~ formula format.

plot(mpgCity ~ weight, data = cars)

Examining the documentation (?plot) reveals a number of ways to make our scatterplot more attractive and readable.

plot(mpgCity ~ weight, data = cars,
     xlab = "Weight",
     ylab = "mpg City",
     main = "mpg City versus Weight",
     col = "red")

We can also use the gf_point() function from the ggformula package (which is also loaded along with mosaic).

library(ggformula)
gf_point(mpgCity ~ weight, data = cars, color = "red")

This function also allows us to easily plot by group. We can draw three separate plots, one for each type, using the vertical bar |.

gf_point(mpgCity ~ weight | type, data = cars)

Or we can use a single plot and use the color argument to change color based on type. A legend indicating which color corresponds to which group level is added automatically. (Don’t forget to include the ~ when using a variable to define the color!)

gf_point(mpgCity ~ weight, color = ~type, data = cars)

Incorporating dplyr

We can also use the piping operator %>% and functions from the dplyr package when plotting. For example, to plot the mpgCity versus weight for only cheap cars, we can use the command below.

gf_point(mpgCity ~ weight,
         color = ~type,
         data = cars %>% filter(price < 19),
         title = "mpg versus weight for cheap cars",
         xlab = "weight",
         ylab = "mpg (city)")