library(tibble) # special type of data frame
library(magrittr) # pipes
library(dplyr) # data manipulation
library(ggplot2) # pretty plots
library(tidyr) # reshape data frames; mostly for ggplots
This is a simple, streamlined set of example code covering some basic summary statistics and graphs to facilitate preliminary data exploration.
We will be using a few data sets during the course. One of the classic statistical learning data sets is the iris data set, which comes with any standard installation of R. This famous (Fisher’s or Anderson’s) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.
?iris # open help file for the object
dim(iris) # get dimensions of data
[1] 150 5
class(iris) # a data frame class object
[1] "data.frame"
The first goal for any analysis is to understand the dimensions and structure of the data set you are working with. Here are some simple commands to understand the general structure of your data set, typically in the form of a data.frame (or nowadays tibble) class object:
dim(iris) # dimensions (rows, columns)
[1] 150 5
str(iris) # structure of the data frame object
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
iris[1:5, ] # echo the first 5 rows
names(iris) # what are the column names
[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
sapply(iris, class) # what are the classes of each column (very important!); feature vs. class data
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
"numeric" "numeric" "numeric" "numeric" "factor"
table(iris$Species) # once you know which column contains the classes, what are the class counts?
setosa versicolor virginica
50 50 50
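As a small complement to the commands above, the following sketch checks for missing values and separates the feature columns from the class column. The variable names (`na_counts`, `is_feature`, `feature_cols`, `class_col`) are illustrative and not part of the original code.

```r
# Count missing values in each column
na_counts <- colSums(is.na(iris))
na_counts

# Flag numeric columns as features; the remaining factor column holds the classes
is_feature <- sapply(iris, is.numeric)
feature_cols <- names(iris)[is_feature]
class_col <- names(iris)[!is_feature]
feature_cols
class_col
```

This only works as a shortcut when every feature is numeric and the class is the only non-numeric column, as is the case for iris.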
The summary function is an S3 generic function: it dispatches to a different method depending on the class of the object it is called on, in this case a data.frame class object. The data.frame method summarizes the columns, reporting a summary appropriate to the class of each:
summary(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100 setosa :50
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300 versicolor:50
Median :5.800 Median :3.000 Median :4.350 Median :1.300 virginica :50
Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
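To make the idea of S3 dispatch concrete, here is a minimal sketch of a user-defined generic. The generic `describe()` and its methods are hypothetical, invented for illustration; they do not exist in base R or in any of the packages loaded above.

```r
# Hypothetical generic; UseMethod() picks a method based on class(x)
describe <- function(x, ...) UseMethod("describe")

# Method used for numeric vectors
describe.numeric <- function(x, ...) {
  c(mean = mean(x), sd = sd(x))
}

# Method used for factors
describe.factor <- function(x, ...) {
  table(x)
}

# Fallback when no class-specific method exists
describe.default <- function(x, ...) {
  stop("no describe() method for class: ", class(x)[1L])
}

num_summary <- describe(iris$Sepal.Length) # dispatches to describe.numeric
fac_summary <- describe(iris$Species)      # dispatches to describe.factor
num_summary
fac_summary
```

The same call, `describe(x)`, yields a different kind of summary for each column class, which is the behavior `summary()` shows per column in the output above.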
However, it is often more useful from a classification perspective to know the feature means by class, which can be accomplished by combining sapply (looping over the feature columns) with tapply (applying mean within each species):
sapply(iris[, -5], function(x)
tapply(x, iris$Species, mean)) # feature means split by Species
Sepal.Length Sepal.Width Petal.Length Petal.Width
setosa 5.006 3.428 1.462 0.246
versicolor 5.936 2.770 4.260 1.326
virginica 6.588 2.974 5.552 2.026
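For comparison, base R's `aggregate()` with a formula can produce the same per-class means in a single call. This is offered as an alternative sketch, not as the approach used above; the name `means_by_species` is illustrative.

```r
# Formula interface: summarize every other column (.) grouped by Species
means_by_species <- aggregate(. ~ Species, data = iris, FUN = mean)
means_by_species
```

Unlike the sapply/tapply combination, which returns a matrix with the species as row names, this returns a data frame with `Species` as an ordinary column.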
A newer syntax, usually referred to as tidy data analysis and spearheaded by Hadley Wickham, has become quite popular in recent years. Here is how you would get the same answer using this syntax:
iris %>%
  tibble::as_tibble() %>% # make 'iris' a `tibble` data frame; as.tibble() is deprecated
  dplyr::group_by(Species) %>%
  dplyr::summarise_all(mean) # funs() is deprecated; pass the function directly
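The grouped pipeline extends naturally to several statistics at once. The sketch below assumes dplyr 1.0.0 or later, where `across()` is available; the name `iris_summary` is illustrative.

```r
library(dplyr)

# Mean and standard deviation of every feature, by species.
# Assumes dplyr >= 1.0.0 (across() was introduced in that release).
iris_summary <- iris %>%
  group_by(Species) %>%
  summarise(across(everything(), list(mean = mean, sd = sd)))

names(iris_summary)
```

Inside a grouped `summarise()`, `everything()` skips the grouping column, so only the four measurements are summarized. The result has one row per species and columns named like `Sepal.Length_mean` and `Sepal.Length_sd`.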
The first four commands below produce exactly the same output; they differ only in how the Sepal.Length column is selected:
split(iris$Sepal.Length, iris$Species)
$setosa
[1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1 5.7 5.1 5.4 5.1 4.6
[24] 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5 4.9 5.0 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8
[47] 5.1 4.6 5.3 5.0
$versicolor
[1] 7.0 6.4 6.9 5.5 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8 6.2 5.6 5.9 6.1 6.3
[24] 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4 6.0 6.7 6.3 5.6 5.5 5.5 6.1 5.8 5.0 5.6 5.7
[47] 5.7 6.2 5.1 5.7
$virginica
[1] 6.3 5.8 7.1 6.3 6.5 7.6 4.9 7.3 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7 6.0 6.9 5.6 7.7
[24] 6.3 6.7 7.2 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8 6.7 6.7
[47] 6.3 6.5 6.2 5.9
split(iris[["Sepal.Length"]], iris$Species) # select by name with [[ ]]; output identical to the above
split(iris[[1L]], iris$Species) # select by integer position; output identical to the above
split(iris[[which(names(iris) == "Sepal.Length")]], iris$Species) # look up the position by name; output identical to the above
split(iris, iris$Species) # splits the entire data frame: a list of 3 data frames, one per species
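Rather than comparing printed output by eye, the equivalence of the column selections can be checked with `identical()`. The sketch below also shows how splitting the whole data frame differs. All object names here are illustrative.

```r
# Four ways of selecting the same column before splitting
by_dollar <- split(iris$Sepal.Length, iris$Species)
by_name   <- split(iris[["Sepal.Length"]], iris$Species)
by_index  <- split(iris[[1L]], iris$Species)
by_lookup <- split(iris[[which(names(iris) == "Sepal.Length")]], iris$Species)

# identical() returns TRUE only for exactly equal objects
all_same <- identical(by_dollar, by_name) &&
  identical(by_name, by_index) &&
  identical(by_index, by_lookup)
all_same

# Splitting the whole data frame gives a list of three data frames instead
by_species <- split(iris, iris$Species)
sapply(by_species, nrow)
```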
par(mfrow = c(2, 2)) # make 2x2 grid for plots
par(mgp = c(2, 0.75, 0), mar = c(3, 4, 3, 1)) # graphics settings; squeeze margins
purrr::walk(names(iris)[-5], function(feature) # purrr::walk() suppresses output; side effects only
  boxplot(split(iris[[feature]], iris$Species), main = feature, col = 1:3))
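The numbers behind a boxplot can be inspected without drawing it, which helps when reading the panels. A minimal sketch for one feature, using the formula interface of `boxplot()`; the names `bp` and `box_stats` are illustrative.

```r
# plot = FALSE returns the statistics without drawing anything
bp <- boxplot(Sepal.Length ~ Species, data = iris, plot = FALSE)

# Rows of bp$stats: lower whisker, lower hinge, median, upper hinge, upper whisker
box_stats <- bp$stats
colnames(box_stats) <- bp$names
box_stats

bp$out # values that would be drawn as individual outlier points
```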
df <- iris %>%
  tidyr::gather(key = "Feature", value = "cm", -Species) %>% # re-arrange to long format
  dplyr::mutate(Feature = gsub("\\.", " ", Feature)) %>% # replace dot in name with a space
  tibble::as_tibble() # make tibble for easy viewing
df
df %>%
ggplot(aes(y = cm, x = Species, fill = Species)) +
geom_boxplot(color = "#1F3552", alpha = 0.75, size = 0.5) +
scale_x_discrete(name = "Species") +
ggtitle("Overall Title") +
theme(plot.title = element_text(hjust = 0.5)) +
facet_wrap(~Feature, ncol = 2)
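The faceted plot depends on the data being in long format, with one row per measurement. The sketch below shows the same reshape using `tidyr::pivot_longer()`, which assumes tidyr 1.0.0 or later, and compares the dimensions before and after. The name `iris_long` is illustrative.

```r
library(tidyr)

# Wide to long: one row per (flower, feature) measurement.
# pivot_longer() assumes tidyr >= 1.0.0; gather() does the same job in older code.
iris_long <- pivot_longer(
  iris,
  cols = -Species,
  names_to = "Feature",
  values_to = "cm"
)

dim(iris)      # 150 rows, 5 columns
dim(iris_long) # 600 rows, 3 columns
```

Each of the 150 flowers contributes four rows, one per feature, which is what lets `facet_wrap(~Feature)` draw one panel per feature.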
Scatter plots are very useful for visually inspecting variables for patterns of interest. You will typically need to know your outcome of interest a priori, i.e. supervised analysis, or you will not know how to color the points.
plot(iris[, -5], col = iris$Species)
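Passing the factor directly as `col` relies on its integer codes indexing the current `palette()`. A more explicit sketch maps each species to a chosen color and adds a legend. The hex colors and the names `species_cols` and `point_cols` are arbitrary choices for illustration.

```r
# Hypothetical palette; any three distinguishable colors would do
species_cols <- c(setosa = "#1B9E77", versicolor = "#D95F02", virginica = "#7570B3")

# Look up one color per row by species name
point_cols <- unname(species_cols[as.character(iris$Species)])

plot(iris$Petal.Length, iris$Petal.Width,
     col = point_cols, pch = 19,
     xlab = "Petal length (cm)", ylab = "Petal width (cm)")
legend("topleft", legend = names(species_cols), col = species_cols, pch = 19)
```

Naming the colors by species keeps the mapping stable even if the factor levels are later reordered.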
iris %>%
ggplot(aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
geom_point(alpha = 0.5, size = 2.5) +
ggtitle("Title Goes Here") +
# this is a useful trick; add `NULL` as last line
# you then don't need to manage the `+` signs above when building the layers
NULL
Created on 2018-09-03 by the Rmarkdown package (v1.10) and R version 3.5.1 (2018-07-02).