Learning Objectives:
- Get started with "dplyr" and its basic verbs: slice(), filter(), select(), mutate(), arrange(), summarise(), group_by()
- Get started with "ggplot2"
- Produce basic plots with ggplot()
- Write your work in an Rmd (R markdown) file.
- Name this file lab02-first-last.Rmd, where first and last are your first and last names (e.g. lab02-gaston-sanchez.Rmd).
- Knit your Rmd file as an html document (default option).
- Submit your Rmd and html files to bCourses, in the corresponding lab assignment.
In this lab, you will start learning a couple of approaches to manipulate tables and create statistical graphics. We are going to use the functionality of the package "dplyr" to work with tabular data (in a syntactic way). This is a fairly recent package introduced a couple of years ago, but it is based on more than a decade of research and work led by Hadley Wickham.
Likewise, to create graphics in a fairly consistent and visually pleasing way, we are going to use the package "ggplot2", also originally authored by Hadley Wickham, and developed as part of his PhD more than a decade ago.
While you follow this lab, you may want to open these cheat sheets:
I’m assuming that you already installed the packages "dplyr" and "ggplot2". If that’s not the case, then run the command below on the console (do NOT include this command in your Rmd):
# don't include this command in your Rmd file
# don't worry too much if you get a warning message
install.packages(c("dplyr", "ggplot2"))
Remember that you only need to install a package once! After a package has been installed on your machine, there is no need to call install.packages() again on the same package. What you should always invoke, in order to use the functions in a package, is the library() function:
# (include these commands in your Rmd file)
# load the packages
library(dplyr)
library(ggplot2)
About loading packages: Another rule to keep in mind is to always load any required packages at the very top of your script files (.R or .Rmd or .Rnw files). Avoid calling the library() function in the middle of a script. Instead, load all the packages before anything else.
starwars
The data for this lab has to do with Star Wars characters. The dataset, starwars, is part of the "dplyr" package. So, assuming that you loaded the package "dplyr", simply type the name of the object:
# assuming you loaded dplyr ...
starwars
"dplyr" verbs
To make the learning process of "dplyr" gentler, Hadley Wickham proposes beginning with a set of five basic verbs or operations for data frames (each verb corresponds to a function in "dplyr").
I’ve slightly modified Hadley’s list of verbs:
- filter(), slice(), and select(): subsetting and selecting rows and columns
- mutate(): add new variables
- arrange(): reorder rows
- summarise(): reduce variables to values
- group_by(): grouped (aggregated) operations
slice() allows you to select rows by position:
# first three rows
three_rows <- slice(starwars, 1:3)
three_rows
## # A tibble: 3 x 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Luke… 172 77 blond fair blue 19 male mascu…
## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
## 3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
## # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
filter() allows you to select rows by defining a condition (which could be simple or compound):
# subset rows given a simple condition
# (height greater than 200 cm)
gt_200 <- filter(starwars, height > 200)
gt_200
## # A tibble: 10 x 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Dart… 202 136 none white yellow 41.9 male mascu…
## 2 Chew… 228 112 brown unknown blue 200 male mascu…
## 3 Roos… 224 82 none grey orange NA male mascu…
## 4 Rugo… 206 NA none green orange NA male mascu…
## 5 Yara… 264 NA none white yellow NA male mascu…
## 6 Lama… 229 88 none grey black NA male mascu…
## 7 Taun… 213 NA none grey black NA fema… femin…
## 8 Grie… 216 159 none brown, wh… green, y… NA male mascu…
## 9 Tarf… 234 136 brown brown blue NA male mascu…
## 10 Tion… 206 80 none grey black NA male mascu…
## # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
# subset rows given a compound condition
filter(starwars, height > 200 & mass < 100)
## # A tibble: 3 x 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Roos… 224 82 none grey orange NA male mascu…
## 2 Lama… 229 88 none grey black NA male mascu…
## 3 Tion… 206 80 none grey black NA male mascu…
## # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
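Note that ten characters are taller than 200 cm, yet the three of them with a missing mass do not appear in the compound result. This is because a comparison with NA yields NA, and filter() keeps only rows where the condition is TRUE. The sketch below illustrates this on a small made-up data frame (the object toy and its values are hypothetical, not part of the lab data):

```r
library(dplyr)

# toy data frame (hypothetical values) with one missing mass
toy <- data.frame(
  name   = c("a", "b", "c"),
  height = c(210, 220, 150),
  mass   = c(90, NA, 60),
  stringsAsFactors = FALSE
)

# "b" is taller than 200, but mass < 100 is NA for it,
# so the whole condition is NA and the row is dropped
tall_light <- filter(toy, height > 200 & mass < 100)
tall_light

# to keep rows with unknown mass, ask for them explicitly
tall_light_or_na <- filter(toy, height > 200 & (mass < 100 | is.na(mass)))
tall_light_or_na
```

Whether rows with missing values should be kept depends on the question being asked; the second call is just one way to make that choice visible in the code.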
select() allows you to select one or more columns by name:
# columns by name
name_height <- select(starwars, name, height)
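Besides listing names one by one, select() accepts a few other ways of describing columns. The following is a short sketch of three of them; the object names (first_three, no_lists, colors) are only for illustration:

```r
library(dplyr)

# a contiguous range of columns, from name to mass
first_three <- select(starwars, name:mass)
first_three

# everything except the list-columns
no_lists <- select(starwars, -films, -vehicles, -starships)
no_lists

# columns whose names end with "color"
colors <- select(starwars, ends_with("color"))
colors
```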
Use slice() to subset the data by selecting the first 5 rows.
slice(starwars, 1:5)
## # A tibble: 5 x 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Luke… 172 77 blond fair blue 19 male mascu…
## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
## 3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
## 4 Dart… 202 136 none white yellow 41.9 male mascu…
## 5 Leia… 150 49 brown light brown 19 fema… femin…
## # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
Use slice() to subset the data by selecting rows 10, 15, 20, …, 50. Optional hint: seq() is your friend.
slice(starwars, seq(10, 50, 5))
## # A tibble: 9 x 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Obi-… 182 77 auburn, w… fair blue-gray 57 male mascu…
## 2 Gree… 173 74 <NA> green black 44 male mascu…
## 3 Palp… 170 75 grey pale yellow 82 male mascu…
## 4 Lobot 175 79 none light blue 37 male mascu…
## 5 Nien… 160 68 none grey black NA male mascu…
## 6 Roos… 224 82 none grey orange NA male mascu…
## 7 Quar… 183 NA black dark brown 62 <NA> <NA>
## 8 Dud … 94 45 none blue, grey yellow NA male mascu…
## 9 Kit … 196 87 none green black NA male mascu…
## # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
Use slice() to subset the data by selecting the last 5 rows.
slice(starwars, 83:87)
## # A tibble: 5 x 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Rey NA NA brown light hazel NA fema… femin…
## 2 Poe … NA NA brown light brown NA male mascu…
## 3 BB8 NA NA none none black NA none mascu…
## 4 Capt… NA NA unknown unknown unknown NA <NA> <NA>
## 5 Padm… 165 45 brown light brown 46 fema… femin…
## # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
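The positions 83:87 work only because this version of the table has 87 rows. A possible way to avoid hard-coding the positions is to compute them from the number of rows, as in this sketch (the object names last_five and last_five_base are just for illustration):

```r
library(dplyr)

# inside slice(), n() gives the number of rows of the data frame
last_five <- slice(starwars, (n() - 4):n())
last_five

# same idea, computing the number of rows beforehand with nrow()
n_rows <- nrow(starwars)
last_five_base <- slice(starwars, (n_rows - 4):n_rows)
last_five_base
```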
Use filter() to subset those individuals with height less than 100 cm.
filter(starwars, height < 100)
## # A tibble: 7 x 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
## 2 R5-D4 97 32 <NA> white, red red NA none mascu…
## 3 Yoda 66 17 white green brown 896 male mascu…
## 4 Wick… 88 20 brown brown brown 8 male mascu…
## 5 Dud … 94 45 none blue, grey yellow NA male mascu…
## 6 Ratt… 79 15 none grey, blue unknown NA male mascu…
## 7 R4-P… 96 NA none silver, r… red, blue NA none femin…
## # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
Use filter() to subset rows of female individuals (gender).
filter(starwars, gender == "feminine")
## # A tibble: 17 x 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Leia… 150 49 brown light brown 19 fema… femin…
## 2 Beru… 165 75 brown light blue 47 fema… femin…
## 3 Mon … 150 NA auburn fair blue 48 fema… femin…
## 4 Shmi… 163 NA black fair brown 72 fema… femin…
## 5 Ayla… 178 55 none blue hazel 48 fema… femin…
## 6 Adi … 184 50 none dark blue NA fema… femin…
## 7 Cordé 157 NA brown light brown NA fema… femin…
## 8 Lumi… 170 56.2 black yellow blue 58 fema… femin…
## 9 Barr… 166 50 black yellow blue 40 fema… femin…
## 10 Dormé 165 NA brown light brown NA fema… femin…
## 11 Zam … 168 55 blonde fair, gre… yellow NA fema… femin…
## 12 Taun… 213 NA none grey black NA fema… femin…
## 13 Joca… 167 NA white fair blue NA fema… femin…
## 14 R4-P… 96 NA none silver, r… red, blue NA none femin…
## 15 Shaa… 178 57 none red, blue… black NA fema… femin…
## 16 Rey NA NA brown light hazel NA fema… femin…
## 17 Padm… 165 45 brown light brown 46 fema… femin…
## # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
Use filter() to subset rows of individuals with brown hair color.
filter(starwars, hair_color == "brown")
## # A tibble: 18 x 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Leia… 150 49 brown light brown 19 fema… femin…
## 2 Beru… 165 75 brown light blue 47 fema… femin…
## 3 Chew… 228 112 brown unknown blue 200 male mascu…
## 4 Han … 180 80 brown fair brown 29 male mascu…
## 5 Wedg… 170 77 brown fair hazel 21 male mascu…
## 6 Jek … 180 110 brown fair blue NA male mascu…
## 7 Arve… NA NA brown fair brown NA male mascu…
## 8 Wick… 88 20 brown brown brown 8 male mascu…
## 9 Qui-… 193 89 brown fair blue 92 male mascu…
## 10 Ric … 183 NA brown fair blue NA <NA> <NA>
## 11 Cordé 157 NA brown light brown NA fema… femin…
## 12 Clie… 183 NA brown fair blue 82 male mascu…
## 13 Dormé 165 NA brown light brown NA fema… femin…
## 14 Tarf… 234 136 brown brown blue NA male mascu…
## 15 Raym… 188 79 brown light brown NA male mascu…
## 16 Rey NA NA brown light hazel NA fema… femin…
## 17 Poe … NA NA brown light brown NA male mascu…
## 18 Padm… 165 45 brown light brown 46 fema… femin…
## # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
Use filter() and then select(), to subset rows of individuals from Naboo, and then display their names.
naboo_home_world <- filter(starwars, homeworld == "Naboo")
select(naboo_home_world, name)
## # A tibble: 11 x 1
## name
## <chr>
## 1 R2-D2
## 2 Palpatine
## 3 Jar Jar Binks
## 4 Roos Tarpals
## 5 Rugor Nass
## 6 Ric Olié
## 7 Quarsh Panaka
## 8 Gregar Typho
## 9 Cordé
## 10 Dormé
## 11 Padmé Amidala
Use "dplyr" functions to display the names of individuals with green skin color.
green_skin <- filter(starwars, skin_color == "green")
select(green_skin, name)
## # A tibble: 6 x 1
## name
## <chr>
## 1 Greedo
## 2 Yoda
## 3 Bossk
## 4 Rugor Nass
## 5 Kit Fisto
## 6 Poggle the Lesser
Find how to select() the name, height, and mass of male individuals with brown or black hair color.
# parentheses are needed: & is evaluated before |
male_black_brown_hair <- filter(starwars, (hair_color == "black" | hair_color == "brown") & sex == "male")
select(male_black_brown_hair, name, height, mass)
## # A tibble: 20 x 3
## name height mass
## <chr> <int> <dbl>
## 1 Biggs Darklighter 183 84
## 2 Chewbacca 228 112
## 3 Han Solo 180 80
## 4 Wedge Antilles 170 77
## 5 Jek Tono Porkins 180 110
## 6 Boba Fett 183 78.2
## 7 Lando Calrissian 177 79
## 8 Arvel Crynyd NA NA
## 9 Wicket Systri Warrick 88 20
## 10 Qui-Gon Jinn 193 89
## # … with 10 more rows
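When a condition mixes | and &, keep in mind that R evaluates & before |, so a | b & c is read as a | (b & c). The sketch below shows the difference on a small made-up data frame (toy and its values are hypothetical):

```r
library(dplyr)

# toy data frame (hypothetical values)
toy <- data.frame(
  name = c("a", "b", "c", "d"),
  hair = c("black", "black", "brown", "brown"),
  sex  = c("female", "male", "female", "male"),
  stringsAsFactors = FALSE
)

# without parentheses: read as  black | (brown & male)
no_parens <- filter(toy, hair == "black" | hair == "brown" & sex == "male")
no_parens$name    # "a" "b" "d"

# with parentheses: (black | brown) & male
with_parens <- filter(toy, (hair == "black" | hair == "brown") & sex == "male")
with_parens$name  # "b" "d"

# %in% expresses the same idea without relying on precedence
with_in <- filter(toy, hair %in% c("black", "brown") & sex == "male")
with_in$name      # "b" "d"
```

The %in% form tends to be easier to read once more than two values are involved.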
human_female <- filter(starwars, species == "Human" & sex == "female")
select(human_female, name, homeworld)
## # A tibble: 9 x 2
## name homeworld
## <chr> <chr>
## 1 Leia Organa Alderaan
## 2 Beru Whitesun lars Tatooine
## 3 Mon Mothma Chandrila
## 4 Shmi Skywalker Tatooine
## 5 Cordé Naboo
## 6 Dormé Naboo
## 7 Jocasta Nu Coruscant
## 8 Rey <NA>
## 9 Padmé Amidala Naboo
mutate()
Another basic verb is mutate(), which allows you to add new variables. Let’s create a small data frame for the female individuals with three columns: name, height, and mass:
# creating a small data frame step by step
fem <- filter(starwars, sex == "female")
fem <- select(fem, name, height, mass)
fem <- slice(fem, c(1, 2, 5, 6, 8))
fem
## # A tibble: 5 x 3
## name height mass
## <chr> <int> <dbl>
## 1 Leia Organa 150 49
## 2 Beru Whitesun lars 165 75
## 3 Ayla Secura 178 55
## 4 Adi Gallia 184 50
## 5 Luminara Unduli 170 56.2
Now, let’s use mutate() to (temporarily) add a column with the ratio height / mass:
mutate(fem, height / mass)
You can also give the new column a name, like ht_wt = height / mass:
mutate(fem, ht_wt = height / mass)
In order to permanently change the data, you need to assign the changes to an object:
# height is in cm and mass in kg, so convert them to meters and pounds
fem2 <- mutate(fem, ht_m = height / 100, wt_lb = mass / 0.4536)
fem2
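mutate() can create several columns in one call, and a later expression may use a column defined earlier in the same call. Here is a minimal sketch, assuming heights in cm and masses in kg; the data frame fem_small and the column names height_m and bmi are made up for the illustration (the two rows reuse values shown in fem):

```r
library(dplyr)

fem_small <- data.frame(
  name   = c("Leia Organa", "Beru Whitesun lars"),
  height = c(150, 165),   # cm
  mass   = c(49, 75),     # kg
  stringsAsFactors = FALSE
)

# height_m is created first, and bmi uses it right away
fem_bmi <- mutate(
  fem_small,
  height_m = height / 100,
  bmi = mass / height_m^2
)
fem_bmi
```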
arrange()
The next basic verb of "dplyr" is arrange(), which allows you to reorder rows. For example, here’s how to arrange the rows of fem by height:
# order rows by height (increasingly)
arrange(fem, height)
By default, arrange() sorts rows in increasing order. To arrange rows in descending order, you need to use the auxiliary function desc().
# order rows by height (decreasingly)
arrange(fem, desc(height))
# order rows by height, and then mass
arrange(fem, height, mass)
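One detail worth knowing when sorting columns with missing values: arrange() places NA at the end, in both increasing and decreasing order. A small sketch with made-up values (toy is hypothetical):

```r
library(dplyr)

# toy data frame (hypothetical values) with one missing height
toy <- data.frame(
  name   = c("a", "b", "c"),
  height = c(180, NA, 150),
  stringsAsFactors = FALSE
)

# increasing order: the row with NA comes last
asc <- arrange(toy, height)
asc$name    # "c" "a" "b"

# decreasing order: the row with NA still comes last
dsc <- arrange(toy, desc(height))
dsc$name    # "a" "c" "b"
```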
Using the data frame fem, add a new variable product with the product of height and mass.
mutate(fem, product = height * mass)
## # A tibble: 5 x 4
## name height mass product
## <chr> <int> <dbl> <dbl>
## 1 Leia Organa 150 49 7350
## 2 Beru Whitesun lars 165 75 12375
## 3 Ayla Secura 178 55 9790
## 4 Adi Gallia 184 50 9200
## 5 Luminara Unduli 170 56.2 9554
Create a new data frame fem3, by adding columns log_height and log_mass with the log transformations (base 10) of height and mass.
fem3 <- mutate(fem, log_height = log10(height), log_mass = log10(mass))
fem3
## # A tibble: 5 x 5
## name height mass log_height log_mass
## <chr> <int> <dbl> <dbl> <dbl>
## 1 Leia Organa 150 49 2.18 1.69
## 2 Beru Whitesun lars 165 75 2.22 1.88
## 3 Ayla Secura 178 55 2.25 1.74
## 4 Adi Gallia 184 50 2.26 1.70
## 5 Luminara Unduli 170 56.2 2.23 1.75
Use filter() and arrange() to display those individuals with height less than 150 cm, in increasing order by height.
height_order_150 <- filter(starwars, height < 150)
arrange(height_order_150, height)
## # A tibble: 10 x 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Yoda 66 17 white green brown 896 male mascu…
## 2 Ratt… 79 15 none grey, blue unknown NA male mascu…
## 3 Wick… 88 20 brown brown brown 8 male mascu…
## 4 Dud … 94 45 none blue, grey yellow NA male mascu…
## 5 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
## 6 R4-P… 96 NA none silver, r… red, blue NA none femin…
## 7 R5-D4 97 32 <NA> white, red red NA none mascu…
## 8 Sebu… 112 40 none grey, red orange NA male mascu…
## 9 Gasg… 122 NA none white, bl… black NA male mascu…
## 10 Watto 137 NA black blue, grey yellow NA male mascu…
## # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
# five tallest individuals: name, homeworld, and species
top_height <- arrange(starwars, desc(height))
top5_height <- slice(top_height, 1:5)
select(top5_height, name, homeworld, species)
## # A tibble: 5 x 3
## name homeworld species
## <chr> <chr> <chr>
## 1 Yarael Poof Quermia Quermian
## 2 Tarfful Kashyyyk Wookiee
## 3 Lama Su Kamino Kaminoan
## 4 Chewbacca Kashyyyk Wookiee
## 5 Roos Tarpals Naboo Gungan
# five heaviest individuals: name, homeworld, and species
top_weight <- arrange(starwars, desc(mass))
top5_weight <- slice(top_weight, 1:5)
select(top5_weight, name, homeworld, species)
## # A tibble: 5 x 3
## name homeworld species
## <chr> <chr> <chr>
## 1 Jabba Desilijic Tiure Nal Hutta Hutt
## 2 Grievous Kalee Kaleesh
## 3 IG-88 <NA> Droid
## 4 Darth Vader Tatooine Human
## 5 Tarfful Kashyyyk Wookiee
summarise()
The next verb is summarise(). Conceptually, this involves applying a function to one or more columns, in order to summarize values. This is probably easier to understand with an example.
Say you are interested in calculating the average height of all individuals. To do this “a la dplyr” you use summarise(), or its synonym function summarize():
# average height (removing missing values)
summarise(starwars, avg_height = mean(height, na.rm = TRUE))
## # A tibble: 1 x 1
## avg_height
## <dbl>
## 1 174.
What if you want to calculate some summary statistics for height: min, median, mean, and max?
# some stats for height (dplyr)
summarise(
  starwars,
  min = min(height, na.rm = TRUE),
  median = median(height, na.rm = TRUE),
  avg = mean(height, na.rm = TRUE),
  max = max(height, na.rm = TRUE)
)
## # A tibble: 1 x 4
## min median avg max
## <int> <int> <dbl> <int>
## 1 66 180 174. 264
To actually appreciate the power of summarise(), we need to introduce the other major basic verb in "dplyr": group_by(). This is the function that allows you to perform data aggregations, or grouped operations.
Let’s see the combination of summarise() and group_by() to calculate the average height by homeworld:
# average height, grouped by homeworld
summarise(
  group_by(starwars, homeworld),
  avg_height = mean(height, na.rm = TRUE)
)
## `summarise()` ungrouping output (override with `.groups` argument)
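Grouped summaries are not limited to one statistic. As a sketch, the helper n() counts the rows in each group, which is handy next to an average because it shows how many individuals that average is based on (grouping by species and the name by_species are choices made for this illustration):

```r
library(dplyr)

# number of individuals and average height, by species
by_species <- summarise(
  group_by(starwars, species),
  count = n(),
  avg_height = mean(height, na.rm = TRUE)
)

# most frequent species first
arrange(by_species, desc(count))
```

If every height in a group is missing, mean(..., na.rm = TRUE) returns NaN for that group.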
Here’s a fancier example: average mass and height, by homeworld, displayed in descending order by average height:
arrange(
  summarise(
    group_by(starwars, homeworld),
    avg_height = mean(height, na.rm = TRUE),
    avg_mass = mean(mass, na.rm = TRUE)
  ),
  desc(avg_height)
)
## `summarise()` ungrouping output (override with `.groups` argument)
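Nested calls like the one above are read from the inside out, which can get hard to follow. One alternative, sketched here, is to store each step in an intermediate object (the names by_world, world_stats, and world_sorted are made up for the illustration):

```r
library(dplyr)

# step 1: group the rows by homeworld
by_world <- group_by(starwars, homeworld)

# step 2: one row per homeworld, with the two averages
world_stats <- summarise(
  by_world,
  avg_height = mean(height, na.rm = TRUE),
  avg_mass = mean(mass, na.rm = TRUE)
)

# step 3: sort by average height, tallest first
world_sorted <- arrange(world_stats, desc(avg_height))
world_sorted
```

The result is the same table; the only difference is that each step can be inspected on its own.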
Use summarise() to get the largest height value.
summarise(starwars, max_height = max(height, na.rm = TRUE))
## # A tibble: 1 x 1
## max_height
## <int>
## 1 264
Use summarise() to get the standard deviation of mass.
summarise(starwars, sd_mass = sd(mass, na.rm = TRUE))
## # A tibble: 1 x 1
## sd_mass
## <dbl>
## 1 169.
Use summarise() and group_by() to display the median of mass, by homeworld.
summarise(group_by(starwars, homeworld), median_mass = median(mass, na.rm = TRUE))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 49 x 2
## homeworld median_mass
## <chr> <dbl>
## 1 Alderaan 64
## 2 Aleen Minor 15
## 3 Bespin 79
## 4 Bestine IV 110
## 5 Cato Neimoidia 90
## 6 Cerea 82
## 7 Champala NA
## 8 Chandrila NA
## 9 Concord Dawn 79
## 10 Corellia 78.5
## # … with 39 more rows
# average mass by sex, in increasing order
avg_mass_gender <- summarise(group_by(starwars, sex), average_mass = mean(mass, na.rm = TRUE))
## `summarise()` ungrouping output (override with `.groups` argument)
arrange(avg_mass_gender, average_mass)
## # A tibble: 5 x 2
## sex average_mass
## <chr> <dbl>
## 1 <NA> 48
## 2 female 54.7
## 3 none 69.8
## 4 male 81.0
## 5 hermaphroditic 1358
Obtain the mean and standard deviation of height, for female characters.
summarise(
  filter(starwars, sex == "female"),
  average_height = mean(height, na.rm = TRUE),
  sd_height = sd(height, na.rm = TRUE)
)
## # A tibble: 1 x 2
## average_height sd_height
## <dbl> <dbl>
## 1 169. 15.3
ggplot()
The package "ggplot2" is probably the most popular package in R to create beautiful static graphics. Compared to the functions in the base package "graphics", the package "ggplot2" follows a somewhat different philosophy, and it tries to be as consistent and modular as possible.
- The main function in "ggplot2" is ggplot().
- The main input to ggplot() is a data frame object.
- You can use the internal function aes() to specify what columns of the data frame will be used for the graphical elements of the plot.
- You must specify what kind of geometric objects or geoms will be displayed: e.g. geom_point(), geom_bar(), geom_boxplot().
- Pretty much anything else that you want to add to your plot is controlled by auxiliary functions, especially those things that have to do with the format, rather than the underlying data.
- The construction of a ggplot is done by adding layers with the + operator.
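The points above can be sketched in a few lines. The example below builds a plot step by step on a small made-up data frame (toy, its values, and the axis labels are assumptions for the illustration), storing the plot in an object before displaying it:

```r
library(ggplot2)

# toy data frame (hypothetical values)
toy <- data.frame(height = c(150, 165, 178), mass = c(49, 75, 55))

# start with the data and the aesthetic mappings ...
p <- ggplot(data = toy, aes(x = height, y = mass))

# ... then add a geom and auxiliary elements with +
p <- p +
  geom_point() +
  labs(x = "height (cm)", y = "mass (kg)", title = "Height and mass")

class(p)   # includes "ggplot"
p          # printing the object draws the plot
```

Because the plot is an ordinary R object, nothing is drawn until it is printed, and more layers can be added to p later.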
Let’s start with a scatterplot of height and mass:
# scatterplot (option 1)
ggplot(data = starwars) +
geom_point(aes(x = height, y = mass))
## Warning: Removed 28 rows containing missing values (geom_point).
- ggplot() creates an object of class "ggplot"
- The main input for ggplot() is data, which must be a data frame
- The "+" operator is used to add a layer
- The geometric objects (geoms) are points: geom_point()
- aes() is used to specify the x and y coordinates, by taking columns height and mass from the data frame
The same scatterplot can also be created with this alternative, and more common, use of ggplot():
# scatterplot (option 2)
ggplot(data = starwars, aes(x = height, y = mass)) +
geom_point()
Say you want to color code the points in terms of gender:
# colored scatterplot
ggplot(data = starwars, aes(x = height, y = mass)) +
geom_point(aes(color = gender))
## Warning: Removed 28 rows containing missing values (geom_point).
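The warning is ggplot2's way of saying that rows with a missing height or mass cannot be drawn. If the missing values are expected, two possible ways to handle them are sketched below: remove the incomplete rows before plotting, or pass na.rm = TRUE to the geom so that they are dropped silently (the object names complete, p1, and p2 are made up for the illustration):

```r
library(dplyr)
library(ggplot2)

# option 1: keep only rows where both height and mass are known
complete <- filter(starwars, !is.na(height) & !is.na(mass))

p1 <- ggplot(data = complete, aes(x = height, y = mass)) +
  geom_point(aes(color = gender))
p1

# option 2: let the geom drop missing values without a warning
p2 <- ggplot(data = starwars, aes(x = height, y = mass)) +
  geom_point(aes(color = gender), na.rm = TRUE)
p2
```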
# your code
Use the data frame fem to make a scatterplot of height and mass.
ggplot(data = fem, aes(x = height, y = mass)) + geom_point()
Make a scatterplot of height and mass, using geom_text() to display the names of the individuals.
ggplot(data = fem, aes(x = height, y = mass)) + geom_text(aes(label = name))
Make a scatterplot of height and mass, for ALL the females, displaying their names with geom_label().
ggplot(data = filter(starwars, sex == "female"), aes(x = height, y = mass)) + geom_label(aes(label = name))
## Warning: Removed 7 rows containing missing values (geom_label).
Make a histogram of mass (for all individuals).
ggplot(data = starwars, aes(x = mass)) + geom_histogram(binwidth = 1)
## Warning: Removed 28 rows containing non-finite values (stat_bin).
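With binwidth = 1 each bar covers a single unit of mass, so most bars contain at most one or two individuals. As a sketch, wider bins give a smoother picture, and leaving out the one extremely heavy individual makes the bulk of the distribution easier to see (the bin width of 10 and the cutoff of 1000 are arbitrary choices for the illustration, as are the names h1 and h2):

```r
library(dplyr)
library(ggplot2)

# wider bins; na.rm = TRUE drops missing masses without a warning
h1 <- ggplot(data = starwars, aes(x = mass)) +
  geom_histogram(binwidth = 10, na.rm = TRUE)
h1

# same histogram, leaving out the extreme value
h2 <- ggplot(data = filter(starwars, mass < 1000), aes(x = mass)) +
  geom_histogram(binwidth = 10)
h2
```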
Make a density plot of height (for all individuals).
ggplot(data = starwars, aes(x = height)) + geom_density(kernel = "gaussian")
## Warning: Removed 6 rows containing non-finite values (stat_density).
Make a barchart of the gender frequencies (for all individuals).
ggplot(data = starwars, aes(x = gender)) + geom_bar()