Unit 4 Notes

*Note: the videos and the lecture notes usually overlap a great deal. In this unit they overlap significantly less, so you absolutely should watch the videos AND read the lecture notes.

Overview

As you learned in the recording of an RStudio webinar, R Markdown is an implementation of Markdown in R that allows you to type in plain text and render in more complex formats like html and pdf.

We will be using R Markdown to generate html for our assignments and the class project. As a reminder, if you haven’t already done so, you need to create an account on RPubs to publish your assignments.

Markdown Basics

John Gruber originally wrote markdown to simplify the workflow of writers who would ultimately format their work as html (usually bloggers). It is exceedingly simple to understand and learn. Rather than explain it here, please go through the Markdown Tutorial created by Garen Torikian and return to these notes after you have completed the brief tutorial.

There is no defined standard for Markdown, and Gruber’s early work has been expanded by multiple authors in multiple directions, leading to fragmentation.

For the most part, we will be using the Pandoc markdown format, which is the default R Markdown format in RStudio. Pandoc extends the initial version of Gruber’s markdown significantly, but much of what we’ll be using in this class you already learned in the markdown tutorial. RStudio also allows for a few different markdown flavors, which we won’t be covering this semester.

Markdown has an interesting cross-section of users. Some bloggers value markdown for distraction-free writing (e.g., no menus, minimal codes, no mouse use). Some academics favor markdown because some flavors support typesetting of mathematical formulas without having to learn all of the intricacies of LaTeX, which is heavily used in the sciences.

R Markdown

In the initial video, I am using vanilla markdown in an editor (on the left) that dynamically renders the markdown (on the right). Below is an image of that.
[Image: Markdown editor]

In RStudio, we don’t dynamically render markdown because the code we insert may make it too cumbersome for a dynamic renderer. The value that R Markdown adds is the ability to include code chunks and execute R commands inside your markdown document, which you can then render to html or pdf and use in a report. In RStudio, it is simple to create a new R Markdown document. Select File --> New File --> R Markdown, then select the type of R Markdown document you want to create (Document) and the output type (html). You can also give the document a title. *Note: this is not the file name; it is the html document title, which appears in the title bar or tab of your browser, is used for bookmarking pages, and shows up in search engine results. Specifically, it renders to the html <title> tag.

[Image: markdown create dialog]

RStudio creates a markdown file with some sample code. There are a few elements in there that should be familiar to you, e.g., the link to RStudio, the bolded word Knit (with the two stars on each end), and the headings (level 2, ##). In the video, I’ll go through the entire sample document, but there are a few things I want to reiterate here in the lecture notes.

embedded code chunks are enclosed with a starting line of three backticks ``` and {r chunkname} and an ending line of three backticks ```
by default, embedded code chunks will show the code and the output of that code.
there are several options, for example, if we want to display the output and not the code, we use {r chunkname, echo = FALSE}
to display the code and not the text output, we use {r chunkname, results = "hide"} (plots are still shown unless you also set fig.show = "hide")
to have the code evaluated, but display neither the code nor the output, we use {r chunkname, include = FALSE}
the chunk name is optional but it is good practice to name your chunks because it makes troubleshooting easier.
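
The options in the list above are set chunk by chunk, but knitr also lets you set defaults once for the whole document. The following is a minimal sketch, assuming you put it in a first chunk of your document (naming that chunk setup is only a convention); the specific values are illustrative and are not requirements for this class.

```r
library(knitr)

# Set document-wide defaults. Individual chunks can still override these
# in their own {r ...} header. The values below are illustrative only.
opts_chunk$set(echo = FALSE, fig.width = 6, fig.height = 4)

# Inspect the defaults that are now in effect
opts_chunk$get("echo")
opts_chunk$get("fig.width")
```

With these defaults in place, a chunk that should show its code would override the default with {r chunkname, echo = TRUE}.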

If you want to look at some more chunk options, reference the R Markdown Reference Guide and see the knitr chunk options section (R Markdown uses the knitr library).

Embedding plots

To date, we haven’t really created visualizations in R. There are many packages that extend the base graphics available to us in R. We will be using the ggvis package for most of this semester – it will also be prominently used in the visualization course.

Embedding plots in R Markdown is pretty straightforward but there are a couple of things we need to consider:

What format is my final report going to be output to (e.g., html, pdf, Word, ioslides, etc.)?
How are my readers going to consume my report (desktop/laptop, mobile device, printed)?

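For this unit, we’ll just be working with html because it is the most dynamic form of output. For figures, we might want to set the fig.width and fig.height chunk options (measurements are in inches). If we don’t, knitr’s own default is seven for both, although an output format can supply its own defaults (for example, html_document uses a width of 7 and a height of 5). In the image below, I have fig.width=6 and fig.height=8.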

library(ggvis)
library(dplyr)
mtcars %>% ggvis(~wt, ~mpg) %>% layer_points()

We’ll be working with ggvis more throughout the semester. One thing you’ll come across is that ggvis can create interactive graphics, but the figure above isn’t very interactive. It only allows us to resize the plot (by dragging the bottom-right corner), change the way the image is rendered, or download the image (using the gear at the top right).

We created the visualization above by piping the data (mtcars) to ggvis and telling it that we want to plot the variables wt on the x-axis and mpg on the y-axis. layer_points() actually creates the plot and tells ggvis to use points as the visual encoding, thus making a scatterplot. We’ll cover more ggvis in the next unit.

Prettier tables

There are some decent formatting options for tables in R Markdown. Below are three different versions of the same table:

Default

output <- head(mtcars %>% select(wt, mpg), 10)
output

##                      wt  mpg
## Mazda RX4         2.620 21.0
## Mazda RX4 Wag     2.875 21.0
## Datsun 710        2.320 22.8
## Hornet 4 Drive    3.215 21.4
## Hornet Sportabout 3.440 18.7
## Valiant           3.460 18.1
## Duster 360        3.570 14.3
## Merc 240D         3.190 24.4
## Merc 230          3.150 22.8
## Merc 280          3.440 19.2

kable

library(knitr)
kable(output, digits = 2)

	wt	mpg
Mazda RX4	2.62	21.0
Mazda RX4 Wag	2.88	21.0
Datsun 710	2.32	22.8
Hornet 4 Drive	3.21	21.4
Hornet Sportabout	3.44	18.7
Valiant	3.46	18.1
Duster 360	3.57	14.3
Merc 240D	3.19	24.4
Merc 230	3.15	22.8
Merc 280	3.44	19.2

pander

library(pander)
panderOptions('round', 2)
set.caption("mtcars weight and mpg")
pander(output)

mtcars weight and mpg
	wt	mpg
Mazda RX4	2.62	21
Mazda RX4 Wag	2.88	21
Datsun 710	2.32	22.8
Hornet 4 Drive	3.21	21.4
Hornet Sportabout	3.44	18.7
Valiant	3.46	18.1
Duster 360	3.57	14.3
Merc 240D	3.19	24.4
Merc 230	3.15	22.8
Merc 280	3.44	19.2

There are several other table formatting packages that we aren’t going to cover but to summarize:

default tables aren’t too pretty
kable is part of knitr and is a simple way to make prettier tables
pander has more options than kable, but is more complex
there are a variety of other packages that might suit your specific need (e.g., xtable, htmlTable, etc.), but for this class pander and kable should have you covered.
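
As a small extension of the kable example above, kable can also relabel the columns and attach a caption. This is a sketch rather than something from the video: the column labels and caption text are illustrative, and the table is built with base R subsetting so the snippet runs without dplyr.

```r
library(knitr)

# Same ten rows as the tables above, built with base R subsetting
output <- head(mtcars[, c("wt", "mpg")], 10)

# col.names and caption are kable() arguments; the labels are illustrative
tbl <- kable(output, digits = 2,
             col.names = c("Weight (1000 lbs)", "Miles per gallon"),
             caption = "mtcars weight and mpg")
tbl
```

Inside an R Markdown document, the table is formatted for whatever output type you knit to.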

Knitting

At any time, you can knit your R Markdown file. In RStudio, you can use the Knit or KnitHTML button to specify if you want to knit to html, pdf, or a Microsoft Word document. What actually happens behind the scenes is that RStudio uses the rmarkdown package to render the output in your specified format. The video will have more detailed use of R Markdown.
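
To make the "behind the scenes" step concrete, here is a minimal sketch of calling the rmarkdown package directly instead of pressing the Knit button. It assumes the rmarkdown package and pandoc are installed (RStudio bundles pandoc); the temporary file, the document title, and the chunk name mpg-summary are made up for this illustration.

```r
library(rmarkdown)

# Build the three-backtick fence programmatically so this example
# does not contain a literal fence line of its own
fence <- strrep("`", 3)

# Write a tiny R Markdown document to a temporary file (illustrative content)
rmd_file <- tempfile(fileext = ".Rmd")
writeLines(c(
  "---",
  "title: \"Render demo\"",
  "output: html_document",
  "---",
  "",
  paste0(fence, "{r mpg-summary, echo = FALSE}"),
  "summary(mtcars$mpg)",
  fence
), rmd_file)

# render() needs pandoc; skip the call if pandoc cannot be found
html_file <- if (pandoc_available()) {
  render(rmd_file, output_format = "html_document", quiet = TRUE)
} else {
  NA_character_
}
html_file
```

When pandoc is available, html_file holds the path of the generated html file, which you can open in a browser.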

dplyr verbs

You’ve already been introduced to dplyr, but it is often difficult to get used to asking questions in dplyr instead of plain English. One way to help improve your thought process is to understand the verbs of dplyr and their purpose.

select chooses specific columns.
rename renames specific columns and keeps all the others.
filter chooses specific rows.
arrange sorts rows.
mutate creates new columns.
transmute is like mutate but doesn’t keep your old columns.
distinct returns unique rows.
summarize aggregates rows, collapsing each group into a single summary row.
slice selects rows by position.
sample (sample_n and sample_frac) takes random samples of rows (seldom used).

We won’t be discussing sample as it is more commonly used in the sciences, but the other verbs are all commonly used. The other two key non-verb actions in dplyr are group_by, which is typically applied when using summarize, and the pipe operator %>%, which is used to combine verbs. I give a better visual representation of the queries below in the video, but let’s start by reading in a csv file from a url and having a quick look at it.

got <- read.csv(url("http://jamessuleiman.com/mba676/assets/units/unit4/got.csv"),
                stringsAsFactors = FALSE)
got

##     lastname firstname          major      year  gpa
## 1       Snow      John Nordic Studies    Junior 3.23
## 2  Lannister    Tyrion Communications Sophomore 3.83
## 3  Targaryen  Daenerys        Zoology  Freshman 3.36
## 4     Bolton    Ramsay        Phys Ed  Freshman 2.24
## 5      Stark    Eddard        History    Senior 2.78
## 6    Clegane    Gregor        Phys Ed Sophomore 3.23
## 7    Baelish     Peter Communications  Freshman 2.84
## 8  Baratheon   Joffrey        History  Freshman 1.87
## 9      Drogo      Khal        Zoology    Senior 3.38
## 10     Tarly   Samwise Nordic Studies  Freshman 2.39

We have already discussed the pipe operator %>% in the last unit. If we wanted to choose specific columns (e.g., lastname, firstname, gpa), we would use select. Within select, I can also change column names. Please note: I am not storing the results of these queries in any variables; I am sending them directly to output (i.e., printing them out).

got %>% select(surname = lastname, firstname, gpa)

##      surname firstname  gpa
## 1       Snow      John 3.23
## 2  Lannister    Tyrion 3.83
## 3  Targaryen  Daenerys 3.36
## 4     Bolton    Ramsay 2.24
## 5      Stark    Eddard 2.78
## 6    Clegane    Gregor 3.23
## 7    Baelish     Peter 2.84
## 8  Baratheon   Joffrey 1.87
## 9      Drogo      Khal 3.38
## 10     Tarly   Samwise 2.39

We can also use rename to change column names. Unlike select, it keeps all the columns in the data frame. So if I wanted to show the entire data frame using the more formal surname instead of lastname, I could do the following without having to specify all of the names in select.

got %>% rename(surname = lastname)

##      surname firstname          major      year  gpa
## 1       Snow      John Nordic Studies    Junior 3.23
## 2  Lannister    Tyrion Communications Sophomore 3.83
## 3  Targaryen  Daenerys        Zoology  Freshman 3.36
## 4     Bolton    Ramsay        Phys Ed  Freshman 2.24
## 5      Stark    Eddard        History    Senior 2.78
## 6    Clegane    Gregor        Phys Ed Sophomore 3.23
## 7    Baelish     Peter Communications  Freshman 2.84
## 8  Baratheon   Joffrey        History  Freshman 1.87
## 9      Drogo      Khal        Zoology    Senior 3.38
## 10     Tarly   Samwise Nordic Studies  Freshman 2.39

If I wanted to filter the results above to show only rows with a gpa greater than or equal to 3.5, I would pipe the results to filter to choose those specific rows.

got %>% rename(surname = lastname) %>% filter(gpa >= 3.5)

##     surname firstname          major      year  gpa
## 1 Lannister    Tyrion Communications Sophomore 3.83

If, instead, I just wanted to sort the rows from highest to lowest gpa, I would use arrange.

got %>% rename(surname = lastname) %>% arrange(desc(gpa))

##      surname firstname          major      year  gpa
## 1  Lannister    Tyrion Communications Sophomore 3.83
## 2      Drogo      Khal        Zoology    Senior 3.38
## 3  Targaryen  Daenerys        Zoology  Freshman 3.36
## 4       Snow      John Nordic Studies    Junior 3.23
## 5    Clegane    Gregor        Phys Ed Sophomore 3.23
## 6    Baelish     Peter Communications  Freshman 2.84
## 7      Stark    Eddard        History    Senior 2.78
## 8      Tarly   Samwise Nordic Studies  Freshman 2.39
## 9     Bolton    Ramsay        Phys Ed  Freshman 2.24
## 10 Baratheon   Joffrey        History  Freshman 1.87

Suppose I wanted to create a dean’s list column called dlist and set it to TRUE if the gpa >= 3.5 and FALSE otherwise. I would use mutate for that. Note: in this example, the column is only created in the output and the data frame is unaltered.

got %>% rename(surname = lastname) %>% mutate(dlist = gpa >= 3.5)

##      surname firstname          major      year  gpa dlist
## 1       Snow      John Nordic Studies    Junior 3.23 FALSE
## 2  Lannister    Tyrion Communications Sophomore 3.83  TRUE
## 3  Targaryen  Daenerys        Zoology  Freshman 3.36 FALSE
## 4     Bolton    Ramsay        Phys Ed  Freshman 2.24 FALSE
## 5      Stark    Eddard        History    Senior 2.78 FALSE
## 6    Clegane    Gregor        Phys Ed Sophomore 3.23 FALSE
## 7    Baelish     Peter Communications  Freshman 2.84 FALSE
## 8  Baratheon   Joffrey        History  Freshman 1.87 FALSE
## 9      Drogo      Khal        Zoology    Senior 3.38 FALSE
## 10     Tarly   Samwise Nordic Studies  Freshman 2.39 FALSE
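
The note above about the data frame being unaltered can be checked directly. The sketch below rebuilds only the lastname and gpa columns of got by hand (so it runs without downloading the csv) and introduces a hypothetical variable name, got_dlist, to hold the stored result.

```r
library(dplyr)

# Two columns of the got data, typed in from the printout above
got <- data.frame(
  lastname = c("Snow", "Lannister", "Targaryen", "Bolton", "Stark",
               "Clegane", "Baelish", "Baratheon", "Drogo", "Tarly"),
  gpa = c(3.23, 3.83, 3.36, 2.24, 2.78, 3.23, 2.84, 1.87, 3.38, 2.39),
  stringsAsFactors = FALSE
)

# Printing only: the new column appears in the output, got is unchanged
got %>% mutate(dlist = gpa >= 3.5)
"dlist" %in% names(got)        # FALSE

# Assigning the result keeps the new column in a new object
got_dlist <- got %>% mutate(dlist = gpa >= 3.5)
"dlist" %in% names(got_dlist)  # TRUE
```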

If I just wanted to show my transformed variables and no other variables, I could use transmute.

got %>% transmute(name = paste(firstname, lastname), dlist = gpa >= 3.5)

##                  name dlist
## 1           John Snow FALSE
## 2    Tyrion Lannister  TRUE
## 3  Daenerys Targaryen FALSE
## 4       Ramsay Bolton FALSE
## 5        Eddard Stark FALSE
## 6      Gregor Clegane FALSE
## 7       Peter Baelish FALSE
## 8   Joffrey Baratheon FALSE
## 9          Khal Drogo FALSE
## 10      Samwise Tarly FALSE

If we wanted to list the majors represented in the got data frame, we would use distinct, which restricts the output to unique (distinct) rows.

got %>% distinct(major)

##            major
## 1 Nordic Studies
## 2 Communications
## 3        Zoology
## 4        Phys Ed
## 5        History

Aggregation often adds the most complexity to a query, and it is quite common to see summarize combined with group_by. For example, if we wanted to show the average gpa for each major, we would use group_by to declare that we are doing a calculation for each major and use summarize to define the mean calculation. You’ll notice that instead of a data frame, we are outputting a tibble, which is essentially an enhanced data frame that can store more complex data.

got %>% group_by(major) %>% summarize(average_gpa = mean(gpa))

## # A tibble: 5 x 2
##            major average_gpa
##            <chr>       <dbl>
## 1 Communications       3.335
## 2        History       2.325
## 3 Nordic Studies       2.810
## 4        Phys Ed       2.735
## 5        Zoology       3.370
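
summarize is not limited to one calculation per group. As a sketch of the same group_by pattern, the example below computes three summaries at once; the column names students, average_gpa, and top_gpa are illustrative, and the major and gpa columns are typed in from the got printout so the snippet runs on its own.

```r
library(dplyr)

# major and gpa columns of the got data, typed in from the printout above
got <- data.frame(
  major = c("Nordic Studies", "Communications", "Zoology", "Phys Ed",
            "History", "Phys Ed", "Communications", "History",
            "Zoology", "Nordic Studies"),
  gpa = c(3.23, 3.83, 3.36, 2.24, 2.78, 3.23, 2.84, 1.87, 3.38, 2.39),
  stringsAsFactors = FALSE
)

# One row per major, with several aggregates computed in a single summarize
major_summary <- got %>%
  group_by(major) %>%
  summarize(students = n(),            # number of rows in the group
            average_gpa = mean(gpa),
            top_gpa = max(gpa))
major_summary
```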

Suppose we wanted to show the name of the student with the highest gpa for each major. We could do this in a few different ways. In all cases, since we are doing it for each major, we will be using group_by(major). In the first case, after grouping, we sort in descending gpa order and slice out the first (1) row of each group, i.e., the top student in each major.

got %>% group_by(major) %>% arrange(desc(gpa)) %>% slice(1)

## Source: local data frame [5 x 5]
## Groups: major [5]
## 
##    lastname firstname          major      year   gpa
##       <chr>     <chr>          <chr>     <chr> <dbl>
## 1 Lannister    Tyrion Communications Sophomore  3.83
## 2     Stark    Eddard        History    Senior  2.78
## 3      Snow      John Nordic Studies    Junior  3.23
## 4   Clegane    Gregor        Phys Ed Sophomore  3.23
## 5     Drogo      Khal        Zoology    Senior  3.38

In the second case, we decide we want to use the top_n function.

got %>% group_by(major) %>% arrange(desc(gpa)) %>% top_n(1)

## Selecting by gpa

## Source: local data frame [5 x 5]
## Groups: major [5]
## 
##    lastname firstname          major      year   gpa
##       <chr>     <chr>          <chr>     <chr> <dbl>
## 1 Lannister    Tyrion Communications Sophomore  3.83
## 2     Drogo      Khal        Zoology    Senior  3.38
## 3      Snow      John Nordic Studies    Junior  3.23
## 4   Clegane    Gregor        Phys Ed Sophomore  3.23
## 5     Stark    Eddard        History    Senior  2.78

In the third case, we use the min_rank function within filter.

got %>% group_by(major) %>% filter(min_rank(desc(gpa)) == 1)

## Source: local data frame [5 x 5]
## Groups: major [5]
## 
##    lastname firstname          major      year   gpa
##       <chr>     <chr>          <chr>     <chr> <dbl>
## 1      Snow      John Nordic Studies    Junior  3.23
## 2 Lannister    Tyrion Communications Sophomore  3.83
## 3     Stark    Eddard        History    Senior  2.78
## 4   Clegane    Gregor        Phys Ed Sophomore  3.23
## 5     Drogo      Khal        Zoology    Senior  3.38

This may seem somewhat confusing, so it is best to describe what is going on here. top_n, which we introduced last week, is an easier-to-use “wrapper” function that combines filter and min_rank. slice was added later to dplyr to make it simpler to select more than just the top rows. For example, if I wanted to select positions 2 through 4, I would use slice(2:4). There is no equivalent top_n for this, and I would end up resorting to the harder-to-follow filter(min_rank(...) %in% 2:4). To simplify, you should try to get comfortable with slice, but feel free to use top_n as well.
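
The two ways of picking positions 2 through 4 can be compared side by side. This is a sketch on a hand-typed copy of two got columns; the variable names by_position and by_rank are made up for the illustration. One thing worth noticing is that the two approaches treat ties differently: slice works on row positions, while min_rank gives tied gpa values the same rank.

```r
library(dplyr)

# Two columns of the got data, typed in from the printout earlier in the notes
got <- data.frame(
  lastname = c("Snow", "Lannister", "Targaryen", "Bolton", "Stark",
               "Clegane", "Baelish", "Baratheon", "Drogo", "Tarly"),
  gpa = c(3.23, 3.83, 3.36, 2.24, 2.78, 3.23, 2.84, 1.87, 3.38, 2.39),
  stringsAsFactors = FALSE
)

# By position: sort first, then take rows 2 through 4
by_position <- got %>% arrange(desc(gpa)) %>% slice(2:4)

# By rank: keep every row whose gpa rank is 2, 3, or 4
by_rank <- got %>% filter(min_rank(desc(gpa)) %in% 2:4)

by_position
by_rank
```

Because two students share a gpa of 3.23 (rank 4), by_rank returns four rows while by_position returns exactly three.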

dplyr window functions

As defined in the dplyr documentation, a window function is a variation on an aggregation function. Where an aggregation function, like sum() or mean(), takes n inputs and returns a single value, a window function returns n values.

The window functions we’ll be dealing with in this class are often ranking functions (like min_rank()) and offset functions (like lag() which was introduced in the last unit). If you have ever worked with relational databases, window functions are commonly implemented in SQL.

The ranking and ordering functions you may use in dplyr are:

row_number()
min_rank() which allows for gaps in ranks (e.g., if two rows are tied for first, the next rank is third)
dense_rank() which doesn’t allow for gaps in ranks (e.g., if two rows are tied for first, the next rank is second)
percent_rank() a number between 0 and 1 computed by rescaling min_rank to [0, 1].
cume_dist() a cumulative distribution function. Proportion of all values less than or equal to the current rank.
ntile() a rough rank, which breaks the input vector into n buckets

If you look at a ranking of gpas in the got data, 3.23 is tied for fourth place. The next value, 2.84, which is the 6th row in the arranged data frame, would be in sixth place using min_rank() and in fifth place using dense_rank().

got %>% filter(row_number(desc(gpa)) == 6)

##   lastname firstname          major     year  gpa
## 1  Baelish     Peter Communications Freshman 2.84

got %>% filter(min_rank(desc(gpa)) == 6)

##   lastname firstname          major     year  gpa
## 1  Baelish     Peter Communications Freshman 2.84

got %>% filter(dense_rank(desc(gpa)) == 5)

##   lastname firstname          major     year  gpa
## 1  Baelish     Peter Communications Freshman 2.84

We could also use the slice verb to accomplish the same thing.

got %>% arrange(desc(gpa)) %>% slice(6)

##   lastname firstname          major     year  gpa
## 1  Baelish     Peter Communications Freshman 2.84
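
Instead of filtering on one rank at a time, it can help to see the three ranking functions next to each other. The sketch below adds them as columns with mutate; the column names row_num, min_rnk, and dense_rnk are illustrative, and the data is a hand-typed copy of two got columns.

```r
library(dplyr)

# Two columns of the got data, typed in from the printout earlier in the notes
got <- data.frame(
  lastname = c("Snow", "Lannister", "Targaryen", "Bolton", "Stark",
               "Clegane", "Baelish", "Baratheon", "Drogo", "Tarly"),
  gpa = c(3.23, 3.83, 3.36, 2.24, 2.78, 3.23, 2.84, 1.87, 3.38, 2.39),
  stringsAsFactors = FALSE
)

# Rank from highest to lowest gpa with each ranking function
ranks <- got %>%
  arrange(desc(gpa)) %>%
  mutate(row_num   = row_number(desc(gpa)),  # ties broken by order, no repeats
         min_rnk   = min_rank(desc(gpa)),    # ties share a rank, then a gap
         dense_rnk = dense_rank(desc(gpa)))  # ties share a rank, no gap
ranks
```

The two students with a gpa of 3.23 are where the three columns start to disagree.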

We’ll add the columns p_rank, cdist and ntile to show you how the remaining ranking functions work. We’ll use four buckets for ntile().

got %>% select(lastname, firstname, gpa) %>% arrange(desc(gpa)) %>% 
  mutate(p_rank = percent_rank(gpa), cdist = cume_dist(gpa), 
         ntile = ntile(gpa, 4))

##     lastname firstname  gpa    p_rank cdist ntile
## 1  Lannister    Tyrion 3.83 1.0000000   1.0     4
## 2      Drogo      Khal 3.38 0.8888889   0.9     4
## 3  Targaryen  Daenerys 3.36 0.7777778   0.8     3
## 4       Snow      John 3.23 0.5555556   0.7     3
## 5    Clegane    Gregor 3.23 0.5555556   0.7     3
## 6    Baelish     Peter 2.84 0.4444444   0.5     2
## 7      Stark    Eddard 2.78 0.3333333   0.4     2
## 8      Tarly   Samwise 2.39 0.2222222   0.3     1
## 9     Bolton    Ramsay 2.24 0.1111111   0.2     1
## 10 Baratheon   Joffrey 1.87 0.0000000   0.1     1
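
The p_rank and cdist columns can be reproduced by hand from the definitions in the list of ranking functions. This is a sketch using a plain vector of the ten gpa values; manual_p_rank and manual_cdist are names made up for the illustration. ntile is left out because how the ten rows are split across four buckets may differ between dplyr versions.

```r
library(dplyr)

# The ten gpa values from the got data
gpa <- c(3.23, 3.83, 3.36, 2.24, 2.78, 3.23, 2.84, 1.87, 3.38, 2.39)
n <- length(gpa)

# percent_rank: min_rank rescaled so the lowest value is 0 and the highest is 1
manual_p_rank <- (min_rank(gpa) - 1) / (n - 1)

# cume_dist: proportion of values less than or equal to each value
manual_cdist <- sapply(gpa, function(v) mean(gpa <= v))

# Both comparisons should report TRUE
all.equal(manual_p_rank, percent_rank(gpa))
all.equal(manual_cdist, cume_dist(gpa))
```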

The offset functions you may use in dplyr are:

lag() returns the previous value in the vector - introduced in the last unit.
lead() returns the next value in a vector - the opposite of lag()

If we wanted to know the gpa of the next-better student (using lag()) and the next-worse student (using lead()), we would use:

got %>% arrange(desc(gpa)) %>% mutate(nxt_better = lag(gpa), nxt_worst = lead(gpa))

##     lastname firstname          major      year  gpa nxt_better nxt_worst
## 1  Lannister    Tyrion Communications Sophomore 3.83         NA      3.38
## 2      Drogo      Khal        Zoology    Senior 3.38       3.83      3.36
## 3  Targaryen  Daenerys        Zoology  Freshman 3.36       3.38      3.23
## 4       Snow      John Nordic Studies    Junior 3.23       3.36      3.23
## 5    Clegane    Gregor        Phys Ed Sophomore 3.23       3.23      2.84
## 6    Baelish     Peter Communications  Freshman 2.84       3.23      2.78
## 7      Stark    Eddard        History    Senior 2.78       2.84      2.39
## 8      Tarly   Samwise Nordic Studies  Freshman 2.39       2.78      2.24
## 9     Bolton    Ramsay        Phys Ed  Freshman 2.24       2.39      1.87
## 10 Baratheon   Joffrey        History  Freshman 1.87       2.24        NA
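
Offset functions become more useful once you do arithmetic with them. As a sketch, the example below computes how far each student is from the next-better and next-worse gpa; the column names gap_to_next_better and gap_to_next_worse are illustrative, and the data is a hand-typed copy of two got columns.

```r
library(dplyr)

# Two columns of the got data, typed in from the printout earlier in the notes
got <- data.frame(
  lastname = c("Snow", "Lannister", "Targaryen", "Bolton", "Stark",
               "Clegane", "Baelish", "Baratheon", "Drogo", "Tarly"),
  gpa = c(3.23, 3.83, 3.36, 2.24, 2.78, 3.23, 2.84, 1.87, 3.38, 2.39),
  stringsAsFactors = FALSE
)

# After sorting, lag() looks one row up and lead() looks one row down
gaps <- got %>%
  arrange(desc(gpa)) %>%
  mutate(gap_to_next_better = lag(gpa) - gpa,
         gap_to_next_worse  = gpa - lead(gpa))
gaps
```

The first row has no better student and the last row has no worse one, so those gaps are NA.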

We’ve covered a good portion of dplyr and most of what you’ll be using for the remainder of the semester. Time to help solidify your querying skills with some practice questions.

Practice questions

Starting with this unit, we are at a point where I’m going to give you a few practice (i.e., no stakes) questions where the code is folded (hidden). You should try your hardest to replicate the output but if you get stuck, you can always click on the code button to see the code.