*Note: there is typically a great deal of overlap between the video and the lecture notes. In this unit, there is significantly less and you absolutely should watch the videos AND read the lecture notes.
As you learned in the recording of an RStudio webinar, R Markdown is an implementation of Markdown in R that allows you to type in plain text and render in more complex formats like html and pdf.
We will be using R Markdown to generate html for our assignments and the class project. As a reminder, if you haven’t already done so, you need to create an account on RPubs to publish your assignments.
John Gruber originally wrote markdown to simplify the writing workflow for writers that would ultimately format their work as html (usually bloggers). It is exceedingly simple to understand and learn. Rather than explain it here, please go through the Markdown Tutorial created by Garen Torikian and return to these notes after you have completed the brief tutorial.
There is no defined standard for Markdown and Gruber’s early work has been expanded by multiple authors in multiple directions leading to fragmentation.
For the most part, we will be using the Pandoc markdown format, which is the default R Markdown format in RStudio. Pandoc extends the initial version of Gruber’s markdown significantly, but much of what we’ll be using in this class you already learned in the markdown tutorial. RStudio also allows for a few different markdown flavors, which we won’t be covering this semester. Markdown has an interesting cross-section of users. Some bloggers value markdown for distraction-free writing (e.g., no menus, minimal codes, no mouse use). Some academics favor markdown because certain flavors support typesetting of mathematical formulas without having to learn all of the intricacies of LaTeX, which is heavily used in the sciences.
In the initial video, I am using vanilla markdown in an editor (on the left) that dynamically renders the markdown (on the right). Below is an image of that.
In RStudio, we don’t dynamically render markdown because the code we insert may make it too cumbersome for a dynamic renderer. The value that is added in R Markdown is the ability to add code chunks and execute R commands inside of your markdown document that you can then render to html or pdf and use in a report. In RStudio, it is simple to create a new R Markdown document. Select File --> New File --> R Markdown, then select the type of R Markdown document you want to create (Document) and the output type (html), and give the document a title. Note: this is not the file name; it is the html document title, which appears in your browser’s title bar or tab, is used when bookmarking the page, and shows up in search engine results. Specifically, it renders to the html <title> tag.
markdown create dialog
RStudio creates a markdown file with some sample code. There are a few elements in there that should be familiar to you (e.g., the link to RStudio, the bolded word Knit (with the two stars on each end), and the level 2 headings (##)). In the video, I’ll go through the entire sample document but there are a few things I want to reiterate here in the lecture notes.
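- A code chunk begins with a line of three backticks followed by {r chunkname} and ends with a line of three backticks.
- {r chunkname, echo = FALSE} runs the code and shows its results but hides the code itself.
- {r chunkname, results = "hide"} shows the code but hides its printed results.
- {r chunkname, include = FALSE} runs the code but includes neither the code nor its results in the output.

If you want to look at some more chunk options, reference the R Markdown Reference Guide and see the knitr chunk options section (R Markdown uses the knitr library).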
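Chunk options can also be set once for the whole document instead of on every chunk header. The sketch below assumes the knitr package is installed and uses its opts_chunk object; the choice of echo = FALSE is only an illustration, and any individual chunk can still override it in its own {r ...} header.

```r
library(knitr)

# Set a default for every chunk that follows in the document.
# A single chunk can still override it, e.g. {r chunkname, echo = TRUE}.
opts_chunk$set(echo = FALSE)

# Inspect the default that is now in effect.
opts_chunk$get("echo")
```

A call like this usually lives in the first chunk of the document so that it applies to everything after it.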
To date, we haven’t really created visualizations in R. There are many packages that extend the base graphics available to us in R. We will be using the ggvis package for most of this semester – it will also be prominently used in the visualization course.
Embedding plots in R Markdown is pretty straightforward but there are a couple of things we need to consider:
For this unit, we’ll just be working with html because it is the most dynamic form of output. For figures, we might want to set the fig.width and fig.height. If we don’t, they will default to seven (measurements are in inches). In the image below, I have fig.width=6 and fig.height=8.
library(ggvis)
library(dplyr)
mtcars %>% ggvis(~wt, ~mpg) %>% layer_points()
We’ll be working with ggvis more throughout the semester, and one thing you’ll come across is that it has the ability to create interactive graphics. The figure above isn’t very interactive. It only allows us to change the size (by grabbing the bottom right corner), change the way the image is rendered, or download the image (gear icon, top right).
We created the visualization above by piping the data (mtcars) to ggvis and telling it that we want to plot the variables wt on the x-axis and mpg on the y-axis. layer_points() actually creates the plot and tells ggvis to use points as the visual encoding, thus making a scatterplot. We’ll cover more ggvis in the next unit.
There are some decent formatting options for tables in R Markdown. Below are three different versions of the same table:
Default
output <- head(mtcars %>% select(wt, mpg), 10)
output
## wt mpg
## Mazda RX4 2.620 21.0
## Mazda RX4 Wag 2.875 21.0
## Datsun 710 2.320 22.8
## Hornet 4 Drive 3.215 21.4
## Hornet Sportabout 3.440 18.7
## Valiant 3.460 18.1
## Duster 360 3.570 14.3
## Merc 240D 3.190 24.4
## Merc 230 3.150 22.8
## Merc 280 3.440 19.2
kable
library(knitr)
kable(output, digits = 2)
| | wt | mpg |
|---|---|---|
| Mazda RX4 | 2.62 | 21.0 |
| Mazda RX4 Wag | 2.88 | 21.0 |
| Datsun 710 | 2.32 | 22.8 |
| Hornet 4 Drive | 3.21 | 21.4 |
| Hornet Sportabout | 3.44 | 18.7 |
| Valiant | 3.46 | 18.1 |
| Duster 360 | 3.57 | 14.3 |
| Merc 240D | 3.19 | 24.4 |
| Merc 230 | 3.15 | 22.8 |
| Merc 280 | 3.44 | 19.2 |
pander
library(pander)
panderOptions('round', 2)
set.caption("mtcars weight and mpg")
pander(output)
| | wt | mpg |
|---|---|---|
| Mazda RX4 | 2.62 | 21 |
| Mazda RX4 Wag | 2.88 | 21 |
| Datsun 710 | 2.32 | 22.8 |
| Hornet 4 Drive | 3.21 | 21.4 |
| Hornet Sportabout | 3.44 | 18.7 |
| Valiant | 3.46 | 18.1 |
| Duster 360 | 3.57 | 14.3 |
| Merc 240D | 3.19 | 24.4 |
| Merc 230 | 3.15 | 22.8 |
| Merc 280 | 3.44 | 19.2 |
There are several other table formatting packages that we aren’t going to cover but to summarize:
- kable is part of knitr and is a simple way to make prettier tables
- pander has more options than kable, but is more complex
- There are other packages (xtable, htmltables, etc.) but for this class pander and kable should have you covered.

At any time, you can knit your R Markdown file. In RStudio, you can use the Knit or KnitHTML button to specify if you want to knit to html, pdf, or a Microsoft Word document. What actually happens behind the scenes is that RStudio uses the rmarkdown package to render the output in your specified format. The video will have more detailed use of R Markdown.
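To make the "behind the scenes" step concrete, here is a rough sketch of rendering a document from the R console with the rmarkdown package rather than the Knit button. The file contents, the chunk name demo, and the temporary file are made up for illustration; render() also needs pandoc, which RStudio bundles, so the call is guarded.

```r
library(rmarkdown)

# Build a tiny, hypothetical R Markdown file in a temporary location.
fence <- strrep("`", 3)  # three backticks, assembled here for readability
rmd_path <- tempfile(fileext = ".Rmd")
writeLines(c(
  "---",
  "title: \"Render demo\"",
  "output: html_document",
  "---",
  "",
  "## A heading",
  "",
  paste0(fence, "{r demo, echo = FALSE}"),
  "summary(mtcars$mpg)",
  fence
), rmd_path)

# render() requires pandoc; only call it when pandoc can be found.
if (pandoc_available()) {
  html_path <- render(rmd_path, output_format = "html_document", quiet = TRUE)
  print(html_path)  # path of the generated html file
}
```

Swapping output_format for "pdf_document" or "word_document" corresponds to the other choices on the Knit button, though pdf output additionally needs a LaTeX installation.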
You’ve already been introduced to dplyr but it is often difficult to get used to asking questions in dplyr instead of plain English. One way to help improve your thought process is to understand the verbs of dplyr and their purpose.

- select chooses specific columns.
- rename renames specific columns and selects all.
- filter chooses specific rows.
- arrange sorts rows.
- mutate creates new columns.
- transmute is like mutate but doesn’t keep your old columns.
- distinct returns unique rows.
- summarize aggregates or chunks.
- slice selects rows by position.
- sample takes samples of data (seldom used).

We won’t be discussing sample as it is more commonly used in the sciences, but the other verbs are all commonly used. The other two key non-verb actions in dplyr are group_by, which is typically applied when using summarize, and the pipe operator %>%, which is used to combine verbs. I give a better visual representation of the queries below in the video, but let’s start by reading in a csv file from a url and having a quick look at it.
got <- read.csv(url("http://jamessuleiman.com/mba676/assets/units/unit4/got.csv"),
stringsAsFactors = FALSE)
got
## lastname firstname major year gpa
## 1 Snow John Nordic Studies Junior 3.23
## 2 Lannister Tyrion Communications Sophomore 3.83
## 3 Targaryen Daenerys Zoology Freshman 3.36
## 4 Bolton Ramsay Phys Ed Freshman 2.24
## 5 Stark Eddard History Senior 2.78
## 6 Clegane Gregor Phys Ed Sophomore 3.23
## 7 Baelish Peter Communications Freshman 2.84
## 8 Baratheon Joffrey History Freshman 1.87
## 9 Drogo Khal Zoology Senior 3.38
## 10 Tarly Samwise Nordic Studies Freshman 2.39
We have already discussed the pipe operator %>% in the last unit. We can use select to choose specific columns (e.g., lastname, firstname, gpa). Within select, I can also change column names. Please note: I am not storing the results of these queries in any variables; I am sending them directly to output (i.e., printing them out).
got %>% select(surname = lastname, firstname, gpa)
## surname firstname gpa
## 1 Snow John 3.23
## 2 Lannister Tyrion 3.83
## 3 Targaryen Daenerys 3.36
## 4 Bolton Ramsay 2.24
## 5 Stark Eddard 2.78
## 6 Clegane Gregor 3.23
## 7 Baelish Peter 2.84
## 8 Baratheon Joffrey 1.87
## 9 Drogo Khal 3.38
## 10 Tarly Samwise 2.39
We can use rename to change column names…it selects all the columns in the data frame. So if I wanted to show the entire data frame using the more formal surname instead of lastname, I could do the following without having to specify all of the names in select.
got %>% rename(surname = lastname)
## surname firstname major year gpa
## 1 Snow John Nordic Studies Junior 3.23
## 2 Lannister Tyrion Communications Sophomore 3.83
## 3 Targaryen Daenerys Zoology Freshman 3.36
## 4 Bolton Ramsay Phys Ed Freshman 2.24
## 5 Stark Eddard History Senior 2.78
## 6 Clegane Gregor Phys Ed Sophomore 3.23
## 7 Baelish Peter Communications Freshman 2.84
## 8 Baratheon Joffrey History Freshman 1.87
## 9 Drogo Khal Zoology Senior 3.38
## 10 Tarly Samwise Nordic Studies Freshman 2.39
If I wanted to filter the results above to just show gpa’s that are greater than or equal to 3.5, I would pipe the results to filter to choose those specific rows.
got %>% rename(surname = lastname) %>% filter(gpa >= 3.5)
## surname firstname major year gpa
## 1 Lannister Tyrion Communications Sophomore 3.83
If, instead, I just wanted to sort the rows from highest to lowest gpa, I would use arrange.
got %>% rename(surname = lastname) %>% arrange(desc(gpa))
## surname firstname major year gpa
## 1 Lannister Tyrion Communications Sophomore 3.83
## 2 Drogo Khal Zoology Senior 3.38
## 3 Targaryen Daenerys Zoology Freshman 3.36
## 4 Snow John Nordic Studies Junior 3.23
## 5 Clegane Gregor Phys Ed Sophomore 3.23
## 6 Baelish Peter Communications Freshman 2.84
## 7 Stark Eddard History Senior 2.78
## 8 Tarly Samwise Nordic Studies Freshman 2.39
## 9 Bolton Ramsay Phys Ed Freshman 2.24
## 10 Baratheon Joffrey History Freshman 1.87
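Two students share a gpa of 3.23, so their relative order in the output above depends on where they happen to sit in the data. As a small sketch, arrange accepts additional columns that act as tie-breakers. The data frame below recreates only the lastname and gpa columns of got inline so the example runs on its own.

```r
library(dplyr)

# Two columns of the got data from the notes, recreated inline.
got <- data.frame(
  lastname = c("Snow", "Lannister", "Targaryen", "Bolton", "Stark",
               "Clegane", "Baelish", "Baratheon", "Drogo", "Tarly"),
  gpa      = c(3.23, 3.83, 3.36, 2.24, 2.78, 3.23, 2.84, 1.87, 3.38, 2.39),
  stringsAsFactors = FALSE
)

# Sort by gpa (highest first); break ties alphabetically by lastname.
sorted <- got %>% arrange(desc(gpa), lastname)
sorted
```

With the second sort key, Clegane is listed before Snow among the tied rows.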
Suppose I wanted to create a dean’s list column called dlist and set it to TRUE if the gpa >= 3.5 and FALSE otherwise. I would use mutate for that. Note: in this example, the column is only created in the output and the data frame is unaltered.
got %>% rename(surname = lastname) %>% mutate(dlist = gpa >= 3.5)
## surname firstname major year gpa dlist
## 1 Snow John Nordic Studies Junior 3.23 FALSE
## 2 Lannister Tyrion Communications Sophomore 3.83 TRUE
## 3 Targaryen Daenerys Zoology Freshman 3.36 FALSE
## 4 Bolton Ramsay Phys Ed Freshman 2.24 FALSE
## 5 Stark Eddard History Senior 2.78 FALSE
## 6 Clegane Gregor Phys Ed Sophomore 3.23 FALSE
## 7 Baelish Peter Communications Freshman 2.84 FALSE
## 8 Baratheon Joffrey History Freshman 1.87 FALSE
## 9 Drogo Khal Zoology Senior 3.38 FALSE
## 10 Tarly Samwise Nordic Studies Freshman 2.39 FALSE
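To keep the new column, the result has to be assigned to a variable. The sketch below recreates part of got inline; the name got_dlist is just an illustrative choice.

```r
library(dplyr)

# Part of the got data from the notes, recreated inline.
got <- data.frame(
  lastname = c("Snow", "Lannister", "Targaryen", "Bolton", "Stark",
               "Clegane", "Baelish", "Baratheon", "Drogo", "Tarly"),
  gpa      = c(3.23, 3.83, 3.36, 2.24, 2.78, 3.23, 2.84, 1.87, 3.38, 2.39),
  stringsAsFactors = FALSE
)

# Printing the result of mutate does not change got itself.
got %>% mutate(dlist = gpa >= 3.5)
names(got)        # still only lastname and gpa

# Assigning the result keeps the new column around for later use.
got_dlist <- got %>% mutate(dlist = gpa >= 3.5)
names(got_dlist)  # now includes dlist
```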
If I just wanted to show my transformed variables and no other variables, I could use transmute.
got %>% transmute(name = paste(firstname, lastname), dlist = gpa >= 3.5)
## name dlist
## 1 John Snow FALSE
## 2 Tyrion Lannister TRUE
## 3 Daenerys Targaryen FALSE
## 4 Ramsay Bolton FALSE
## 5 Eddard Stark FALSE
## 6 Gregor Clegane FALSE
## 7 Peter Baelish FALSE
## 8 Joffrey Baratheon FALSE
## 9 Khal Drogo FALSE
## 10 Samwise Tarly FALSE
If we wanted to list the majors represented in the got data frame, we would use distinct, which restricts the output to unique (distinct) rows.
got %>% distinct(major)
## major
## 1 Nordic Studies
## 2 Communications
## 3 Zoology
## 4 Phys Ed
## 5 History
Aggregation often adds the most complexity to a query and it is quite common to see summarize combined with group_by. For example, if we wanted to show the average gpa for each major, we would use group_by to declare that we are doing a calculation for each major and use summarize to define the mean calculation. You’ll notice that instead of a data frame, we are outputting a tibble, which is essentially an enhanced data frame that can store more complex data.
got %>% group_by(major) %>% summarize(average_gpa = mean(gpa))
## # A tibble: 5 x 2
## major average_gpa
## <chr> <dbl>
## 1 Communications 3.335
## 2 History 2.325
## 3 Nordic Studies 2.810
## 4 Phys Ed 2.735
## 5 Zoology 3.370
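One way to see what group_by contributes is to run the same summarize with and without it. This is a minimal sketch using the major and gpa columns of got recreated inline; the students column (built with n()) is an extra added here for illustration.

```r
library(dplyr)

# Two columns of the got data from the notes, recreated inline.
got <- data.frame(
  major = c("Nordic Studies", "Communications", "Zoology", "Phys Ed",
            "History", "Phys Ed", "Communications", "History",
            "Zoology", "Nordic Studies"),
  gpa   = c(3.23, 3.83, 3.36, 2.24, 2.78, 3.23, 2.84, 1.87, 3.38, 2.39),
  stringsAsFactors = FALSE
)

# Without group_by: one row summarizing the whole data frame.
overall <- got %>%
  summarize(average_gpa = mean(gpa), students = n())

# With group_by: one row per major.
by_major <- got %>%
  group_by(major) %>%
  summarize(average_gpa = mean(gpa), students = n())

overall
by_major
```

The grouping determines how many rows come out: one for the ungrouped data, and one per group otherwise.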
Suppose we wanted to show the name of the student with the highest gpa for each major. We could do this in a few different ways. In all cases, since we are doing it for each major, we will be using group_by(major). In the first case, after grouping, we sort in descending gpa order and slice out the first (1) row for each major.
got %>% group_by(major) %>% arrange(desc(gpa)) %>% slice(1)
## Source: local data frame [5 x 5]
## Groups: major [5]
##
## lastname firstname major year gpa
## <chr> <chr> <chr> <chr> <dbl>
## 1 Lannister Tyrion Communications Sophomore 3.83
## 2 Stark Eddard History Senior 2.78
## 3 Snow John Nordic Studies Junior 3.23
## 4 Clegane Gregor Phys Ed Sophomore 3.23
## 5 Drogo Khal Zoology Senior 3.38
In the second case, we decide we want to use the top_n function.
got %>% group_by(major) %>% arrange(desc(gpa)) %>% top_n(1)
## Selecting by gpa
## Source: local data frame [5 x 5]
## Groups: major [5]
##
## lastname firstname major year gpa
## <chr> <chr> <chr> <chr> <dbl>
## 1 Lannister Tyrion Communications Sophomore 3.83
## 2 Drogo Khal Zoology Senior 3.38
## 3 Snow John Nordic Studies Junior 3.23
## 4 Clegane Gregor Phys Ed Sophomore 3.23
## 5 Stark Eddard History Senior 2.78
In the third case, we use the min_rank function within filter.
got %>% group_by(major) %>% filter(min_rank(desc(gpa)) == 1)
## Source: local data frame [5 x 5]
## Groups: major [5]
##
## lastname firstname major year gpa
## <chr> <chr> <chr> <chr> <dbl>
## 1 Snow John Nordic Studies Junior 3.23
## 2 Lannister Tyrion Communications Sophomore 3.83
## 3 Stark Eddard History Senior 2.78
## 4 Clegane Gregor Phys Ed Sophomore 3.23
## 5 Drogo Khal Zoology Senior 3.38
This may seem somewhat confusing, so it is best to describe what is going on here. top_n, which we introduced last week, is an easier to use “wrapper” function that combines filter and min_rank. slice was added later to dplyr to make it simpler to select rows other than just the top. For example, if I wanted to select positions 2 through 4, I would use slice(2:4). There is no equivalent top_n for this and I would end up resorting to the harder to follow filter(min_rank(...) %in% 2:4). To simplify, you should try to get comfortable with slice but feel free to use top_n as well.
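The two approaches are not quite interchangeable when there are ties, which the sketch below tries to show. It recreates the lastname and gpa columns of got inline; slice picks rows by position, while the min_rank filter picks rows by rank, so the two students tied at 3.23 both come through in the second query.

```r
library(dplyr)

# Two columns of the got data from the notes, recreated inline.
got <- data.frame(
  lastname = c("Snow", "Lannister", "Targaryen", "Bolton", "Stark",
               "Clegane", "Baelish", "Baratheon", "Drogo", "Tarly"),
  gpa      = c(3.23, 3.83, 3.36, 2.24, 2.78, 3.23, 2.84, 1.87, 3.38, 2.39),
  stringsAsFactors = FALSE
)

# By position: rows 2 through 4 of the sorted data (always three rows).
by_position <- got %>% arrange(desc(gpa)) %>% slice(2:4)

# By rank: every row whose rank is 2, 3, or 4 (tied rows all qualify).
by_rank <- got %>% filter(min_rank(desc(gpa)) %in% 2:4)

by_position
by_rank
```

Which one is appropriate depends on whether tied students should all be included or whether a fixed number of rows is wanted.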
As defined in the dplyr documentation, a window function is a variation on an aggregation function. Where an aggregation function, like sum() and mean(), takes n inputs and returns a single value, a window function returns n values.
The window functions we’ll be dealing with in this class are often ranking functions (like min_rank()) and offset functions (like lag() which was introduced in the last unit). If you have ever worked with relational databases, window functions are commonly implemented in SQL.
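The distinction between the two kinds of function is easy to check on a small vector. The five values below are the first five gpas from got, used here only as sample input.

```r
library(dplyr)

gpa <- c(3.23, 3.83, 3.36, 2.24, 2.78)

# Aggregation function: five inputs, one output.
avg <- mean(gpa)
length(avg)

# Window function: five inputs, five outputs (one rank per value).
ranks <- min_rank(desc(gpa))
ranks
length(ranks)
```

Because a window function returns one value per row, it fits naturally inside mutate and filter, whereas an aggregation function fits inside summarize.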
The ranking and ordering functions you may use in dplyr are:
- row_number() which gives every row a unique rank, breaking ties by order of appearance
- min_rank() which allows for gaps in ranks (e.g., if two rows are tied for first, the next rank is third)
- dense_rank() which doesn’t allow for gaps in ranks (e.g., if two rows are tied for first, the next rank is second)
- percent_rank() a number between 0 and 1 computed by rescaling min_rank to [0, 1].
- cume_dist() a cumulative distribution function. Proportion of all values less than or equal to the current rank.
- ntile() a rough rank, which breaks the input vector into n buckets

If you look at a ranking of gpas in the got data, 3.23 is tied for fourth place, and 2.84, which is the 6th row in the arranged data frame, would be in sixth place using min_rank() and fifth place using dense_rank().
got %>% filter(row_number(desc(gpa)) == 6)
## lastname firstname major year gpa
## 1 Baelish Peter Communications Freshman 2.84
got %>% filter(min_rank(desc(gpa)) == 6)
## lastname firstname major year gpa
## 1 Baelish Peter Communications Freshman 2.84
got %>% filter(dense_rank(desc(gpa)) == 5)
## lastname firstname major year gpa
## 1 Baelish Peter Communications Freshman 2.84
We could also use the slice verb to accomplish the same thing.
got %>% arrange(desc(gpa)) %>% slice(6)
## lastname firstname major year gpa
## 1 Baelish Peter Communications Freshman 2.84
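It can help to see the three ranking functions side by side rather than one filter at a time. The sketch below recreates the lastname and gpa columns of got inline; the column names row_num, min_rk, and dense_rk are made up for this example.

```r
library(dplyr)

# Two columns of the got data from the notes, recreated inline.
got <- data.frame(
  lastname = c("Snow", "Lannister", "Targaryen", "Bolton", "Stark",
               "Clegane", "Baelish", "Baratheon", "Drogo", "Tarly"),
  gpa      = c(3.23, 3.83, 3.36, 2.24, 2.78, 3.23, 2.84, 1.87, 3.38, 2.39),
  stringsAsFactors = FALSE
)

# Rank every student three ways, then sort so the ranks are easy to compare.
ranks <- got %>%
  mutate(row_num  = row_number(desc(gpa)),   # unique ranks, ties split
         min_rk   = min_rank(desc(gpa)),     # ties share a rank, gap after
         dense_rk = dense_rank(desc(gpa))) %>%  # ties share a rank, no gap
  arrange(row_num)
ranks
```

The three columns agree until the tie at 3.23 and diverge from that row onward.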
We’ll add the columns p_rank, cdist and ntile to show you how the remaining ranking functions work. We’ll use four buckets for ntile().
got %>% select(lastname, firstname, gpa) %>% arrange(desc(gpa)) %>%
mutate(p_rank = percent_rank(gpa), cdist = cume_dist(gpa),
ntile = ntile(gpa, 4))
## lastname firstname gpa p_rank cdist ntile
## 1 Lannister Tyrion 3.83 1.0000000 1.0 4
## 2 Drogo Khal 3.38 0.8888889 0.9 4
## 3 Targaryen Daenerys 3.36 0.7777778 0.8 3
## 4 Snow John 3.23 0.5555556 0.7 3
## 5 Clegane Gregor 3.23 0.5555556 0.7 3
## 6 Baelish Peter 2.84 0.4444444 0.5 2
## 7 Stark Eddard 2.78 0.3333333 0.4 2
## 8 Tarly Samwise 2.39 0.2222222 0.3 1
## 9 Bolton Ramsay 2.24 0.1111111 0.2 1
## 10 Baratheon Joffrey 1.87 0.0000000 0.1 1
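The p_rank and cdist columns can be reproduced by hand, which may make the definitions less abstract. This sketch assumes the descriptions given earlier: percent_rank rescales min_rank to [0, 1], and cume_dist is the proportion of values less than or equal to the current one. The names p_rank_manual and cdist_manual are illustrative.

```r
library(dplyr)

# The gpa column of got from the notes, in its original row order.
gpa <- c(3.23, 3.83, 3.36, 2.24, 2.78, 3.23, 2.84, 1.87, 3.38, 2.39)
n <- length(gpa)

# percent_rank: shift min_rank to start at 0, then divide by n - 1.
p_rank_manual <- (min_rank(gpa) - 1) / (n - 1)

# cume_dist: share of all values that are <= each value.
cdist_manual <- sapply(gpa, function(g) mean(gpa <= g))

# Both comparisons should report TRUE.
all.equal(p_rank_manual, percent_rank(gpa))
all.equal(cdist_manual, cume_dist(gpa))
```

For example, 2.84 has four values below it, giving 4 / 9 for percent_rank, and five values at or below it, giving 5 / 10 for cume_dist.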
The offset functions you may use in dplyr are:
- lag() returns the previous value in the vector - introduced in the last unit.
- lead() returns the next value in a vector - the opposite of lag()

If we wanted to know the gpa of the next better (lag()) and next worst (lead()) students, I would use:
got %>% arrange(desc(gpa)) %>% mutate(nxt_better = lag(gpa), nxt_worst = lead(gpa))
## lastname firstname major year gpa nxt_better nxt_worst
## 1 Lannister Tyrion Communications Sophomore 3.83 NA 3.38
## 2 Drogo Khal Zoology Senior 3.38 3.83 3.36
## 3 Targaryen Daenerys Zoology Freshman 3.36 3.38 3.23
## 4 Snow John Nordic Studies Junior 3.23 3.36 3.23
## 5 Clegane Gregor Phys Ed Sophomore 3.23 3.23 2.84
## 6 Baelish Peter Communications Freshman 2.84 3.23 2.78
## 7 Stark Eddard History Senior 2.78 2.84 2.39
## 8 Tarly Samwise Nordic Studies Freshman 2.39 2.78 2.24
## 9 Bolton Ramsay Phys Ed Freshman 2.24 2.39 1.87
## 10 Baratheon Joffrey History Freshman 1.87 2.24 NA
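Offset functions are most useful when combined with arithmetic on the current row. As a small sketch, the query below computes how far each student is from the next better and next worse gpa; the column names gap_to_better and gap_to_worse are invented for this example, and only two columns of got are recreated inline.

```r
library(dplyr)

# Two columns of the got data from the notes, recreated inline.
got <- data.frame(
  lastname = c("Snow", "Lannister", "Targaryen", "Bolton", "Stark",
               "Clegane", "Baelish", "Baratheon", "Drogo", "Tarly"),
  gpa      = c(3.23, 3.83, 3.36, 2.24, 2.78, 3.23, 2.84, 1.87, 3.38, 2.39),
  stringsAsFactors = FALSE
)

# After sorting, lag() looks one row up and lead() looks one row down.
gaps <- got %>%
  arrange(desc(gpa)) %>%
  mutate(gap_to_better = lag(gpa) - gpa,
         gap_to_worse  = gpa - lead(gpa))
gaps
```

The top student has no one above and the bottom student has no one below, so those two gaps are NA.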
We’ve covered a good portion of dplyr and most of what you’ll be using for the remainder of the semester. Time to help solidify your querying skills with some practice questions.
Starting with this unit, we are at a point where I’m going to give you a few practice (i.e., no stakes) questions where the code is folded (hidden). You should try your hardest to replicate the output but if you get stuck, you can always click on the code button to see the code.