Cronkite R Cheat Sheet

The Verbs

| This verb | Does this | Example | Notes |
|---|---|---|---|
| arrange | Sorts the rows by the columns listed | arrange (last_name, first_name) | Use arrange (desc (col_name)) to go from highest to lowest. |
| select | Picks out columns to use in this chunk | select (newname = last_name, first_name, salary); select (1:3) to get the first three columns | Use back-ticks to enclose multi-word column names. Generally use it last in the chain of commands, because a column you didn’t select can’t be used in any later step. Names are case-sensitive. You can rename at the same time. |
| filter | Picks out rows to use in this chunk | filter (last_name == "Batts") | Case-sensitive. Don’t forget the double-equals. Use != for “not equals” or ! before any condition to negate it. |
| mutate | Creates new columns from old ones | mutate (diff = salary2021 - salary2020, pct_diff = diff / salary2020 * 100) | See below for common functions and formulas used in mutate statements. |
| group_by | Creates piles out of the data, usually used with summarise | group_by (last_name) | If you group by more than one column, summarise only peels off the last grouping column, so the commands that follow still work within the remaining groups. (See below.) To make sure it goes back to normal, add .groups = "drop" inside summarise (), or finish with ungroup (). |
| summarise | Computes summary statistics like the count, sum, or median of values | summarise (no_of_rows = n(), sum_of_values = sum(salary, na.rm = TRUE), first_row_found = first(salary)) | Your resulting dataset will include only the group_by columns and these summary columns, no other details. Using it without a group_by will return one row with the statistics for the whole data frame. |
| inner_join, left_join | Combines data frames by common columns | my_data %>% inner_join (your_data, by = c("column_from_first_table" = "column_from_2nd_table")) | Use inner_join to keep only the rows that have a match in both data frames; use left_join to keep all of the rows from the first data frame plus whatever matches in the second. Try to rename columns that have the same names but aren’t used in the match, or you’ll end up with weird column names ending in .x and .y. |
| rename | Renames columns | rename (new_name = old_name) | I always do it backwards. |

For all of these verbs, the opening parenthesis has to be on the same line as the verb. So

    select 
       (last_name)

will generate an error.
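
As a rough sketch of how the verbs chain together, here they are on a tiny made-up data frame. The data and the names staff, offices, raises, by_name and surname are invented for illustration; only the verbs themselves come from this cheat sheet.

```r
library(dplyr)

# Made-up data, for illustration only
staff <- tibble(
  last_name  = c("Batts", "Batts", "Cohen", "Diaz"),
  first_name = c("Ann", "Bo", "Cy", "Di"),
  salary2020 = c(50000, 40000, 60000, NA),
  salary2021 = c(55000, 40000, 63000, 45000)
)
offices <- tibble(
  surname = c("Batts", "Cohen"),
  office  = c("Phoenix", "Tempe")
)

# filter -> mutate -> arrange -> select, with select last
raises <- staff %>%
  filter(!is.na(salary2020)) %>%
  mutate(
    diff = salary2021 - salary2020,
    pct_diff = diff / salary2020 * 100
  ) %>%
  arrange(desc(pct_diff)) %>%
  select(last_name, first_name, pct_diff)

# group_by + summarise; .groups = "drop" returns an ungrouped result
by_name <- staff %>%
  group_by(last_name) %>%
  summarise(
    no_of_rows = n(),
    total_2021 = sum(salary2021, na.rm = TRUE),
    .groups = "drop"
  )

# left_join keeps every row of staff, matched or not
with_office <- staff %>%
  left_join(offices, by = c("last_name" = "surname"))
```

In this sketch, Diaz has no match in offices, so left_join keeps the row and fills office with NA. An inner_join would drop it.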

Common errors and how to troubleshoot them

object not found

There are a gazillion reasons for this error, but R is basically telling you that it doesn’t have access to something you’re asking for. Common reasons for it are:

  • Spelling. Spelling and case-sensitivity. Spelling. The error marker (red) will be on the same line as the mistake.
  • Forgetting %>% between verbs. The error will be on the line below where the %>% is missing.
  • Forgetting to start with an existing data frame, unless you’re importing or loading data.
  • Forgetting to pour your answer into a new data frame using the <- before you start.
  • Using a library like the tidyverse without activating it using the library(tidyverse) command at the beginning. Don’t forget to run that setup chunk. (Hint: When you start up a document, use the Run All to get back to where you were.)
  • Trying to pick up where you left off if you close out R or knit a document. Try Run All to fix it.

Knitting error

This can come from any error in the document, and it’s hard to troubleshoot. Some hints:

  • Remember that knitting starts from scratch. Even if you can see something in your environment, R Markdown knits in a fresh environment with nothing in it.
  • If you see a “YAML” error, it means there’s something wrong in the very picky instructions at the very top between the --- lines. Indentation there is important. Just try to copy it from one that works.
  • Errors in your code will be shown among the junk you see. Double-click on the error message to take you to where it had trouble.
  • If you really can’t figure it out and just want to see what happens, add error=TRUE to the part in the setup just after where you see echo=TRUE in our template.
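
A minimal sketch of that setup line, assuming the template sets its options with knitr::opts_chunk$set() (the template itself isn't shown here, so its exact wording may differ):

```r
# Hypothetical contents of the setup chunk; the course template may set other options too
knitr::opts_chunk$set(echo = TRUE, error = TRUE)
```

With error = TRUE, knitting carries on past a chunk that fails and prints the error message in the output instead of stopping.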

Error in syntax

Common problems include:

  • Using = instead of == when using a condition
  • Punctuation within a verb phrase.

It can be hard to figure out this one otherwise. If you can’t see it right away, ask for help. You’ll get used to it.

Common functions used within the verbs

Functions can be used inside the verbs, so you’re not stuck with just what you see. A function is a sub-verb. It requires a noun (a column) to work on, and often some arguments (specific instructions) to make it work. The built-in “help” on these is almost impossible to read. Instead, try googling for examples or ask for help. Many require libraries to be loaded.

Text (string) functions

You can look some of these up, but just know they exist in case you need them and ask for help:

toupper( col_name ) or tolower( col_name ) : Convert to all upper or all lower case.

str_detect( col_name, pattern) : Instead of == so you can look for several things at once or inexact matches (See below)

splitting or extracting functions: These are ways to split apart phrases into more than one column, or pull out key phrases. They’re hard to use, but they exist. Get help if you want to do this.
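
For a sense of what these look like, here is a small sketch on made-up data. It assumes the stringr and tidyr libraries are installed; separate() and str_extract() are only two of several ways to split or extract, and the column names are invented.

```r
library(dplyr)
library(stringr)
library(tidyr)

# Made-up data, for illustration only
people <- tibble(
  full_name = c("Batts, Ann", "cohen, Cy"),
  phone     = c("602-555-0101", "(480) 555-0199")
)

cleaned <- people %>%
  mutate(
    name_upper = toupper(full_name),            # all upper case
    area_code  = str_extract(phone, "[0-9]{3}") # first run of three digits
  ) %>%
  # split one column into two at the comma-space; full_name itself is dropped
  separate(full_name, into = c("last_name", "first_name"), sep = ", ")
```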

Non-exact matches in filtering / mutate with str_detect()

This shouldn’t be so hard. In other languages, there are “wildcards” to look for phrases and you can specify whether it’s case-sensitive. Not in R.

Use str_detect() within a filter (to pick out rows that match) or mutate (to create new variables if there is a match). It detects whether or not a string or text variable contains what you’re looking for. It’s very powerful because it uses regular expressions (which we haven’t gone over yet). But at its simplest:

str_detect (what column are you looking at?, 
            what pattern should it have?)

This does NOT address case-sensitivity on its own. You have to handle that yourself.

Partial match:

filter ( str_detect ( col_name, "phrase") ) 

This will still be case-sensitive. To fix that, convert the column to lower case and write the phrase in lower case too:

filter ( str_detect ( tolower(col_name), "phrase") )

patterns include :

  • actual phrase or text, like "Smith" in a name field
  • phrase at the beginning of the column ("^Smith")
  • phrase at the end of the column ("Smith$")
  • Anything between two phrases ("J.*Smith")
  • Anything EXCEPT what you want ("Smith", negate=TRUE)
  • Any numerical digit ("\\d" or "[0-9]")

We’ll have a module on regular expressions later on. Let me know if you need to use one now, because they’re very very powerful.
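
To see what each of those patterns picks up, here is a small sketch on a made-up set of names. It assumes the stringr library (loaded with the tidyverse); the vector and its values are invented for illustration.

```r
library(stringr)

# Made-up values, for illustration only
people <- c("Smith, John", "John Smith", "Jane Smithers", "smith 42")

contains_smith <- str_detect(people, "Smith")                # anywhere, case-sensitive
starts_smith   <- str_detect(people, "^Smith")               # at the beginning
ends_smith     <- str_detect(people, "Smith$")               # at the end
j_then_smith   <- str_detect(people, "J.*Smith")             # J, then anything, then Smith
not_smith      <- str_detect(people, "Smith", negate = TRUE) # everything EXCEPT a match
has_digit      <- str_detect(people, "\\d")                  # any numerical digit
any_case       <- str_detect(tolower(people), "smith")       # ignore case via tolower()
```

Each result is a TRUE/FALSE vector, which is why it can go straight inside filter () or mutate ().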

Dealing with missing data

An NA in the data propagates. That means that anything that is NA will create NA in anything it touches, including sums.

  • Find rows with NA in a column: filter (is.na(col_name))
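  • Ignore NA’s in summary functions: sum(col_name, na.rm=TRUE). This drops them before calculating, which for a sum is the same as treating them as zero. That could be a problem if you combine it with a count, because n() still counts those rows.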
  • Keep only rows with key fields filled in: filter (!is.na(col1) & !is.na(col2))
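
A short sketch of how NA propagates and why mixing na.rm with a count can mislead. The data frame and column names are made up for illustration.

```r
library(dplyr)

# Made-up data with one missing salary
pay <- tibble(
  last_name = c("Batts", "Cohen", "Diaz"),
  salary    = c(50000, NA, 40000)
)

totals <- pay %>%
  summarise(
    naive_sum      = sum(salary),                 # NA, because one value is NA
    total          = sum(salary, na.rm = TRUE),   # 90000
    no_of_rows     = n(),                         # 3, counts the NA row too
    no_with_salary = sum(!is.na(salary)),         # 2
    avg_misleading = total / no_of_rows,          # 30000, divides by too many rows
    avg            = mean(salary, na.rm = TRUE)   # 45000
  )

# Keep only the rows where the key field is filled in
complete <- pay %>% filter(!is.na(salary))
```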

Importing or loading data

For data that starts as a spreadsheet or a CSV or other text file, use

    read_csv (comma-separated text file. Very common)
    read_tsv (tab-separated text file)
    read_delim (text file delimited by something else)
    read_excel (an XLSX file; needs the readxl library. It won't work on very old spreadsheets)
    

There is probably a library or a function that will read almost any data you run into. Below is a complex example that tells you how to change names, make sure that it reads columns as the TYPE you want, and deal with special issues in importing.
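
One possible sketch of such an import, using read_csv from the readr library. The file contents, column names and the "n/a" missing-value marker are all made up; the example writes its own small file so it can run anywhere, but normally you would just give the path to your file.

```r
library(readr)

# Write a made-up file so the example is self-contained
path <- tempfile(fileext = ".csv")
writeLines(c(
  "Payroll export (made-up example)",
  "Last Name,ZIP,Salary,Hire Date",
  "Batts,08540,50000,01/24/2021",
  "Cohen,85004,n/a,03/01/2020"
), path)

payroll <- read_csv(
  path,
  skip = 2,                          # skip the title line and the original header row
  col_names = c("last_name", "zip", "salary", "hire_date"),  # supply our own names
  col_types = cols(
    last_name = col_character(),
    zip       = col_character(),     # keep ZIP codes as text so leading zeros survive
    salary    = col_double(),
    hire_date = col_date(format = "%m/%d/%Y")
  ),
  na = c("", "NA", "n/a")            # treat these as missing
)
```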

There are two forms of native R files, which are loaded differently:

.Rda or .Rdata: These are files that are sort of like zipped files, and can contain lots of objects. To get them into your environment:

    load("folder/filename.ext") 

To get it from a web location, use :

load (url ( "https://....") )

They will create all of the objects (data frames, variables, etc) with the names that the person that made them wanted.

.RDS files: These are single files, and you have to tell R what you want to call it in your environment:

    mydata <- readRDS("path_to_filename")
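
To see the difference between the two kinds of files, this sketch saves and reloads some made-up objects using temporary files. The object names are invented for illustration.

```r
# Made-up objects, for illustration only
staff  <- data.frame(last_name = c("Batts", "Cohen"), salary = c(50000, 60000))
county <- "MARICOPA"

# .Rda: can hold several objects, which come back under their original names
rda_path <- tempfile(fileext = ".Rda")
save(staff, county, file = rda_path)
rm(staff, county)
load(rda_path)   # staff and county reappear in the environment

# .RDS: holds one object, and you choose the name when you read it
rds_path <- tempfile(fileext = ".RDS")
saveRDS(staff, rds_path)
mydata <- readRDS(rds_path)
```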

Conditional functions

if_else ()

Choose between two conditions:

  if_else ( what condition -- like a filter -- do you want to test?,
            what is the answer if it's true?,
            what is the answer if it's false?)

Real-life example:

    mutate ( in_my_county = 
                if_else ( county == "MARICOPA", 
                          "In Maricopa", 
                          "Not in Maricopa") 
          ) 
            

case_when ()

The same idea, but you can check all kinds of things at once. It’s hard to understand, but if you find yourself with lots of if_else statements, a case_when will probably be easier. Get help.

    case_when ( first condition ~ what to do if true,
                second condition ~ what to do if true,
                ....
                TRUE ~ what to do if nothing else is true)

If you leave out the last TRUE~ ending, it will set anything that met none of the conditions to NA.

This can be used on lots of different columns at the same time, so it can be quite powerful.
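
A small sketch of case_when inside a mutate, on made-up salaries. The cut-offs and labels are invented; the second column leaves off the final TRUE line to show the NA that results.

```r
library(dplyr)

# Made-up data, for illustration only
pay <- tibble(salary = c(30000, 60000, 120000, NA))

banded <- pay %>%
  mutate(
    band = case_when(
      salary < 50000   ~ "low",
      salary < 100000  ~ "middle",   # only reached if the first condition was not true
      salary >= 100000 ~ "high",
      TRUE             ~ "unknown"   # everything else, including NA salaries
    ),
    band_no_default = case_when(
      salary < 50000   ~ "low",
      salary >= 50000  ~ "not low"   # no TRUE line, so the NA salary stays NA
    )
  )
```

The conditions are checked in order, and each row takes the answer from the first condition that is true.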

Dates and times

Usually, you’ll use the lubridate library to deal with dates and times. Here are some examples:

Convert text to date, using a common format

my_date = mdy("1/24/2021") will turn it into a date.
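
A few more lubridate calls, sketched on made-up dates. The variable names are invented for illustration.

```r
library(lubridate)

my_date    <- mdy("1/24/2021")     # month/day/year text
other_date <- ymd("2021-03-01")    # year-month-day text

the_year    <- year(my_date)                      # 2021
the_month   <- month(my_date)                     # 1
month_start <- floor_date(my_date, "month")       # first day of that month
days_apart  <- as.numeric(other_date - my_date)   # 36 days between the two dates
```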

Misc things that are common

clean_names()

Part of the janitor library. Converts all column names to lower case with no spaces (which are converted to "_").
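
A minimal sketch, assuming the janitor library is installed. The messy column names are made up.

```r
library(janitor)

# check.names = FALSE keeps the messy names exactly as typed
messy <- data.frame(`Last Name` = "Batts", `ZIP Code` = "85004", check.names = FALSE)

tidy <- clean_names(messy)
names(tidy)   # "last_name" "zip_code"
```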

pivot_longer; pivot_wider

Turn a wide, many-columned data frame into a long, few-columned data frame (longer), or convert a long one back into a wide one (wider).

It’s common to convert everything to long when you’re doing your work, then pivot it back to wide when you want to print out the results.

It’s a bit tricky, but just know it exists.
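
A rough sketch of a round trip on made-up salary data, assuming the tidyr library (loaded with the tidyverse). The column names are invented for illustration.

```r
library(dplyr)
library(tidyr)

# Made-up wide data: one column per year
wide <- tibble(
  last_name  = c("Batts", "Cohen"),
  salary2020 = c(50000, 60000),
  salary2021 = c(55000, 63000)
)

# Longer: one row per person per year
long <- wide %>%
  pivot_longer(
    cols = starts_with("salary"),
    names_to = "year",
    names_prefix = "salary",   # strip "salary" so year holds "2020", "2021"
    values_to = "salary"
  )

# Wider: back to one column per year
back <- long %>%
  pivot_wider(
    names_from = year,
    values_from = salary,
    names_prefix = "salary"    # put "salary" back in front of each year
  )
```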