R for Everything: Getting, Cleaning, Visualizing, and Analyzing Data

David Ranzolin
November 17, 2016

About Me

Background

English and Religious Studies
UC Scout for ~ three years
useR for ~ two years
@RCatLadies for ~ two years

About UC Scout

Overview

UC's online high school program, serving several thousand California students and teachers each year.

Mission

Reach out to educationally disadvantaged students, raising achievement levels and closing achievement gaps.

News

Recipient of a $4 million grant from the state legislature to develop 45 new courses.

Valid Inferences?

Premises

R is flexible.
R is powerful.
R is affordable.
R is accessible.

Conclusions

You can use R.
You should use R?

About R

Accessibility

Free, open source

Community

Diverse (relative to other programming languages)
Attentive to beginners
Innovative

Tools

RStudio
tidyverse

The tidyverse

Philosophy

Data should be in a consistent form (data frame)
Each variable is a column
Sequence clear, logical steps

Installation

install.packages("tidyverse")
library(tidyverse) #loads readr, dplyr, tidyr, purrr, ggplot2, tibble

The tidyverse: readr

Purpose

Fast and friendly way to read rectangular data.

Usage

To…	Use…
Read delimited files	`read_delim()`, `read_csv()`, `read_tsv()`
Read lines	`read_lines()`
Read fixed width files	`read_fwf()`, `read_table()`

df <- tibble(x = 1:3, y = x^2, z = sample(letters, 3, replace = TRUE))
write_csv(df, "my_csv.csv")
read_csv("my_csv.csv")

# A tibble: 3 × 3
      x     y     z
  <int> <dbl> <chr>
1     1     1     v
2     2     4     z
3     3     9     i
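
By default readr guesses each column's type. A minimal sketch of declaring the types explicitly follows; the `scores` data and the temporary file are made up for illustration. An ID such as "001" is a typical column you would want kept as text.

```r
library(readr)
library(tibble)

# Hypothetical example data, written to a temporary file
scores <- tibble(id = c("001", "002", "003"), score = c(91.5, 78, 84.25))
path <- tempfile(fileext = ".csv")
write_csv(scores, path)

# Declare the column types instead of relying on readr's guesses
scores_in <- read_csv(
  path,
  col_types = cols(id = col_character(), score = col_double())
)
scores_in
```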

The tidyverse: dplyr

Purpose

Manipulate data frames: filtering, selecting, mutating, etc.

Usage

To…	Use…
Select columns	`select()`
Subset rows	`filter()`
Create additional columns	`mutate()`
Calculate summary statistics	`summarize()`
Order rows	`arrange()`
Perform joins	`inner_join()`, `left_join()`, `anti_join()`, etc.
Group	`group_by()`

The tidyverse: dplyr

select(iris, Sepal.Length, Petal.Length, Species)

filter(iris, Sepal.Length > 7)

mutate(iris, sepal = Sepal.Length + Sepal.Width)

summarize(iris, avg = mean(Sepal.Length))

The pipe (%>%) chains tidyverse functions together:

iris %>% 
  filter(Sepal.Length > 4) %>% 
  group_by(Species) %>% 
  summarize(avg = mean(Sepal.Width)) %>% 
  arrange(desc(avg))

# A tibble: 3 × 2
     Species   avg
      <fctr> <dbl>
1     setosa 3.428
2  virginica 2.974
3 versicolor 2.770
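
The join verbs listed in the usage table are not demonstrated above. A small sketch with two made-up tables (`students` and `grades` are hypothetical) shows how three of them differ:

```r
library(dplyr)
library(tibble)

# Hypothetical tables: students and their recorded scores
students <- tibble(id = 1:4, name = c("Ana", "Ben", "Cy", "Di"))
grades   <- tibble(id = c(1L, 2L, 2L, 5L), score = c(92, 71, 88, 65))

# Keep only students that have at least one score (one row per match)
matched <- inner_join(students, grades, by = "id")

# Keep every student; score is NA where nothing matched
everyone <- left_join(students, grades, by = "id")

# Keep only students with no score at all
missing_scores <- anti_join(students, grades, by = "id")
```

Here `matched` has three rows (Ben appears twice), `everyone` has five, and `missing_scores` contains Cy and Di.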

The tidyverse: tidyr

Purpose

Tidy and reshape data.

Usage

To…	Use…
Make wide data long	`gather()`
Make long data wide	`spread()`

table4a %>% gather(year, cases, -country)  # table4a ships with current tidyr releases
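
To make the reshaping concrete, here is a sketch on a tiny hand-built table (the `wide` data is invented for illustration): `gather()` stacks the year columns into key/value pairs, and `spread()` reverses the operation.

```r
library(tidyr)
library(tibble)

# Hypothetical wide data: one column per year
wide <- tibble(
  country = c("A", "B"),
  `1999`  = c(10, 20),
  `2000`  = c(15, 25)
)

# Wide to long: every column except country becomes a (year, cases) pair
long <- gather(wide, year, cases, -country)

# Long to wide: one column per distinct year again
wide_again <- spread(long, year, cases)
```

Newer tidyr releases also offer `pivot_longer()` and `pivot_wider()` for the same job.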

The tidyverse: purrr

Purpose

Work with lists and facilitate iteration.

Usage

To…	Use…
Apply a function to each element	`map()`, `map_*()`
Transpose a list	`transpose()`
Flatten a list	`flatten()`
Control error handling	`safely()`, `possibly()`

map_dbl(1:3, log, base = 2)

[1] 0.000000 1.000000 1.584963
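
The error-handling helpers in the table can be sketched the same way. Assuming a list with one bad input (the `inputs` list is made up), `possibly()` wraps a function so that a failure yields a default value instead of stopping the whole iteration:

```r
library(purrr)

# Hypothetical inputs: the second element cannot be logged
inputs <- list(8, "oops", 2)

# Return NA instead of raising an error
log2_or_na <- possibly(function(x) log(x, base = 2), otherwise = NA_real_)

logged <- map_dbl(inputs, log2_or_na)
logged
```

The result has one value per input, with `NA` in the position of the element that failed.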

The tidyverse: purrr

Using purrr to calculate Ed-Data's Ethnic Diversity Index (EDI)

edi <- function(df) {
  if (!is.data.frame(df)) stop("df must be a data frame")
  if (!"ethnicity" %in% names(df)) stop("ethnicity must be a column")
  ur <- c("Decline/Don't Know", "Other", "")
  # Fraction of all students whose ethnicity is unreported
  ur_fraction <- sum(df$ethnicity %in% ur) / nrow(df)
  diversity_rating <- df %>% 
    filter(!ethnicity %in% ur) %>% 
    split(.$ethnicity) %>% 
    map(~ nrow(.)/nrow(df)/(1 - ur_fraction)) %>% 
    map_dbl(~ (. - (1/13))^2) %>%  #There are thirteen reported ethnicities
    sum(.) %>% 
    sqrt(.)
  c2 <- -100 * sqrt(13*(13-1))/(13-1)
  100 + (c2 * diversity_rating)
}
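
As a sanity check on the scaling constant, here is a standalone sketch of the final two steps. It assumes the group shares among reported students are already computed, and the helper `edi_from_shares()` is hypothetical, not part of the original script. With the constant used above, an even spread across all groups should score 100 and a single-group population should score about 0.

```r
# Hypothetical helper: shares are proportions of reported students
# in each of k groups and must sum to 1
edi_from_shares <- function(shares, k = 13) {
  stopifnot(length(shares) == k, isTRUE(all.equal(sum(shares), 1)))
  # Euclidean distance from a perfectly even distribution
  distance <- sqrt(sum((shares - 1 / k)^2))
  c2 <- -100 * sqrt(k * (k - 1)) / (k - 1)
  100 + c2 * distance
}

even_score   <- edi_from_shares(rep(1 / 13, 13))   # evenly spread: 100
single_score <- edi_from_shares(c(1, rep(0, 12)))  # one group only: approximately 0
```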

The tidyverse: ggplot2

[Plot: ggplot2 example; image not preserved]
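
The plot from this slide is not reproduced here. As a stand-in, this is a minimal sketch of the ggplot2 pattern (data, aesthetic mapping, geom, labels) using the built-in `iris` data. The choice of variables and labels is an assumption, not the original figure.

```r
library(ggplot2)

# Map columns to aesthetics, then add a layer and labels
p <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point() +
  labs(
    title = "Iris sepal dimensions",
    x = "Sepal length (cm)",
    y = "Sepal width (cm)"
  )
p
```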

The tidyverse: tibble

Purpose

Data frames with nicer behavior around printing and subsetting.

Usage

df1 <- tibble(x1 = 1:3, y1 = 1, z1 = x1 ^ 2 + y1)
df1

# A tibble: 3 × 3
     x1    y1    z1
  <int> <dbl> <dbl>
1     1     1     2
2     2     1     5
3     3     1    10

df1$x

Warning: Unknown column 'x'

NULL
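
For contrast, a short sketch of how a base data frame behaves with the same kind of columns (the object names are illustrative): `$` partially matches column names, and single-column subsetting drops to a vector, whereas a tibble does neither.

```r
library(tibble)

base_df <- data.frame(x1 = 1:3, y1 = 1)
tbl     <- tibble(x1 = 1:3, y1 = 1)

# Base data frame: "x" silently matches column x1
partial <- base_df$x

# Tibble: no partial matching, so the result is NULL (with a warning)
strict <- suppressWarnings(tbl$x)

# Single-column subsetting: vector versus tibble
base_col <- base_df[, "x1"]
tbl_col  <- tbl[, "x1"]
```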

Showcase #1: A tidyverse script

Requirement

Email each student earning less than 80%, as well as their parents and school counselors.

Solution

R, tidyverse, rcanvas, and gmailr.

Step 1: Get course data from LMS

library(rcanvas)

premium_courses <- get_course_list() %>% 
  filter(grepl("Premium", name))

get_emails_and_grades <- function(id) {
  emails <- get_course_items(id, "users", include = "email") %>% 
    select(name, sis_user_id, sis_login_id, email)
  grades <- get_course_items(id, "enrollments") %>% 
    filter(enrollment_state == "active") %>% 
    select(id, user.name, user.sis_user_id, user.sis_login_id, grades.current_score)
  # Return one row per user with email and current score
  left_join(emails, grades, by = c("sis_user_id" = "user.sis_user_id"))
}

safe_function <- safely(get_emails_and_grades)

student_data <- premium_courses$id %>% 
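  map(safe_function) %>% 
  map("result") %>%   # safely() returns list(result, error); keep the results
  compact() %>%       # drop courses whose request failed (NULL result)
  bind_rows() %>% 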
  left_join(premium_courses, by = c("course_id.x" = "id")) %>%
  select(name, sis_user_id, email, course_id.x, grades.current_score, sis_course_id)
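
The `safely()` pattern in Step 1 can be tried without an LMS. In this sketch `fetch_course()` is a hypothetical stand-in for an API call that fails for one id. Successful results are combined, and the error messages are collected separately.

```r
library(purrr)
library(dplyr)
library(tibble)

# Hypothetical stand-in for an API request; id 2 fails
fetch_course <- function(id) {
  if (id == 2) stop("course not found")
  tibble(course_id = id, n_students = id * 10)
}

safe_fetch <- safely(fetch_course)
results <- map(c(1, 2, 3), safe_fetch)

# Each element is list(result = ..., error = ...)
courses <- results %>% map("result") %>% compact() %>% bind_rows()
failures <- results %>% map("error") %>% compact() %>% map_chr(conditionMessage)
```

One failing request no longer aborts the whole loop, and `failures` records what went wrong.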

Step 2: Get and tidy student contact data from SIS (CSV file)

student_contact <- read_csv("student_contact.csv") %>% 
  select(sis_user_id = `Student ID`, Student, Question, Answer, email = Email) %>%
  spread(Question, Answer)

# A tibble: 3 × 5
  sis_user_id        Student                  Email   `Parent Email`
        <chr>          <chr>                  <chr>            <chr>
1    A0004325 David Ranzolin  dranzolin@ucscout.org info@ucscout.org
2    A0004375   Sajira Awang     sawang@ucscout.org info@ucscout.org
3    A0004925 Lisa Dominguez ldominguez@ucscout.org info@ucscout.org
# ... with 1 more variables: `Counselor Email` <chr>
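
The SIS export stores one row per student per question, which is why `spread()` is needed. A sketch with invented rows (all names and addresses below are placeholders) shows the reshaping in isolation:

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical long-format export: one row per student per question
contact_long <- tibble(
  `Student ID` = c("S1", "S1", "S2", "S2"),
  Student  = c("Student One", "Student One", "Student Two", "Student Two"),
  Question = c("Parent Email", "Counselor Email", "Parent Email", "Counselor Email"),
  Answer   = c("p1@example.org", "c1@example.org", "p2@example.org", "c2@example.org")
)

# One row per student, one column per question
contact_wide <- contact_long %>%
  rename(sis_user_id = `Student ID`) %>%
  spread(Question, Answer)
```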

Step 3: Join LMS and SIS data together

email_df <- student_contact %>% 
  left_join(student_data, by = "sis_user_id") %>% 
  filter(grades.current_score < 80) %>% 
  select(-Sections, -course_id.x) %>% 
  rename(counselor_email = `Counselor Email`,
         parent_email = `Parent Email`)

Step 4: Prepare email components

subject <- "UC Scout Weekly Grade Update"
email_sender <- 'UC Scout <info@ucscout.org>' 
body <- "Dear %s,

We're writing to inform you that your current grade in %s is %s. You can view your course progress in your Online Classroom (classroom.ucscout.org). 
Please let us know if you need any assistance. You can also contact your teacher with any further questions or concerns.

Best wishes,
The Scout Team"

email_df2 <- email_df %>%
  mutate(
    To = sprintf('%s <%s>, <%s>, <%s>', name, email, counselor_email, parent_email),
    From = email_sender,
    Subject = subject,
    Body = sprintf(body, name, sis_course_id, grades.current_score)) %>%
  select(To, From, Subject, Body)
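
`sprintf()` is vectorized, so a single template fills in one message per row inside `mutate()`. A reduced sketch follows; the `recipients` data and the one-line template are made up.

```r
library(dplyr)
library(tibble)

# Hypothetical recipients
recipients <- tibble(
  name   = c("Student One", "Student Two"),
  course = c("Algebra 1", "Biology"),
  score  = c(72.5, 64)
)

template <- "Dear %s, your current grade in %s is %s."

# Each %s is filled from the matching column, row by row
messages <- recipients %>%
  mutate(Body = sprintf(template, name, course, score))
messages$Body
```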

Step 5: Send the emails!

library(gmailr)
emails <- email_df2 %>% 
  pmap(mime)

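# Authenticate with the Gmail API. gmailr >= 1.0.0 renames these functions
# with a gm_ prefix: gm_auth_configure(), gm_mime(), gm_send_message().
use_secret_file("client_secret_PROJ-NAME.json")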

safe_send_message <- safely(send_message)
sent_mail <- emails %>% 
  map(safe_send_message)
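
`pmap()` calls a function once per row, passing each column as a named argument, which is why the columns are named To, From, Subject, and Body. The sketch below shows the mechanics without sending anything: `make_message()` and `send_stub()` are hypothetical stand-ins, not gmailr functions.

```r
library(purrr)
library(tibble)

outbox <- tibble(
  To      = c("a@example.org", "b@example.org"),
  Subject = c("Hello", "Update"),
  Body    = c("First", "Second")
)

# Hypothetical constructor: argument names match the column names
make_message <- function(To, Subject, Body) {
  list(to = To, subject = Subject, body = Body)
}
drafts <- pmap(outbox, make_message)

# Hypothetical sender that fails for one address
send_stub <- function(msg) {
  if (msg$to == "b@example.org") stop("delivery failed")
  paste("sent to", msg$to)
}

# safely() captures the failure so the remaining messages still go out
outcomes <- map(drafts, safely(send_stub))
n_failed <- sum(map_lgl(outcomes, ~ !is.null(.x$error)))
```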

Showcase #2: HTML Reports with rmarkdown

Requirement

Produce a reproducible report with plots, tables, and prose commentary.

Solution

R, rmarkdown, and knitr.

Link: http://rpubs.com/daranzolin/cair-2016-report

Showcase #3: Web application with flexdashboard and shiny

Requirement

Create and share an interactive dashboard.

Solution

R, shiny, and flexdashboard.

Link: https://daranzolin.shinyapps.io/cair2016app/
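
For readers new to shiny, this is a minimal sketch of the reactive pattern behind such a dashboard: an input control, a rendered output, and a server function connecting them. It is not the linked application; the layout and the use of `iris` are assumptions for illustration.

```r
library(shiny)

# UI: one input control and one plot placeholder
ui <- fluidPage(
  titlePanel("Iris explorer"),
  selectInput("species", "Species", choices = levels(iris$Species)),
  plotOutput("scatter")
)

# Server: the plot re-renders whenever input$species changes
server <- function(input, output, session) {
  output$scatter <- renderPlot({
    selected <- iris[iris$Species == input$species, ]
    plot(selected$Sepal.Length, selected$Sepal.Width,
         xlab = "Sepal length (cm)", ylab = "Sepal width (cm)")
  })
}

app <- shinyApp(ui, server)
# runApp(app)  # uncomment to launch locally
```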

Other Fun R Things

Tufte-style handouts
Storyboards
gganimate
bookdown
tidytext
Templates for journals, CVs, etc.
Shiny Gadgets
RStudio Add-ins
#rstats

Valid Inferences?

Premises

R is flexible.
R is powerful.
R is affordable.
R is accessible.

Inferences

You can use R.
You should use R?

David Ranzolin

Email: dranzolin@ucscout.org
Twitter: @daranzolin
GitHub: github.com/daranzolin
Blog: daranzolin.github.io