Using RStudio

RStudio is an Integrated Development Environment (IDE) for using the R language. It has a lot of useful tools for writing code, managing projects, viewing plots, and more.

RStudio is highly customizable, and allows you to change its appearance and general layout without much fuss. For instance, to rearrange the order of panes, open the options dialog from the menu below and look for the Pane Layout section:

  • RStudio > Preferences (Mac)
  • Tools > Options (Windows)

There are a lot of other features to take advantage of in RStudio, and many of them are bound to keyboard shortcuts. You can see those by going to Tools > Keyboard Shortcuts Help or pressing Option + Shift + K (Alt + Shift + K on Windows). You can also find a handy cheat sheet here.

Projects and Working Directories

Projects are an RStudio feature to help keep your code and working environments contained and organized, which comes in handy when you start to have multiple projects. To start a new project, you can click on the dropdown in the upper-righthand corner of RStudio and choose to begin a new project.

Even if you’re not using the RStudio projects feature, it’s still a good idea to keep work for any given project in a single directory (folder). You can make a new folder in Finder or File Explorer. Once you have that, you can set your working directory in R like this:

setwd("PATH/TO/PROJECT")

You can also see your current working directory by using this:

getwd()
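As a rough sketch of how these two functions work together, the snippet below creates a throwaway folder inside R's temporary directory, switches into it, and then switches back. The folder name my_project is made up for illustration; in practice you would point setwd() at your own project folder.

```r
# Remember where we started so we can return afterwards.
old_dir <- getwd()

# Hypothetical project folder inside R's temporary directory.
project_dir <- file.path(tempdir(), "my_project")
dir.create(project_dir, showWarnings = FALSE)

# Switch into the project folder and confirm the change.
setwd(project_dir)
current_folder <- basename(getwd())
print(current_folder)

# Switch back to the original working directory.
setwd(old_dir)
```

Saving the old directory first is a small habit that makes it easy to undo a setwd() call.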

R Basics

Creating Variables

You can use R to create variables. Variables can contain all kinds of information, from single values to entire data structures. We create variables using <-, called the assignment operator.

new_int <- 4 
new_int
## [1] 4

Variables will behave just like the values they represent. When we run cos() on new_int, it returns the same value as cos(4).

cos(new_int) 
## [1] -0.6536436
cos(4)
## [1] -0.6536436
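Variables are not limited to single numbers. The short sketch below (the variable names are chosen only for illustration) stores a vector of numbers and a piece of text, then checks what kind of value each one holds.

```r
# A numeric vector holding several values at once.
new_vec <- c(1, 2, 3)

# A character string.
new_chr <- "hello"

# class() reports what kind of value a variable holds.
class(new_vec)
class(new_chr)

# Arithmetic on a vector is applied to every element.
doubled <- new_vec * 2
doubled
```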

Functions

Functions are ways of running the same piece of code on something that changes. They can save us a lot of typing - one useful rule of thumb says that if you have to copy and paste the same code three times, you should write a function instead. Let’s try writing a simple function to show how this can work.

new_fun <- function(x) { 
  my_int <- x 
  your_int <- my_int * 2 
  cat("My integer is", my_int, "and your integer is", your_int, "\n")  # "\n" ends the line
}

Now it’s ready to be run!

new_fun(4)
## My integer is 4 and your integer is 8
new_fun(8)
## My integer is 8 and your integer is 16
new_fun(87732)
## My integer is 87732 and your integer is 175464
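The function above prints a message, but functions can also hand a value back so that it can be stored or reused. Here is a minimal sketch of that idea; the name double_value is hypothetical and not part of the lesson's own code.

```r
# A function returns the value of its last expression.
double_value <- function(x) {
  x * 2
}

# The result can be stored in a variable and used like any other value.
doubled <- double_value(4)
doubled + 1

# It also works on a whole vector, because * is applied element by element.
double_value(c(1, 2, 3))
```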

Exploring data

There are some functions and datasets built into R already. Let’s explore a few of them using a built-in dataset, mtcars.

data(mtcars)
mtcars

We can find out some things about the basic structure of our data.

str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
dim(mtcars)
## [1] 32 11
nrow(mtcars)
## [1] 32
ncol(mtcars)
## [1] 11
summary(mtcars)
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000

We can use specific parts of the data, too, such as the mpg variable. Then we can find out more about that with some built-in functions.

mtcars$mpg
##  [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
## [16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
## [31] 15.0 21.4
length(mtcars$mpg)
## [1] 32
mean(mtcars$mpg)
## [1] 20.09062
median(mtcars$mpg)
## [1] 19.2
prod(mtcars$mpg)
## [1] 1.264241e+41
sum(mtcars$mpg)
## [1] 642.9
sqrt(mtcars$mpg)
##  [1] 4.582576 4.582576 4.774935 4.626013 4.324350 4.254409 3.781534 4.939636
##  [9] 4.774935 4.381780 4.219005 4.049691 4.159327 3.898718 3.224903 3.224903
## [17] 3.834058 5.692100 5.513620 5.822371 4.636809 3.937004 3.898718 3.646917
## [25] 4.381780 5.224940 5.099020 5.513620 3.974921 4.438468 3.872983 4.626013
var(mtcars$mpg)
## [1] 36.3241
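The $ operator is only one of several ways to pull out part of a data frame. As a small sketch using base R only, square brackets can select rows and columns together, and sd() gives the standard deviation, which is the square root of the variance shown above.

```r
data(mtcars)

# Square brackets take rows first, then columns.
first_rows <- mtcars[1:3, c("mpg", "cyl")]
first_rows

# Double square brackets extract a single column, just like $.
identical(mtcars[["mpg"]], mtcars$mpg)

# The standard deviation is the square root of the variance.
mpg_sd <- sd(mtcars$mpg)
all.equal(mpg_sd, sqrt(var(mtcars$mpg)))
```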

Packages

People in different disciplines may write functions or create datasets that will be useful to others in their field. They can bundle these functions and datasets into packages that can be downloaded and used by others. There is a simple way to install packages with R:

install.packages(c("tidyverse", "nycflights13"))

Those packages are now installed on our computers, but we haven’t made the functions and data contained in them accessible to ourselves yet. To do that, we need another function:

library(tidyverse)
library(nycflights13)

We’ll get to those a little later!
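Installing a package only needs to happen once per computer, while library() is needed in every new R session. One possible way to avoid reinstalling packages you already have is sketched below. The helper name missing_packages is hypothetical, and the install line is left commented out so that nothing is downloaded by accident.

```r
# Return the packages from `pkgs` that are not installed yet.
# requireNamespace() is base R; it returns TRUE when a package can be loaded.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# "stats" ships with R, so nothing should be reported as missing here.
missing_packages("stats")

# Possible usage for this lesson (uncomment to actually install):
# to_install <- missing_packages(c("tidyverse", "nycflights13"))
# if (length(to_install) > 0) install.packages(to_install)
```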

Tidying Data

First, we’re going to read in some untidy data to practice some “pivoting” and “separating.”

cust <- read_csv("http://dartgo.org/untidy_csv")

This dataset contains information about how much each customer spent each season. If we remember our tidy data principles, each variable should be a column and each row should be an observation. What are the variables in this dataset?

Answer:

  • customer
  • year
  • season
  • spent

There are (mainly) two things that are messy about this dataset. The first is that values are stored as column names, and the second (once the first is resolved) is that we’ve got two variables stored in a single column/set of values.

We can fix that with a couple of functions from the tidyr package.

cust_long <- pivot_longer(cust, cols = 2:17, names_to = "year_season", values_to = "spent")

If we want to separate out the season and year, we can do that, too.

cust_tidy <- separate(cust_long, year_season, into = c("year", "season"), sep = "_")

Or, if we want a single line of code:

cust_tidy <- separate(pivot_longer(cust, cols = 2:17, names_to = "year_season", values_to = "spent"), year_season, into = c("year", "season"), sep = "_")

Using the pipe %>%, this can now become

cust_tidy <- cust %>%
  pivot_longer(cols = 2:17, names_to = "year_season", values_to = "spent") %>%
  separate(year_season, into = c("year", "season"), sep = "_")
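To see what these two steps do without downloading anything, here is a small sketch on a made-up table. The customers, column names, and amounts are invented for illustration and are not taken from the real file.

```r
library(dplyr)
library(tidyr)

# Made-up untidy data: one column per year and season.
toy <- tibble(
  customer = c("A", "B"),
  `2019_spring` = c(10, 20),
  `2019_fall` = c(5, 15)
)

toy_tidy <- toy %>%
  # Stack every column except customer into name/value pairs.
  pivot_longer(cols = -customer, names_to = "year_season", values_to = "spent") %>%
  # Split "2019_spring" into "2019" and "spring" at the underscore.
  separate(year_season, into = c("year", "season"), sep = "_")

toy_tidy
```

Selecting columns with -customer instead of positions such as 2:17 is one design choice among several; it keeps working if columns are added or reordered.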

Single-table verbs

Now that we’ve loaded the tidyverse, we have access to some of the data within it. We’re going to start off with some data about Star Wars to practice working with datasets.

data(starwars)

glimpse(starwars)
## Observations: 87
## Variables: 13
## $ name       <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia O…
## $ height     <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, …
## $ mass       <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77…
## $ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "brown", …
## $ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "light", …
## $ eye_color  <chr> "blue", "yellow", "red", "yellow", "brown", "blue", "blue"…
## $ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57.0,…
## $ gender     <chr> "male", NA, NA, "male", "female", "male", "female", NA, "m…
## $ homeworld  <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", "…
## $ species    <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", "Hum…
## $ films      <list> [<"Revenge of the Sith", "Return of the Jedi", "The Empir…
## $ vehicles   <list> [<"Snowspeeder", "Imperial Speeder Bike">, <>, <>, <>, "I…
## $ starships  <list> [<"X-wing", "Imperial shuttle">, <>, <>, "TIE Advanced x1…

Subsetting

The tidyverse (and specifically dplyr) comes with a few very useful functions for manipulating data. Let’s try doing some subsetting, which can be by variable (column) or by observation (row). Note that this is just one of many ways that we can use R for subsetting data. select() allows us to choose which columns to keep.

heights <- select(starwars, name, height)

We can also use the filter() function to subset by row.

humans <- filter(starwars, species == "Human")

We can filter on multiple criteria, too.

tall_humans <- filter(starwars, species == "Human" & height >= 180)

If we want to subset data using both, we can do that, too.

human_homeworlds <- select(filter(starwars, species == "Human"), name, homeworld, height)

There’s a clearer way to write this, though, using the pipe %>%.

human_homeworlds <- starwars %>%
  filter(species == "Human") %>%
  select(name, homeworld, height)

Activity (5 minutes)

Using filter(), select(), and the pipe (%>%), can you get a list of characters from Tatooine, their hair colors, and their eye colors? How about just humans from Tatooine?

tatooine <- starwars %>%
  filter(homeworld == "Tatooine") %>%
  select(name, homeworld, hair_color, eye_color)




tat_humans <- starwars %>%
  filter(homeworld == "Tatooine" & species == "Human") %>%
  select(name, homeworld, hair_color, eye_color)

Grouping / summarizing data

We can also use dplyr to calculate summary statistics for each group in our data, using group_by() and summarize().

species_heights <- starwars %>%
  group_by(species) %>%
  summarize(height = mean(height))

Something’s not right there…humans don’t have an average height. If we look at starwars, it’s because of the NA values for some humans. The default behavior of mean() is to return NA any time there’s a missing value in the list of things provided to it. It has an optional argument, though, that can fix this.

species_heights <- starwars %>%
  group_by(species) %>%
  summarize(height = mean(height, na.rm = TRUE))
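The effect of na.rm is easiest to see on a tiny vector. This sketch uses made-up numbers and base R only.

```r
# Made-up heights with one missing value.
heights <- c(150, NA, 170)

# By default, a single NA makes the whole result NA.
default_mean <- mean(heights)
default_mean

# na.rm = TRUE drops missing values before averaging.
clean_mean <- mean(heights, na.rm = TRUE)
clean_mean
```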

We can also see how many observations were included in each group by adding another variable.

species_heights <- starwars %>%
  group_by(species) %>%
  summarize(height = mean(height, na.rm = TRUE), n_obs = n())

This also works if we want to group by more than one variable.

species_heights <- starwars %>%
  group_by(species, gender) %>%
  summarize(height = mean(height, na.rm = TRUE), n_obs = n())

We can also create more than one variable using summarize().

species_heights <- starwars %>%
  group_by(species, gender) %>%
  # The mean gets its own name so that max() below still sees the original height column
  summarize(mean_height = mean(height, na.rm = TRUE), 
            tallest = max(height, na.rm = TRUE),
            n_obs = n())
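One thing to watch for when creating several variables at once: summarize() works through its arguments in order, so a new variable that reuses an existing column name replaces that column for the arguments that follow. The sketch below shows this on made-up data (the table and its values are invented for illustration).

```r
library(dplyr)

# Made-up data: two groups with known heights.
toy <- tibble(group = c("a", "a", "b"), height = c(100, 200, 150))

# Reusing the name `height` means max() sees the mean, not the raw values.
shadowed <- toy %>%
  group_by(group) %>%
  summarize(height = mean(height), tallest = max(height))

# Giving the mean a new name keeps the original column available to max().
distinct_names <- toy %>%
  group_by(group) %>%
  summarize(mean_height = mean(height), tallest = max(height))

shadowed
distinct_names
```

In the first result, tallest is just a copy of the group mean; in the second, it is the true maximum for each group.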

Activity (5 minutes)

Using group_by() and summarize(), get the mean and median mass of each species.

masses <- starwars %>%
  group_by(species) %>%
  summarize(mean_mass = mean(mass, na.rm = TRUE),
            median_mass = median(mass, na.rm = TRUE))

Let’s take a look at our flights data to lead into the next activity.

data(flights)

Activity (5 minutes)

Using the flights data from nycflights13, find the mean departure delay (dep_delay) by departing airport (origin).

mean_delays <- flights %>%
  group_by(origin) %>%
  summarize(mean_delay = mean(dep_delay, na.rm = TRUE))

Mutate

One last thing we’ll use is mutate(), which lets us add on new columns.

double_height <- starwars %>%
  mutate(height_double = height * 2)
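mutate() can also create several columns in one call, and later columns can build on earlier ones. As a sketch, assuming height is recorded in centimeters and mass in kilograms, the code below converts height to meters and then uses it to compute a body mass index. The column names height_m and bmi are made up for this example.

```r
library(dplyr)

starwars_bmi <- starwars %>%
  select(name, height, mass) %>%
  mutate(
    # Assumes height is in centimeters.
    height_m = height / 100,
    # Uses the height_m column created on the line above.
    bmi = mass / height_m^2
  )

head(starwars_bmi)
```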

Two-table verbs

We’re mostly going to be using our flights data (along with a couple of others) for this part. Two-table verbs from dplyr allow us to combine multiple tables into one, much like we did in the class on database design.

Two-table verbs, as the name suggests, require more than one table. So let’s also load in some other datasets that can be paired to the flights data.

data(airlines)
data(weather)
data(airports)

Now, let’s subset our flights down to only keep the variables that will help us understand how our joins work.

flights2 <- flights %>%
  select(year, month, day, hour, origin, dest, tailnum, carrier)

Testing joins

result <- flights2 %>% 
  left_join(airlines)
## Joining, by = "carrier"

We used a left join here - it keeps every row from the left-hand table and brings in data that matches from the right-hand table. This is one of several types of joins. dplyr borrows its terminology around data joins from SQL.

Let’s try another join

result <- flights2 %>% 
  left_join(weather)
## Joining, by = c("year", "month", "day", "hour", "origin")

So far these have been “natural joins,” which is to say that dplyr uses all variables that appear in both tables. This can cause problems when you have two tables with same-name variables that contain different information. For instance:

result <- flights2 %>%
  left_join(planes)
## Joining, by = c("year", "tailnum")

Chat question

In the chat: why did this cause an issue?

Here’s how we resolve it:

result <- flights2 %>%
  left_join(planes, by = "tailnum")
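The same situation can be reproduced on a pair of tiny made-up tables, which makes the effect of the by argument easier to inspect. Everything in this sketch (table names, tail numbers, years, seat counts) is invented for illustration.

```r
library(dplyr)

# Made-up tables that share two column names: year and tailnum.
flights_toy <- tibble(year = c(2013, 2013), tailnum = c("N1", "N2"))
planes_toy <- tibble(year = c(1999, 2004), tailnum = c("N1", "N2"),
                     seats = c(100, 150))

# Natural join: matches on both year and tailnum.
natural <- suppressMessages(left_join(flights_toy, planes_toy))

# Explicit join: matches on tailnum only.
by_tailnum <- left_join(flights_toy, planes_toy, by = "tailnum")

natural
by_tailnum
```

When a column exists in both tables but is not used for matching, dplyr keeps both copies and adds the suffixes .x and .y to tell them apart.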

Another problem arises if the matching variable between two tables has a different name, as with flights and airports (“origin” / “dest” and “faa” in each table, respectively).

result <- flights2 %>%
  left_join(airports)
## Error: `by` required, because the data sources have no common variables

When this comes up, we can resolve it by specifying matching variables with a named character vector:

result <- flights2 %>%
  left_join(airports, by = c("origin" = "faa"))

result <- flights2 %>%
  left_join(airports, by = c("dest" = "faa"))

Other “mutating” joins

We’ve been using a left_join so far, but there are a few others worth exploring. Let’s do this with smaller datasets to see how they work.

data(band_instruments)
data(band_members)

right_join(x, y) is basically the equivalent of left_join(y, x).

left_example <- left_join(band_members, band_instruments)
right_example <- right_join(band_members, band_instruments)

inner_join is a way to only keep rows that match in both tables.

inner_example <- inner_join(band_members, band_instruments)

full_join is a way to keep all rows from both tables, even when there isn’t a match.

full_example <- full_join(band_members, band_instruments)
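A quick way to compare the four mutating joins is to count the rows each one returns. The sketch below repeats the joins with by = "name" written out (the column the two tables share), which also silences the "Joining, by" message.

```r
library(dplyr)

left_rows  <- nrow(left_join(band_members, band_instruments, by = "name"))
right_rows <- nrow(right_join(band_members, band_instruments, by = "name"))
inner_rows <- nrow(inner_join(band_members, band_instruments, by = "name"))
full_rows  <- nrow(full_join(band_members, band_instruments, by = "name"))

# Collect the counts in one named vector for easy comparison.
c(left = left_rows, right = right_rows, inner = inner_rows, full = full_rows)
```

Each table has three rows but only two names appear in both, so the inner join is the smallest result and the full join is the largest.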

“Filtering” joins

Filtering joins only ever return rows (and columns) from the left-hand data frame, and are often used to help you make determinations about your data before joining another way.

A semi-join will give you a result that shows all of the rows in the left-hand table which have matches in the right-hand table, but doesn’t bring in the extra columns from the right-hand table.

semi_example <- semi_join(band_members, band_instruments)

An anti-join is almost the opposite of an inner join - it will return all rows of the left-hand table that don’t have matches in the right-hand table. This is useful for understanding what would be excluded if you were to perform an inner join.

anti_example <- anti_join(band_members, band_instruments)
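One way to check your understanding of the filtering joins is to confirm that, between them, they account for every row of the left-hand table. This sketch uses the same two band tables, with by = "name" written out.

```r
library(dplyr)

# Members who have a matching row in band_instruments.
with_match <- semi_join(band_members, band_instruments, by = "name")

# Members who have no matching row in band_instruments.
without_match <- anti_join(band_members, band_instruments, by = "name")

with_match
without_match

# Together, the two results cover every row of band_members.
nrow(with_match) + nrow(without_match) == nrow(band_members)
```

Note that neither result contains the plays column; filtering joins never add columns from the right-hand table.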

Sources

This lesson uses material from the following: