RStudio is an Integrated Development Environment (IDE) for using the R language. It has a lot of useful tools for writing code, managing projects, viewing plots, and more.
RStudio is highly customizable, and allows you to change appearances and general layout without much fuss. For instance, to rearrange the order of panes, you can do the following:
RStudio > Preferences (Mac)
Tools > Options (Windows)
There are a lot of other features to take advantage of in RStudio, and many of them are bound to shortcuts. You can see those by going to Tools > Keyboard Shortcuts Help or pressing option + shift + K (Alt + Shift + K on Windows). You can also find a handy cheat sheet here.
Projects are an RStudio feature to help keep your code and working environments contained and organized, which comes in handy when you start to have multiple projects. To start a new project, you can click on the dropdown in the upper-righthand corner of RStudio and choose to begin a new project.
Even if you’re not using the RStudio projects feature, it’s still a good idea to keep work for any given project in a single directory (folder). You can make a new folder in Finder or File Explorer. Once you have that, you can set your working directory in R like this:
setwd("PATH/TO/PROJECT")
You can also see your current working directory by using this:
getwd()
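As a small sketch of the same idea, the project folder can also be created from within R. The folder name my_project and the use of R's temporary directory are assumptions made for illustration; in practice you would point this at your own project location.

```r
# Hypothetical project folder inside R's temporary directory
project_dir <- file.path(tempdir(), "my_project")

# Create the folder only if it does not exist yet
if (!dir.exists(project_dir)) {
  dir.create(project_dir)
}

# setwd() invisibly returns the previous working directory, so we can keep it
old_wd <- setwd(project_dir)
getwd()

# Switch back when finished
setwd(old_wd)
```

Building the path with file.path() avoids typing separators by hand, which differ between Windows and Mac.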
You can use R to create variables. Variables can contain all kinds of information, from single values to entire data structures. We create variables using <-, called the assignment operator.
new_int <- 4
new_int
## [1] 4
Variables will behave just like the values they represent. When we run cos() on new_int, it returns the same value as cos(4).
cos(new_int)
## [1] -0.6536436
cos(4)
## [1] -0.6536436
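To illustrate that variables can hold more than a single number, here is a short sketch. The names new_chr, new_vec, and new_df and their values are made up for this example.

```r
# A character value
new_chr <- "four"

# A numeric vector built with c()
new_vec <- c(1, 2, 3, 4)

# A small data frame with two columns (made-up values)
new_df <- data.frame(id = 1:3, score = c(90, 85, 77))

class(new_chr)
length(new_vec)
dim(new_df)
```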
Functions are ways of running the same piece of code on something that changes. They can save us a lot of typing - a useful rule of thumb is that if you have to copy and paste the same code three times, you should write a function instead. Let’s try writing a simple function to show how this can work.
new_fun <- function(x) {
  my_int <- x
  your_int <- my_int * 2
  cat("My integer is", my_int, "and your integer is", your_int)
}
Now it’s ready to be run!
new_fun(4)
## My integer is 4 and your integer is 8
new_fun(8)
## My integer is 8 and your integer is 16
new_fun(87732)
## My integer is 87732 and your integer is 175464
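The function above prints a message. As a complementary sketch, a function can instead return a value so the result can be stored or passed on. The name double_it is hypothetical and not part of the lesson.

```r
# Hypothetical helper: returns the doubled value instead of printing it
double_it <- function(x) {
  x * 2
}

# The returned value can be stored in a variable
doubled <- double_it(4)
doubled
## [1] 8

# Because the result is returned, it can feed into other functions
sqrt(double_it(8))
## [1] 4
```

In R, the value of the last expression in the function body is what gets returned, so an explicit return() is optional here.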
There are some functions and datasets built into R already. Let’s explore a few of them using a built-in dataset, mtcars.
data(mtcars)
mtcars
We can find out some things about the basic structure of our data.
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
dim(mtcars)
## [1] 32 11
nrow(mtcars)
## [1] 32
ncol(mtcars)
## [1] 11
summary(mtcars)
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
We can use specific parts of the data, too, such as the mpg variable. Then we can find out more about that with some built-in functions.
mtcars$mpg
## [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
## [16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
## [31] 15.0 21.4
length(mtcars$mpg)
## [1] 32
mean(mtcars$mpg)
## [1] 20.09062
median(mtcars$mpg)
## [1] 19.2
prod(mtcars$mpg)
## [1] 1.264241e+41
sum(mtcars$mpg)
## [1] 642.9
sqrt(mtcars$mpg)
## [1] 4.582576 4.582576 4.774935 4.626013 4.324350 4.254409 3.781534 4.939636
## [9] 4.774935 4.381780 4.219005 4.049691 4.159327 3.898718 3.224903 3.224903
## [17] 3.834058 5.692100 5.513620 5.822371 4.636809 3.937004 3.898718 3.646917
## [25] 4.381780 5.224940 5.099020 5.513620 3.974921 4.438468 3.872983 4.626013
var(mtcars$mpg)
## [1] 36.3241
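Two further things that may be useful here, shown as a rough sketch: the standard deviation is the square root of the variance, and square brackets can pick out parts of a column. The variable name high_mpg is made up for this example.

```r
data(mtcars)

# Standard deviation is the square root of the variance
sd(mtcars$mpg)
sqrt(var(mtcars$mpg))

# Square brackets pick out elements by position
mtcars$mpg[1:3]
## [1] 21.0 21.0 22.8

# A logical condition keeps only the matching values
high_mpg <- mtcars$mpg[mtcars$mpg > 30]
high_mpg
## [1] 32.4 30.4 33.9 30.4
```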
People in different disciplines may write functions or create datasets that will be useful to others in their field. They can bundle these functions and datasets into packages that can be downloaded and used by others. There is a simple way to install packages with R:
install.packages(c("tidyverse", "nycflights13"))
Those packages are now installed on our computers, but we haven’t made the functions and data contained in them accessible to ourselves yet. To do that, we need another function:
library(tidyverse)
library(nycflights13)
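Installing only needs to happen once per computer, while library() is needed in every new session. One possible way to avoid reinstalling each time is sketched below. The helper name install_if_missing is hypothetical; it relies on requireNamespace(), which returns TRUE when a package is already installed.

```r
# Hypothetical helper: install only the packages that are not yet available
install_if_missing <- function(pkgs) {
  # requireNamespace() returns TRUE when a package is already installed
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  missing_pkgs <- pkgs[!installed]
  if (length(missing_pkgs) > 0) {
    install.packages(missing_pkgs)
  }
  # Return the names that had to be installed, without printing them
  invisible(missing_pkgs)
}

install_if_missing(c("tidyverse", "nycflights13"))
```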
We’ll get to those a little later!
First, we’re going to read in some untidy data to practice some “pivoting” and “separating.”
cust <- read_csv("http://dartgo.org/untidy_csv")
This dataset contains information about how much each customer spent each season. If we remember our tidy data principles, each variable should be a column, each row should be an observation. What are the variables in this dataset?
Answer:
customer
year
season
spent
There are (mainly) two things that are messy about this dataset. The first is that values are stored as column names, and the second (once the first is resolved) is that we’ve got two variables stored in a single column/set of values.
We can fix that with a couple of functions from the tidyr package.
cust_long <- pivot_longer(cust, cols = 2:17, names_to = "year_season", values_to = "spent")
If we want to separate out the season and year, we can do that, too.
cust_tidy <- separate(cust_long, year_season, into = c("year", "season"), sep = "_")
Or, if we want a single line of code:
cust_tidy <- separate(pivot_longer(cust, cols = 2:17, names_to = "year_season", values_to = "spent"), year_season, into = c("year", "season"), sep = "_")
Using the pipe %>%, this can now become
cust_tidy <- cust %>%
  pivot_longer(cols = 2:17, names_to = "year_season", values_to = "spent") %>%
  separate(year_season, into = c("year", "season"), sep = "_")
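To see what these two steps do without downloading anything, here is a sketch on a tiny made-up table in the same untidy shape. The object toy, its column names, and its values are assumptions for illustration and are not the contents of the lesson's CSV.

```r
library(dplyr)
library(tidyr)

# Made-up data: one row per customer, one column per year_season
toy <- data.frame(
  customer = c("A", "B"),
  `2019_winter` = c(10, 20),
  `2019_summer` = c(15, 25),
  check.names = FALSE
)

toy_tidy <- toy %>%
  # Stack every column except customer into name/value pairs
  pivot_longer(cols = -customer, names_to = "year_season", values_to = "spent") %>%
  # Split "2019_winter" into "2019" and "winter"
  separate(year_season, into = c("year", "season"), sep = "_")

toy_tidy
```

The sketch selects columns with cols = -customer ("everything except customer") rather than by position, which is one option when the number of columns may change.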
Now that we’ve loaded in the tidyverse, we have access to some of the data within it. We’re going to start off with some data about Star Wars to practice working with data sets.
data(starwars)
glimpse(starwars)
## Observations: 87
## Variables: 13
## $ name <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia O…
## $ height <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, …
## $ mass <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77…
## $ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "brown", …
## $ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "light", …
## $ eye_color <chr> "blue", "yellow", "red", "yellow", "brown", "blue", "blue"…
## $ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57.0,…
## $ gender <chr> "male", NA, NA, "male", "female", "male", "female", NA, "m…
## $ homeworld <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", "…
## $ species <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", "Hum…
## $ films <list> [<"Revenge of the Sith", "Return of the Jedi", "The Empir…
## $ vehicles <list> [<"Snowspeeder", "Imperial Speeder Bike">, <>, <>, <>, "I…
## $ starships <list> [<"X-wing", "Imperial shuttle">, <>, <>, "TIE Advanced x1…
The tidyverse (and specifically dplyr) comes with a few very useful functions for manipulating data. Let’s try doing some subsetting, which can be by variable (column) or by observation (row). Note that this is just one of many ways that we can use R for subsetting data. select() allows us to choose which columns to keep.
heights <- select(starwars, name, height)
We can also use the filter() function to subset by row.
humans <- filter(starwars, species == "Human")
We can filter on multiple criteria, too.
tall_humans <- filter(starwars, species == "Human" & height >= 180)
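As a brief sketch of how criteria combine: filter() treats conditions separated by commas the same way as conditions joined with &, and | can be used for "or". The object names below are made up for this example.

```r
library(dplyr)

# Same filter written with & and with a comma
with_and   <- filter(starwars, species == "Human" & height >= 180)
with_comma <- filter(starwars, species == "Human", height >= 180)

# Both versions keep the same number of rows
nrow(with_and) == nrow(with_comma)

# | means "or": keep rows where either condition holds
human_or_droid <- filter(starwars, species == "Human" | species == "Droid")
```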
If we want to subset data using both, we can do that, too.
human_homeworlds <- select(filter(starwars, species == "Human"), name, homeworld, height)
There’s a clearer way to write this, though, using the pipe %>%.
human_homeworlds <- starwars %>%
  filter(species == "Human") %>%
  select(name, homeworld, height)
Using filter(), select(), and the pipe (%>%), can you get a list of characters from Tatooine, their hair colors, and their eye colors? How about just humans from Tatooine?
tatooine <- starwars %>%
  filter(homeworld == "Tatooine") %>%
  select(name, homeworld, hair_color, eye_color)
tat_humans <- starwars %>%
  filter(homeworld == "Tatooine" & species == "Human") %>%
  select(name, homeworld, hair_color, eye_color)
We can also use dplyr to get some statistics that are important to us using group_by() and summarize().
species_heights <- starwars %>%
  group_by(species) %>%
  summarize(height = mean(height))
Something’s not right there…humans don’t have an average height. If we look at starwars, it’s because of the NA values for some humans. The default behavior of mean() is to return NA any time there’s a missing value in the list of things provided to it. It has an optional argument, though, that can fix this.
species_heights <- starwars %>%
  group_by(species) %>%
  summarize(height = mean(height, na.rm = TRUE))
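The effect of na.rm is easiest to see on a small vector. This is a minimal sketch with made-up values; heights_cm is not part of the starwars data.

```r
# A made-up vector with one missing value
heights_cm <- c(172, NA, 96)

# By default, a single NA makes the result NA
mean(heights_cm)
## [1] NA

# na.rm = TRUE drops missing values before averaging
mean(heights_cm, na.rm = TRUE)
## [1] 134

# Count how many values are missing
sum(is.na(heights_cm))
## [1] 1
```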
We can also see how many observations were included in each group by adding another variable.
species_heights <- starwars %>%
  group_by(species) %>%
  summarize(height = mean(height, na.rm = TRUE), n_obs = n())
This also works if we want to group by more than one variable.
species_heights <- starwars %>%
  group_by(species, gender) %>%
  summarize(height = mean(height, na.rm = TRUE), n_obs = n())
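We can also create more than one variable using summarize(). Note that summarize() evaluates its arguments in order, so the mean is stored under a new name (mean_height) to keep max() working on the original height values.
species_heights <- starwars %>%
  group_by(species, gender) %>%
  summarize(mean_height = mean(height, na.rm = TRUE),
            tallest = max(height, na.rm = TRUE),
            n_obs = n())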
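To see why the name of a summary column matters, here is a sketch on a small made-up tibble. Inside summarize(), a later expression sees columns created by an earlier one, so reusing the name height replaces the raw values with the group mean. The objects toy, overwritten, and kept are hypothetical.

```r
library(dplyr)

# Made-up data: two groups with known heights
toy <- tibble(
  group  = c("a", "a", "b"),
  height = c(100, 200, 150)
)

# Reusing the name `height` means max() sees the group mean, not the raw values
overwritten <- toy %>%
  group_by(group) %>%
  summarize(height = mean(height), tallest = max(height))

# Giving the mean its own name keeps the original column available
kept <- toy %>%
  group_by(group) %>%
  summarize(mean_height = mean(height), tallest = max(height))

overwritten$tallest
## [1] 150 150
kept$tallest
## [1] 200 150
```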
Using group_by() and summarize(), get the mean and median mass of each species.
masses <- starwars %>%
  group_by(species) %>%
  summarize(mean_mass = mean(mass, na.rm = TRUE),
            median_mass = median(mass, na.rm = TRUE))
Let’s take a look at our flights data to lead into the next activity.
data(flights)
Using the flights data from nycflights13, find the mean departure delay (dep_delay) by departing airport (origin).
mean_delays <- flights %>%
  group_by(origin) %>%
  summarize(mean_delay = mean(dep_delay, na.rm = TRUE))
One last thing we’ll use is mutate(), which lets us add on new columns.
double_height <- starwars %>%
  mutate(height_double = height * 2)
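mutate() can also build one new column from another that was created in the same call. The sketch below assumes, as the dplyr documentation for starwars states, that height is in centimeters and mass in kilograms. The column names height_m and bmi are made up for this example.

```r
library(dplyr)

with_bmi <- starwars %>%
  # height_m is created first, then used right away to compute bmi
  mutate(height_m = height / 100,
         bmi = mass / height_m^2) %>%
  select(name, height_m, mass, bmi)
```

Rows with a missing height or mass simply get NA for bmi, so no rows are dropped.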
We’re mostly going to be using our flights data (along with a couple of others) for this part. Two-table verbs from dplyr allow us to combine multiple tables into one, much like we did in the class on database design.
Two-table verbs, as the name suggests, require more than one table. So let’s also load in some other datasets that can be paired to the flights data.
data(airlines)
data(weather)
data(airports)
Now, let’s subset our flights down to only keep the variables that will help us understand how our joins work.
flights2 <- flights %>%
  select(year, month, day, hour, origin, dest, tailnum, carrier)
result <- flights2 %>%
  left_join(airlines)
## Joining, by = "carrier"
We used a left join here - it keeps every row from the left-hand table and brings in matching data from the right-hand table. This is one of several types of joins. dplyr borrows its join terminology from SQL.
Let’s try another join.
result <- flights2 %>%
  left_join(weather)
## Joining, by = c("year", "month", "day", "hour", "origin")
So far these have been “natural joins,” which is to say that dplyr uses all variables that appear in both tables. This can cause problems when you have two tables with same-name variables that contain different information. For instance:
result <- flights2 %>%
  left_join(planes)
## Joining, by = c("year", "tailnum")
In the chat: why did this cause an issue?
Here’s how we resolve it:
result <- flights2 %>%
  left_join(planes, by = "tailnum")
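A small sketch can make the difference visible. The tables toy_flights and toy_planes are made up: both have a year column, but it is meant as the flight year in one and the year the plane was built in the other.

```r
library(dplyr)

# Made-up tables that share a `year` column with different meanings
toy_flights <- tibble(tailnum = c("N1", "N2"), year = c(2013, 2013))
toy_planes  <- tibble(tailnum = c("N1", "N2"),
                      year    = c(1999, 2013),   # year the plane was built
                      seats   = c(100, 150))

# Natural join: matches on tailnum AND year, so N1 finds no match
natural <- left_join(toy_flights, toy_planes)
natural$seats   # NA for N1 because its two year values differ

# Explicit key: every tailnum matches, and both year columns survive
keyed <- left_join(toy_flights, toy_planes, by = "tailnum")
names(keyed)    # tailnum, year.x, year.y, seats
```

When a non-key column appears in both tables, dplyr keeps both copies and adds the suffixes .x (left table) and .y (right table) to tell them apart.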
Another problem arises if the matching variable between two tables has a different name, as with flights and airports (“origin” / “dest” and “faa” in each table, respectively).
result <- flights2 %>%
  left_join(airports)
## Error: `by` required, because the data sources have no common variables
When this comes up, we can resolve it by specifying matching variables with a named character vector:
result <- flights2 %>%
  left_join(airports, by = c("origin" = "faa"))
result <- flights2 %>%
  left_join(airports, by = c("dest" = "faa"))
We’ve been using a left_join so far, but there are a few others worth exploring. Let’s do this with smaller datasets to see how they work.
data(band_instruments)
data(band_members)
right_join(x, y) is basically the equivalent of left_join(y, x).
left_example <- left_join(band_members, band_instruments)
right_example <- right_join(band_members, band_instruments)
inner_join is a way to only keep rows that match in both tables.
inner_example <- inner_join(band_members, band_instruments)
full_join is a way to keep all rows from both tables, even when there isn’t a match.
full_example <- full_join(band_members, band_instruments)
Filtering joins keep rows from the left-hand data frame, and are often used to help you make determinations about your data before joining another way.
A semi-join will give you a result that shows all of the rows in the left-hand table which have matches in the right-hand table, but doesn’t bring in the extra columns from the right-hand table.
semi_example <- semi_join(band_members, band_instruments)
An anti-join is almost the opposite of an inner join - it will return all rows of the left-hand table that don’t have matches in the right-hand table. This is useful for understanding what would be excluded if you were to perform an inner join.
anti_example <- anti_join(band_members, band_instruments)
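One way to compare the joins side by side is to count the rows each one returns. This sketch names the key explicitly with by = "name", which is the column the two band tables share. The list name joins is made up for this example.

```r
library(dplyr)

# Run every join on the same pair of tables
joins <- list(
  left  = left_join(band_members, band_instruments, by = "name"),
  right = right_join(band_members, band_instruments, by = "name"),
  inner = inner_join(band_members, band_instruments, by = "name"),
  full  = full_join(band_members, band_instruments, by = "name"),
  semi  = semi_join(band_members, band_instruments, by = "name"),
  anti  = anti_join(band_members, band_instruments, by = "name")
)

# Number of rows returned by each join
sapply(joins, nrow)
```

With the band data that ships with dplyr, this should give 3, 3, 2, 4, 2, and 1 rows respectively: two names (John and Paul) appear in both tables, Mick appears only in band_members, and Keith only in band_instruments.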
This lesson uses material from the following: