Visualization Coding Exercise #3 (Homework #3)

1. In class you learned about the built-in R data set iris. You were introduced to iris.wide, iris.wide2 and iris.tidy. Produce the code to create these data sets. Remember, the course notes for the week #4 lecture have screenshots of what the final data sets should look like.

Hint: problem 3c will help you figure out how to create iris.tidy, and problem 4c will help you figure out how to create iris.wide. See what you can figure out on your own to create iris.wide2!

library(tidyr)

iris$Flower <- 1:nrow(iris)
iris.wide <- iris %>%
  gather(key, value, -Flower, -Species) %>%
  separate(key, c("Part", "Measure"), "\\.") %>%
  spread(Measure, value)
#delete this and insert your code to create iris.wide2

# Start from the unmodified built-in data: the Flower column added above
# must not be gathered, because separate() cannot split the key "Flower"
iris.tidy <- datasets::iris %>%
  gather(key, Value, -Species) %>%
  separate(key, c("Part", "Measure"), "\\.")

2. In the week #4 lecture, you were shown different ggplot2 calls to plot two groups of data onto the same plot.

library(ggplot2)

# Option 1
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point() +
  geom_point(aes(x = Petal.Length, y = Petal.Width), col = "red")

# Option 2
ggplot(iris.wide, aes(x = Length, y = Width, col = Part)) +
  geom_point()

Which one is preferable? You have access to iris and you created iris.wide in problem #1 so you can experiment with both of these pieces of code. State Option 1 or Option 2 as your final answer.

3. So far you’ve seen four different forms of the iris dataset: iris, iris.wide, iris.wide2 and iris.tidy. Don’t let all these different forms confuse you! It’s exactly the same data, just rearranged so that your plotting functions become easier. To see this in action, work through the parts below.

a. Consider the plot shown in Figure 1 of the Word document associated with this assignment. Which form of the dataset would be the most appropriate to use to create the graph shown in Figure 1? Remember, you can use str() to look at the structures of the different data frames. Choose one answer from: iris, iris.wide, iris.wide2 or iris.tidy. State one and only one data frame.

Hint: working through the parts below will help you answer part a.
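As a rough illustration of the str() tip, the sketch below prints the structure of the built-in data next to a long-format version of it. The name iris.tidy.sketch is hypothetical and is used only so that nothing in your workspace is overwritten; the pipeline itself mirrors the one used elsewhere in this assignment.

```r
library(tidyr)

# Hypothetical name, built from the unmodified built-in data
iris.tidy.sketch <- datasets::iris %>%
  gather(key, Value, -Species) %>%
  separate(key, c("Part", "Measure"), "\\.")

# Compare the column names and types of the two forms
str(datasets::iris)
str(iris.tidy.sketch)
```

The column names that str() reports are the names you can map to aesthetics inside aes().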

b. Fill in the ggplot function with the appropriate data frame and variable names. The variable names used in the aesthetics of the plot will match the ones you found with str() in part a.

library(ggplot2)
# Think about which dataset you would use to get the plot shown in Figure 1 
# Fill in the ___ to produce the plot shown in Figure 1
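# Part c states that this plot is built from iris.tidy; the mappings below
# assume Figure 1 shows Value by Species, coloured by Part, one panel per Measure
ggplot(iris.tidy, aes(x = Species, y = Value, color = Part)) +
  geom_jitter() +
  facet_grid(. ~ Measure)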

c. In part b, you saw how iris.tidy was used to make a specific plot. It is important to know how to rearrange your data in this way so that your plotting functions become easier. In this exercise you will use functions from the tidyr package to convert iris to iris.tidy.

The resulting iris.tidy data should look as follows:

  Species  Part Measure Value
1  setosa Sepal  Length   5.1
2  setosa Sepal  Length   4.9
3  setosa Sepal  Length   4.7
4  setosa Sepal  Length   4.6
5  setosa Sepal  Length   5.0
6  setosa Sepal  Length   5.4
...

You can have a look at the iris dataset by typing head(iris) in the Console.

If you’re not familiar with %>%, gather() and separate(), keep reading. In a nutshell, a dataset is called tidy when every row is an observation and every column is a variable. The gather() function moves information from the columns to the rows. It takes multiple columns and gathers them into a single column by adding rows. The separate() function splits one column into two or more columns according to a pattern you define. Lastly, the %>% (or “pipe”) operator passes the result of the left-hand side as the first argument of the function on the right-hand side.
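To make the three pieces concrete, here is a minimal sketch on a made-up two-row data frame. The names toy, toy.long and toy.tidy are hypothetical and the values are for illustration only. Newer tidyr releases offer pivot_longer() as the successor to gather(), but gather() still works and is what this assignment uses.

```r
library(tidyr)

# Hypothetical toy data: one row per flower, two measurement columns
toy <- data.frame(
  Species = c("setosa", "virginica"),
  Sepal.Length = c(5.1, 6.3),
  Sepal.Width = c(3.5, 3.3)
)

# gather(): every column except Species is stacked into key/Value pairs,
# so 2 rows x 2 measurement columns become 4 rows
toy.long <- toy %>% gather(key, Value, -Species)

# separate(): split "Sepal.Length" into "Sepal" and "Length" at the period
toy.tidy <- toy.long %>% separate(key, c("Part", "Measure"), "\\.")

# %>%: the left-hand side becomes the first argument of the right-hand side
same.result <- identical(toy %>% head(1), head(toy, 1))
```

Printing toy, toy.long and toy.tidy one after another shows how the column headers move into the rows and are then split in two.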

You’ll use two functions from the tidyr package. Make sure you have installed the tidyr package.

gather() rearranges the data frame; the columns that are categorical variables are excluded from gathering with the - notation. Complete the command in the R code chunk below. Notice that only one variable is categorical in iris.
separate() splits up the new key column, which contains the former headers, at the period (.). The new column names “Part” and “Measure” are given in a character vector. Don’t forget the quotes.

# Load the tidyr package
library(tidyr)

# Fill in the ___ to produce the correct iris.tidy dataset
# (datasets::iris is the unmodified data, without any added Flower column)
iris.tidy <- datasets::iris %>%
  gather(key, Value, -Species) %>%
  separate(key, c("Part", "Measure"), "\\.")
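If iris already carries the Flower column that is added for iris.wide, gather() collects that column too, and separate() then warns that the key Flower cannot be split into two pieces. The sketch below shows the difference in row counts under that assumption; iris.flower, with.flower and without.flower are hypothetical names, and dropping the column by name is only one possible way to avoid the problem.

```r
library(tidyr)

# Hypothetical copy of the built-in data with the identifier column added
iris.flower <- datasets::iris
iris.flower$Flower <- 1:nrow(iris.flower)

# Excluding only Species gathers five columns, Flower included
with.flower <- iris.flower %>% gather(key, Value, -Species)

# Dropping Flower first leaves the four measurement columns
without.flower <- iris.flower[, names(iris.flower) != "Flower"] %>%
  gather(key, Value, -Species)

# Rows per gathered column: 150 each
table(with.flower$key)
```

The extra 150 rows all have the key Flower, which contains no period for separate() to split on.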

4. Take a look at another plot variant, shown in Figure 2 of the Word document associated with this assignment.

a. Which form of the dataset would be the most appropriate to use to create the graph shown in Figure 2? Remember, you can use str() to look at the structures of the different data frames. Choose one answer from: iris, iris.wide, iris.wide2 or iris.tidy. State one and only one data frame.

Hint: working through the parts below will help you answer part a.

b. Look at the heads of iris, iris.wide and iris.tidy using head(). Fill in the ggplot function with the appropriate data frame and variable names. The variables mapped to the aesthetics of the plot are columns of your dataset, so the head() output will help you match the column names to what the plot shows.

library(ggplot2)

# Think about which dataset you would use to get the plot in Figure 2
# Fill in the ___ to produce the plot in Figure 2
ggplot(iris.wide, aes(x = Length, y = Width, color = Part)) +
  geom_jitter() +
  facet_grid(. ~ Species)

c. In part b, you saw how iris.wide was used to make a specific plot. You also saw previously how you can derive iris.tidy from iris. Now you will move on to produce iris.wide.

The head of the iris.wide should look like this in the end:

  Species  Part Length Width
1  setosa Petal    1.4   0.2
2  setosa Petal    1.4   0.2
3  setosa Petal    1.3   0.2
4  setosa Petal    1.5   0.2
5  setosa Petal    1.4   0.2
6  setosa Petal    1.7   0.4
...

You can have a look at the iris dataset by typing head(iris) in the console.

Before you begin, you need to add a new column called Flower that contains a unique identifier for each row in the data frame. This is because you’ll rearrange the data frame afterwards and you need to keep track of which row, or which specific flower, each value came from. It’s done for you, no need to add anything yourself.

gather() rearranges the data frame; the columns that are categorical variables are excluded from gathering with the - notation. In this case, Species and Flower are treated as categorical. Complete the command in the R code chunk below.

separate() splits up the new key column, which contains the former headers, at the period (.). The new column names “Part” and “Measure” are given in a character vector.

The last step is to use spread() to distribute the new Measure column and associated value column into two columns.
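The role of the Flower identifier is easiest to see on a small example. The sketch below uses a made-up data frame with two flowers of the same species (toy, no.id and toy.wide are hypothetical names). Without an identifier the two flowers share every Species and Part combination, so spread() cannot tell their values apart and stops with an error; with the identifier the reshaping goes through.

```r
library(tidyr)

# Hypothetical toy data: two flowers of the same species
toy <- data.frame(
  Species = c("setosa", "setosa"),
  Sepal.Length = c(5.1, 4.9),
  Sepal.Width = c(3.5, 3.0),
  Petal.Length = c(1.4, 1.4),
  Petal.Width = c(0.2, 0.2)
)

# Without an identifier, rows are not uniquely identified and spread() fails
no.id <- toy %>%
  gather(key, value, -Species) %>%
  separate(key, c("Part", "Measure"), "\\.")
spread.failed <- inherits(try(spread(no.id, Measure, value), silent = TRUE),
                          "try-error")

# With a Flower identifier, each flower/part pair becomes one output row
toy$Flower <- 1:nrow(toy)
toy.wide <- toy %>%
  gather(key, value, -Flower, -Species) %>%
  separate(key, c("Part", "Measure"), "\\.") %>%
  spread(Measure, value)
```

In toy.wide each flower contributes one Petal row and one Sepal row, with Length and Width side by side.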

# Load the tidyr package
library(tidyr)

# Add column with unique ids (don't need to change)
iris$Flower <- 1:nrow(iris)

# Fill in the ___ to produce the correct iris.wide dataset
iris.wide <- iris %>%
  gather(key, value, -Flower, -Species) %>%
  separate(key, c("Part", "Measure"), "\\.") %>%
  spread(Measure, value)

Tanya Mohte

2018-09-30