Data wrangling

In this chapter, you’ll learn to do three things with a table: filter for particular observations, arrange the observations in a desired order, and mutate to add or change a column. You’ll see how each of these steps lets you answer questions about your data.

Loading the gapminder and dplyr packages

Before you can work with the gapminder dataset, you’ll need to load two R packages that contain the tools for working with it, then display the gapminder dataset so that you can see what it contains.

To your right, you’ll see two windows in which you can enter code: the script.R window and the R Console. All of the code you write to solve an exercise must go in script.R.

If you hit Submit Answer, your R script is executed and the output is shown in the R Console. DataCamp checks whether your submission is correct and gives you feedback. You can hit Submit Answer as often as you want. If you’re stuck, you can ask for a hint or a solution.

You can use the R Console interactively by simply typing R code and hitting Enter. When you work in the console directly, your code will not be checked for correctness so it is a great way to experiment and explore.

Use the library() function to load the dplyr package, just like we’ve loaded the gapminder package for you. Type gapminder, on its own line, to look at the gapminder dataset.

HINT You would load the dplyr package with library(dplyr), just like gapminder.
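
A minimal sketch of this step, assuming both the gapminder and dplyr packages are already installed:

```r
# Load the gapminder package (provides the gapminder dataset)
library(gapminder)

# Load the dplyr package (provides filter(), arrange(), mutate(), and %>%)
library(dplyr)

# Look at the gapminder dataset
gapminder
```

Typing the name of the dataset on its own line prints it to the console, which is a quick way to check which columns are available.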

Filtering for one year

The filter verb extracts particular observations based on a condition. In this exercise you’ll filter for observations from a particular year.

Add a filter() line after the pipe (%>%) to extract only the observations from the year 1957. Remember that you use == to compare two values.

HINT The condition within the filter() will be year == 1957.
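
One way to write this step, sketched under the assumption that both packages are installed:

```r
library(gapminder)
library(dplyr)

# Filter the gapminder dataset for the year 1957
gapminder %>%
  filter(year == 1957)
```

The double equals sign (==) tests for equality; a single = would be read as naming an argument rather than comparing values.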

Filtering for one country and one year

You can also use the filter() verb to set two conditions, which could retrieve a single observation.

Just like in the last exercise, you can do this in two lines of code, starting with gapminder %>% and having the filter() on the second line. Keeping one verb on each line helps keep the code readable. Note that each time, you’ll put the pipe %>% at the end of the first line (like gapminder %>%); putting the pipe at the beginning of the second line will throw an error.

Filter the gapminder data to retrieve only the observation from China in the year 2002.

HINT You’ll need to provide two conditions inside the filter(), separated by a comma. The year condition will be similar to the last exercise, and remember to put “China” in quotes.
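
A sketch of the two-condition filter, assuming both packages are installed. Conditions separated by a comma must all be true for a row to be kept:

```r
library(gapminder)
library(dplyr)

# Filter for China in 2002: both conditions must hold
gapminder %>%
  filter(country == "China", year == 2002)
```

Because each country appears once per year in this dataset, the result is a single row.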

Arranging observations by life expectancy

You use arrange() to sort observations in ascending or descending order of a particular variable. In this case, you’ll sort the dataset based on the lifeExp variable.

Sort the gapminder dataset in ascending order of life expectancy (lifeExp). Sort the gapminder dataset in descending order of life expectancy.

HINT To specify that you want the output in descending order, use desc(lifeExp) inside the arrange().
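
Both sorts can be sketched as follows, assuming both packages are installed. arrange() sorts in ascending order by default, and wrapping the column in desc() reverses that:

```r
library(gapminder)
library(dplyr)

# Sort in ascending order of lifeExp (the default)
gapminder %>%
  arrange(lifeExp)

# Sort in descending order of lifeExp
gapminder %>%
  arrange(desc(lifeExp))
```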

Filtering and arranging

You’ll often need to use the pipe operator (%>%) to combine multiple dplyr verbs in a row. In this case, you’ll combine a filter() with an arrange() to find the highest population countries in a particular year.

Use filter() to extract observations from just the year 1957, then use arrange() to sort in descending order of population (pop).

HINT Right after the filter(year == 1957), you should immediately have another %>%, followed by your arrange() step.
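
A sketch of the two verbs chained together, assuming both packages are installed. Each %>% passes the result of the previous step into the next verb:

```r
library(gapminder)
library(dplyr)

# Filter for the year 1957, then arrange in descending order of population
gapminder %>%
  filter(year == 1957) %>%
  arrange(desc(pop))
```

The first rows of the output are then the most populous countries in 1957.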

Using mutate to change or create a column

Suppose you want life expectancy to be measured in months instead of years: you’d multiply the existing value by 12. You can use the mutate() verb to change this column in place, or to create a new column that’s calculated this way.

Use mutate() to change the existing lifeExp column, by multiplying it by 12: 12 * lifeExp. Use mutate() to add a new column, called lifeExpMonths, calculated as 12 * lifeExp.

HINT The code in the first mutate should be lifeExp = 12 * lifeExp. The second is similar, but with a different column name on the left side of the equals.
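
Both variants can be sketched like this, assuming both packages are installed. The name on the left of the equals sign decides whether an existing column is overwritten or a new one is added:

```r
library(gapminder)
library(dplyr)

# Use mutate to change lifeExp to be in months (overwrites the column)
gapminder %>%
  mutate(lifeExp = 12 * lifeExp)

# Use mutate to create a new column called lifeExpMonths
gapminder %>%
  mutate(lifeExpMonths = 12 * lifeExp)
```

Note that mutate() returns a new table; the gapminder object itself stays unchanged unless you assign the result back to a name.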

Combining filter, mutate, and arrange

In this exercise, you’ll combine all three of the verbs you’ve learned in this chapter, to find the countries with the highest life expectancy, in months, in the year 2007.

In one sequence of pipes on the gapminder dataset: filter() for observations from the year 2007, then mutate() to create a column lifeExpMonths, calculated as 12 * lifeExp, and finally arrange() in descending order of that new column.

HINT You’ve done each of these individually before: just make sure that you put a %>% after each step. For example, the first step looks like filter(year == 2007) %>%.
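
A sketch of the full pipeline, assuming both packages are installed. The order matters here: arrange() can only sort by lifeExpMonths after mutate() has created it.

```r
library(gapminder)
library(dplyr)

# Filter, mutate, and arrange the gapminder dataset
gapminder %>%
  filter(year == 2007) %>%
  mutate(lifeExpMonths = 12 * lifeExp) %>%
  arrange(desc(lifeExpMonths))
```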

