dplyr

One really useful R package is dplyr. This package was developed to make data wrangling simple and easy. Data wrangling is a crucial part of doing statistics or any data analysis task. I’d estimate that up to 80% of the time I spend with data is just manipulation, cleaning, and QA/QC tasks. A lot of this can be monotonous and boring, but thankfully dplyr provides some very useful functionality for accomplishing data manipulation tasks easily and efficiently.
dplyr is part of the tidyverse, a collection of R packages that share the same underlying philosophy about data structures and coding grammar. I am not personally familiar with all of the tidyverse packages yet, but I have found that I really like the ones I’ve used. Later, when we get to plotting, we will work a lot with another tidyverse package, ggplot2. If you are new to R, I highly recommend becoming familiar with the tidyverse philosophy of data science. Unfortunately, I started learning R just before the tidyverse really took off in popularity, so I still sometimes fall back into old bad habits. But I am trying to retrain my brain to work within the tidyverse philosophy whenever possible. You can learn more about the tidyverse philosophy here. For convenience, though, I think it will be helpful to tell you the core tenets of the philosophy: store your data in a “tidy” format, where each variable forms a column, each observation forms a row, and each value has its own cell.

And it’s that simple. Storing your data in a tidy format facilitates analysis. It’s a good habit to get into early in your career, so do yourself a favor!
If you haven’t already, install the dplyr package and call it up with library().
library(readr) #You'll need to install this package to access data from my GitHub account
library(dplyr)
library(ggplot2)
Now let’s load the dataset we are going to use for this exercise. I found a cool dataset collected by the U.N. on world happiness. I gathered data from the World Happiness Report from 2015 to 2019, which includes happiness scores from 150+ countries and a variety of other covariates, like GDP per capita, perceptions of corruption in government, and the quality of social support networks (friends and family). To read more about the methodology and for the full reports, you can look here.
x <- "https://raw.githubusercontent.com/kvistj/R-tips/master/world_happiness_UN.csv"
happiness <- read_csv(url(x))
## Parsed with column specification:
## cols(
## Year = col_double(),
## Country = col_character(),
## Region = col_character(),
## Score = col_double(),
## GDPperCapita = col_double(),
## SocialSupport = col_double(),
## HealthyLifeExp = col_double(),
## LifeChoiceFreedom = col_double(),
## Generosity = col_double(),
## PerceptionCorrupt = col_character()
## )
happiness <- as_tibble(happiness)
glimpse(happiness)
## Observations: 782
## Variables: 10
## $ Year <dbl> 2019, 2019, 2019, 2019, 2019, 2019, 2019, 20...
## $ Country <chr> "Finland", "Denmark", "Norway", "Iceland", "...
## $ Region <chr> "Europe", "Europe", "Europe", "Europe", "Eur...
## $ Score <dbl> 7.769, 7.600, 7.554, 7.494, 7.488, 7.480, 7....
## $ GDPperCapita <dbl> 1.340, 1.383, 1.488, 1.380, 1.396, 1.452, 1....
## $ SocialSupport <dbl> 1.587, 1.573, 1.582, 1.624, 1.522, 1.526, 1....
## $ HealthyLifeExp <dbl> 0.986, 0.996, 1.028, 1.026, 0.999, 1.052, 1....
## $ LifeChoiceFreedom <dbl> 0.596, 0.592, 0.603, 0.591, 0.557, 0.572, 0....
## $ Generosity <dbl> 0.153, 0.252, 0.271, 0.354, 0.322, 0.263, 0....
## $ PerceptionCorrupt <chr> "0.393", "0.41", "0.341", "0.118", "0.298", ...
Notice the first thing I did was convert the happiness dataset into a tibble. What is a tibble? Tibbles are a special kind of data frame that dplyr is meant to be used with. Basically, a tibble is just a simpler data frame. They are designed to keep important features of the original data as imported, meaning they do not convert data types automatically (e.g., converting character strings to factors, which data.frame() did by default before R 4.0). You can also use non-standard names for your columns that include special symbols or spaces, which R normally throws fits about. In practice there is not much of a difference between working with tibbles and data frames, but the idea behind the tibble is to force you to confront problems with your data early and efficiently, preventing headaches later on during analyses.
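Here is a minimal illustration of that difference; note that the factor conversion shown for data.frame() assumes R < 4.0, where stringsAsFactors defaulted to TRUE (tibble() is re-exported by dplyr, so no extra package is needed).
df <- data.frame(animal = c("cat", "dog")) #Pre-R 4.0, animal silently becomes a factor
class(df$animal) #"factor" on R < 4.0, "character" on R >= 4.0
tb <- tibble(animal = c("cat", "dog")) #tibble() never converts the type
class(tb$animal) #"character" always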
The glimpse() function I just called is similar to the str() function you met in my last post. glimpse() is part of dplyr and it just shows the structure of your data, as well as the first several values for each variable.
There appears to be a problem with the PerceptionCorrupt column. It looks like R read it in as a character variable. That’s clearly wrong. We want it to be a numeric column instead. Correcting this issue is simple.
happiness$PerceptionCorrupt <- as.numeric(happiness$PerceptionCorrupt)
## Warning: NAs introduced by coercion
glimpse(happiness)
## Observations: 782
## Variables: 10
## $ Year <dbl> 2019, 2019, 2019, 2019, 2019, 2019, 2019, 20...
## $ Country <chr> "Finland", "Denmark", "Norway", "Iceland", "...
## $ Region <chr> "Europe", "Europe", "Europe", "Europe", "Eur...
## $ Score <dbl> 7.769, 7.600, 7.554, 7.494, 7.488, 7.480, 7....
## $ GDPperCapita <dbl> 1.340, 1.383, 1.488, 1.380, 1.396, 1.452, 1....
## $ SocialSupport <dbl> 1.587, 1.573, 1.582, 1.624, 1.522, 1.526, 1....
## $ HealthyLifeExp <dbl> 0.986, 0.996, 1.028, 1.026, 0.999, 1.052, 1....
## $ LifeChoiceFreedom <dbl> 0.596, 0.592, 0.603, 0.591, 0.557, 0.572, 0....
## $ Generosity <dbl> 0.153, 0.252, 0.271, 0.354, 0.322, 0.263, 0....
## $ PerceptionCorrupt <dbl> 0.393, 0.410, 0.341, 0.118, 0.298, 0.343, 0....
It looks like R has coerced some of the values in happiness$PerceptionCorrupt to NAs. We will talk about how to deal with those later.
Without further ado, let’s get into some of the main features of the dplyr package.
One of the most common data manipulation tasks is filtering. Most of the time, our datasets are large and we may only be interested in one group within our data at a time. The dplyr package has a great function for helping us filter our data easily. All we have to do is call the filter() function, which will select cases in our data based on the condition we specify. For example, what if I was only interested in European countries from the happiness dataset?
happiness %>%
filter(Region == "Europe") #R uses double equals signs when specifying conditions
## # A tibble: 220 x 10
## Year Country Region Score GDPperCapita SocialSupport HealthyLifeExp
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 2019 Finland Europe 7.77 1.34 1.59 0.986
## 2 2019 Denmark Europe 7.6 1.38 1.57 0.996
## 3 2019 Norway Europe 7.55 1.49 1.58 1.03
## 4 2019 Iceland Europe 7.49 1.38 1.62 1.03
## 5 2019 Nether~ Europe 7.49 1.40 1.52 0.999
## 6 2019 Switze~ Europe 7.48 1.45 1.53 1.05
## 7 2019 Sweden Europe 7.34 1.39 1.49 1.01
## 8 2019 Austria Europe 7.25 1.38 1.48 1.02
## 9 2019 Luxemb~ Europe 7.09 1.61 1.48 1.01
## 10 2019 United~ Europe 7.05 1.33 1.54 0.996
## # ... with 210 more rows, and 3 more variables: LifeChoiceFreedom <dbl>,
## # Generosity <dbl>, PerceptionCorrupt <dbl>
An important concept common across all of the tidyverse packages is the pipe operator %>%. The pipe %>% is basically a symbol that tells R to chain chunks of code together. Semantically, the code I wrote above could be thought of as “with the happiness dataset, return cases where region equals Europe”. The pipe %>% is equivalent to the word ‘with’ in my sentence. It is the link that connects the object happiness to the action filter().
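To make that concrete: the pipe passes the object on its left as the first argument of the function on its right, so these two calls do exactly the same thing.
happiness %>%
  filter(Region == "Europe")
filter(happiness, Region == "Europe") #Identical result, just without the pipe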
Now it is important to note that R will perform these operations, but unless I save the result in a new object, the change is not permanent. For example, if I call up the happiness dataset again, it is still unfiltered.
happiness
## # A tibble: 782 x 10
## Year Country Region Score GDPperCapita SocialSupport HealthyLifeExp
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 2019 Finland Europe 7.77 1.34 1.59 0.986
## 2 2019 Denmark Europe 7.6 1.38 1.57 0.996
## 3 2019 Norway Europe 7.55 1.49 1.58 1.03
## 4 2019 Iceland Europe 7.49 1.38 1.62 1.03
## 5 2019 Nether~ Europe 7.49 1.40 1.52 0.999
## 6 2019 Switze~ Europe 7.48 1.45 1.53 1.05
## 7 2019 Sweden Europe 7.34 1.39 1.49 1.01
## 8 2019 New Ze~ Ocean~ 7.31 1.30 1.56 1.03
## 9 2019 Canada North~ 7.28 1.36 1.50 1.04
## 10 2019 Austria Europe 7.25 1.38 1.48 1.02
## # ... with 772 more rows, and 3 more variables: LifeChoiceFreedom <dbl>,
## # Generosity <dbl>, PerceptionCorrupt <dbl>
#If I wanted to save a subsetted version of the data, I simply store the result in a new object called 'EuropeanCountries'
EuropeanCountries <- happiness %>%
filter(Region == "Europe")
EuropeanCountries
## # A tibble: 220 x 10
## Year Country Region Score GDPperCapita SocialSupport HealthyLifeExp
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 2019 Finland Europe 7.77 1.34 1.59 0.986
## 2 2019 Denmark Europe 7.6 1.38 1.57 0.996
## 3 2019 Norway Europe 7.55 1.49 1.58 1.03
## 4 2019 Iceland Europe 7.49 1.38 1.62 1.03
## 5 2019 Nether~ Europe 7.49 1.40 1.52 0.999
## 6 2019 Switze~ Europe 7.48 1.45 1.53 1.05
## 7 2019 Sweden Europe 7.34 1.39 1.49 1.01
## 8 2019 Austria Europe 7.25 1.38 1.48 1.02
## 9 2019 Luxemb~ Europe 7.09 1.61 1.48 1.01
## 10 2019 United~ Europe 7.05 1.33 1.54 0.996
## # ... with 210 more rows, and 3 more variables: LifeChoiceFreedom <dbl>,
## # Generosity <dbl>, PerceptionCorrupt <dbl>
How about if we wanted to compare European countries to North American countries and needed a subset of our data that only included those two regions?
happiness %>%
filter(Region %in% c("Europe", "North America"))
## # A tibble: 236 x 10
## Year Country Region Score GDPperCapita SocialSupport HealthyLifeExp
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 2019 Finland Europe 7.77 1.34 1.59 0.986
## 2 2019 Denmark Europe 7.6 1.38 1.57 0.996
## 3 2019 Norway Europe 7.55 1.49 1.58 1.03
## 4 2019 Iceland Europe 7.49 1.38 1.62 1.03
## 5 2019 Nether~ Europe 7.49 1.40 1.52 0.999
## 6 2019 Switze~ Europe 7.48 1.45 1.53 1.05
## 7 2019 Sweden Europe 7.34 1.39 1.49 1.01
## 8 2019 Canada North~ 7.28 1.36 1.50 1.04
## 9 2019 Austria Europe 7.25 1.38 1.48 1.02
## 10 2019 Luxemb~ Europe 7.09 1.61 1.48 1.01
## # ... with 226 more rows, and 3 more variables: LifeChoiceFreedom <dbl>,
## # Generosity <dbl>, PerceptionCorrupt <dbl>
The %in% operator tells R to filter the happiness dataset using the region names we specify IN the Region column. The way I like to think about it is “Within the Region column, find and return cases for both Europe and North America”.
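For what it’s worth, the same subset can be written with the OR operator |; %in% just scales better as the list of values grows.
happiness %>%
  filter(Region == "Europe" | Region == "North America") #Equivalent to the %in% version above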
Okay, but what if you wanted to filter your data based on conditions applied to more than one variable? Let’s look at an example.
happiness %>%
filter(Region == "Europe" & SocialSupport <= 1)
## # A tibble: 29 x 10
## Year Country Region Score GDPperCapita SocialSupport HealthyLifeExp
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 2019 Albania Europe 4.72 0.947 0.848 0.874
## 2 2018 Albania Europe 4.59 0.916 0.817 0.79
## 3 2017 Croatia Europe 5.29 1.22 0.968 0.701
## 4 2017 Albania Europe 4.64 0.996 0.804 0.731
## 5 2016 Saudi ~ Europe 6.38 1.49 0.848 0.593
## 6 2016 Moldova Europe 5.90 0.692 0.831 0.523
## 7 2016 Northe~ Europe 5.77 1.31 0.818 0.841
## 8 2016 Latvia Europe 5.56 1.22 0.950 0.640
## 9 2016 Cyprus Europe 5.55 1.32 0.707 0.849
## 10 2016 Romania Europe 5.53 1.17 0.728 0.676
## # ... with 19 more rows, and 3 more variables: LifeChoiceFreedom <dbl>,
## # Generosity <dbl>, PerceptionCorrupt <dbl>
This statement includes the & operator, which tells R to return cases that satisfy two conditions at once. In this case, I asked for countries that are in Europe AND have SocialSupport scores less than or equal to 1.
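Incidentally, filter() treats comma-separated conditions as an implicit AND, so this shorthand gives the same result.
happiness %>%
  filter(Region == "Europe", SocialSupport <= 1) #Commas between conditions act like &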
What if we wanted to see everything EXCEPT a certain subset of our data? Rather than laying out a long list of filtering conditions, we can just tell R to drop the cases we don’t want. For example, suppose we weren’t that interested in countries with a high GDP. We could tell R to drop cases with a GDP per capita above a certain value, say 1.2. We do the same as we did before with the SocialSupport condition, but now we need a new operator: the negation operator !.
happiness %>%
filter(!GDPperCapita > 1.2)
## # A tibble: 561 x 10
## Year Country Region Score GDPperCapita SocialSupport HealthyLifeExp
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 2019 Costa ~ Centr~ 7.17 1.03 1.44 0.963
## 2 2019 Mexico North~ 6.60 1.07 1.32 0.861
## 3 2019 Chile South~ 6.44 1.16 1.37 0.92
## 4 2019 Guatem~ Centr~ 6.44 0.8 1.27 0.746
## 5 2019 Panama Centr~ 6.32 1.15 1.44 0.91
## 6 2019 Brazil South~ 6.3 1.00 1.44 0.802
## 7 2019 Uruguay South~ 6.29 1.12 1.46 0.891
## 8 2019 El Sal~ Centr~ 6.25 0.794 1.24 0.789
## 9 2019 Uzbeki~ Europe 6.17 0.745 1.53 0.756
## 10 2019 Colomb~ South~ 6.12 0.985 1.41 0.841
## # ... with 551 more rows, and 3 more variables: LifeChoiceFreedom <dbl>,
## # Generosity <dbl>, PerceptionCorrupt <dbl>
There is much more you can do with filter() and your operations can get quite complex, but I am going to move on now and focus on another key function, arrange(). arrange() works by reordering your data by whichever columns you specify.
happiness %>%
arrange(Score)
## # A tibble: 782 x 10
## Year Country Region Score GDPperCapita SocialSupport HealthyLifeExp
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 2017 Centra~ Africa 2.69 0 0 0.0188
## 2 2015 Togo Africa 2.84 0.209 0.140 0.284
## 3 2019 South ~ Africa 2.85 0.306 0.575 0.295
## 4 2017 Burundi Africa 2.90 0.0916 0.630 0.152
## 5 2018 Burundi Africa 2.90 0.091 0.627 0.145
## 6 2016 Burundi Africa 2.90 0.0683 0.234 0.157
## 7 2015 Burundi Africa 2.90 0.0153 0.416 0.224
## 8 2015 Syria Middl~ 3.01 0.663 0.475 0.722
## 9 2016 Syria Middl~ 3.07 0.747 0.149 0.630
## 10 2019 Centra~ Africa 3.08 0.026 0 0.105
## # ... with 772 more rows, and 3 more variables: LifeChoiceFreedom <dbl>,
## # Generosity <dbl>, PerceptionCorrupt <dbl>
Now the data is arranged in ascending order based on their happiness score. To get the descending order, I just need to include desc() within arrange().
happiness %>%
arrange(desc(Score))
## # A tibble: 782 x 10
## Year Country Region Score GDPperCapita SocialSupport HealthyLifeExp
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 2019 Finland Europe 7.77 1.34 1.59 0.986
## 2 2018 Finland Europe 7.63 1.30 1.59 0.874
## 3 2019 Denmark Europe 7.6 1.38 1.57 0.996
## 4 2018 Norway Europe 7.59 1.46 1.58 0.861
## 5 2015 Switze~ Europe 7.59 1.40 1.35 0.941
## 6 2015 Iceland Europe 7.56 1.30 1.40 0.948
## 7 2018 Denmark Europe 7.56 1.35 1.59 0.868
## 8 2019 Norway Europe 7.55 1.49 1.58 1.03
## 9 2017 Norway Europe 7.54 1.62 1.53 0.797
## 10 2015 Denmark Europe 7.53 1.33 1.36 0.875
## # ... with 772 more rows, and 3 more variables: LifeChoiceFreedom <dbl>,
## # Generosity <dbl>, PerceptionCorrupt <dbl>
You can also arrange your data by more than one column. The order in which you specify the columns sets the sorting priority: R sorts by the first column, then breaks ties using the second, and so on.
happiness %>%
arrange(GDPperCapita, desc(Generosity))
## # A tibble: 782 x 10
## Year Country Region Score GDPperCapita SocialSupport HealthyLifeExp
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 2016 Somalia Africa 5.44 0 0.336 0.115
## 2 2017 Centra~ Africa 2.69 0 0 0.0188
## 3 2015 Congo ~ Africa 4.52 0 1.00 0.0981
## 4 2019 Somalia Africa 4.67 0 0.698 0.268
## 5 2018 Somalia Africa 4.98 0 0.712 0.115
## 6 2015 Burundi Africa 2.90 0.0153 0.416 0.224
## 7 2015 Malawi Africa 4.29 0.0160 0.411 0.226
## 8 2017 Somalia Africa 5.15 0.0226 0.721 0.114
## 9 2018 Centra~ Africa 3.08 0.024 0 0.01
## 10 2019 Centra~ Africa 3.08 0.026 0 0.105
## # ... with 772 more rows, and 3 more variables: LifeChoiceFreedom <dbl>,
## # Generosity <dbl>, PerceptionCorrupt <dbl>
You can also combine arrange() with group_by(). One caveat: arrange() ignores grouping by default, so the call below records the Region grouping but still sorts all rows globally. To sort within each group, arrange() takes a .by_group argument (a within-group version is sketched after the output below).
happiness %>%
group_by(Region) %>%
arrange(desc(HealthyLifeExp))
## # A tibble: 782 x 10
## # Groups: Region [9]
## Year Country Region Score GDPperCapita SocialSupport HealthyLifeExp
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 2019 Singap~ Asia 6.26 1.57 1.46 1.14
## 2 2019 Hong K~ Asia 5.43 1.44 1.28 1.12
## 3 2019 Japan Asia 5.89 1.33 1.42 1.09
## 4 2019 Spain Europe 6.35 1.29 1.48 1.06
## 5 2019 Switze~ Europe 7.48 1.45 1.53 1.05
## 6 2019 France Europe 6.59 1.32 1.47 1.04
## 7 2019 Cyprus Europe 6.05 1.26 1.22 1.04
## 8 2019 Northe~ Europe 5.72 1.26 1.25 1.04
## 9 2019 Canada North~ 7.28 1.36 1.50 1.04
## 10 2019 Italy Europe 6.22 1.29 1.49 1.04
## # ... with 772 more rows, and 3 more variables: LifeChoiceFreedom <dbl>,
## # Generosity <dbl>, PerceptionCorrupt <dbl>
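Here is that within-group version, assuming dplyr >= 0.8 (the release that introduced .by_group):
happiness %>%
  group_by(Region) %>%
  arrange(desc(HealthyLifeExp), .by_group = TRUE) #Sorts by Region first, then by HealthyLifeExp within each Region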
Slicing lets you select rows based on their integer locations. So if you knew the row positions for a group of data you wanted to get, you could use slice().
happiness %>%
slice(15:30)
## # A tibble: 16 x 10
## Year Country Region Score GDPperCapita SocialSupport HealthyLifeExp
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 2019 United~ Europe 7.05 1.33 1.54 0.996
## 2 2019 Ireland Europe 7.02 1.50 1.55 0.999
## 3 2019 Germany Europe 6.98 1.37 1.45 0.987
## 4 2019 Belgium Europe 6.92 1.36 1.50 0.986
## 5 2019 United~ North~ 6.89 1.43 1.46 0.874
## 6 2019 Czech ~ Europe 6.85 1.27 1.49 0.92
## 7 2019 United~ Middl~ 6.82 1.50 1.31 0.825
## 8 2019 Malta Europe 6.73 1.3 1.52 0.999
## 9 2019 Mexico North~ 6.60 1.07 1.32 0.861
## 10 2019 France Europe 6.59 1.32 1.47 1.04
## 11 2019 Taiwan Asia 6.45 1.37 1.43 0.914
## 12 2019 Chile South~ 6.44 1.16 1.37 0.92
## 13 2019 Guatem~ Centr~ 6.44 0.8 1.27 0.746
## 14 2019 Saudi ~ Europe 6.38 1.40 1.36 0.795
## 15 2019 Qatar Middl~ 6.37 1.68 1.31 0.871
## 16 2019 Spain Europe 6.35 1.29 1.48 1.06
## # ... with 3 more variables: LifeChoiceFreedom <dbl>, Generosity <dbl>,
## # PerceptionCorrupt <dbl>
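As an aside, newer dplyr releases (>= 1.0.0) add ordered slice helpers, which are often handier than working out row positions by hand:
happiness %>%
  slice_max(Score, n = 5) #Returns the five rows with the highest happiness scores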
A particularly useful feature of dplyr is the ability to randomly sample from your data. This might be especially helpful for getting sample data to train an algorithm, for instance. Or for evaluating the performance of a model with some kind of cross-validation technique. Or for generating bootstrapped confidence intervals, perhaps.
happiness %>%
sample_n(size = 10) #Randomly samples 10 rows
## # A tibble: 10 x 10
## Year Country Region Score GDPperCapita SocialSupport HealthyLifeExp
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 2016 Congo ~ Africa 4.27 0.0566 0.807 0.188
## 2 2017 Angola Africa 3.80 0.858 1.10 0.0499
## 3 2017 Switze~ Europe 7.49 1.56 1.52 0.858
## 4 2019 Zambia Africa 4.11 0.578 1.06 0.426
## 5 2019 Italy Europe 6.22 1.29 1.49 1.04
## 6 2016 Romania Europe 5.53 1.17 0.728 0.676
## 7 2018 Malawi Africa 3.59 0.186 0.541 0.306
## 8 2018 Iraq Middl~ 4.46 1.01 0.971 0.536
## 9 2016 Burundi Africa 2.90 0.0683 0.234 0.157
## 10 2015 Moldova Europe 5.89 0.594 1.02 0.618
## # ... with 3 more variables: LifeChoiceFreedom <dbl>, Generosity <dbl>,
## # PerceptionCorrupt <dbl>
happiness %>%
sample_frac(size = 0.3) #Randomly samples 30% of the data
## # A tibble: 235 x 10
## Year Country Region Score GDPperCapita SocialSupport HealthyLifeExp
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 2015 Senegal Africa 3.90 0.365 0.976 0.435
## 2 2017 Haiti Carri~ 3.60 0.369 0.640 0.277
## 3 2019 Cambod~ Asia 4.7 0.574 1.12 0.637
## 4 2015 Guinea Africa 3.66 0.174 0.465 0.240
## 5 2019 Poland Europe 6.18 1.21 1.44 0.884
## 6 2016 Russia Europe 5.86 1.23 1.05 0.590
## 7 2018 Botswa~ Africa 3.59 1.02 1.17 0.417
## 8 2017 New Ze~ Ocean~ 7.31 1.41 1.55 0.817
## 9 2019 Algeria Africa 5.21 1.00 1.16 0.785
## 10 2017 Rwanda Africa 3.47 0.369 0.946 0.326
## # ... with 225 more rows, and 3 more variables: LifeChoiceFreedom <dbl>,
## # Generosity <dbl>, PerceptionCorrupt <dbl>
happiness %>%
sample_frac(size = 0.3, replace = TRUE) #You can also add an argument to sample with replacement
## # A tibble: 235 x 10
## Year Country Region Score GDPperCapita SocialSupport HealthyLifeExp
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 2019 Congo ~ Africa 4.81 0.673 0.799 0.508
## 2 2019 Madaga~ Africa 3.93 0.274 0.916 0.555
## 3 2018 Portug~ Europe 5.41 1.19 1.43 0.884
## 4 2017 Argent~ South~ 6.60 1.19 1.44 0.695
## 5 2019 Thaila~ Asia 6.01 1.05 1.41 0.828
## 6 2015 Bhutan Asia 5.25 0.770 1.10 0.574
## 7 2016 Ecuador Centr~ 5.98 0.973 0.860 0.686
## 8 2019 Myanmar Asia 4.36 0.71 1.18 0.555
## 9 2019 Jamaica Carri~ 5.89 0.831 1.48 0.831
## 10 2017 Poland Europe 5.97 1.29 1.45 0.699
## # ... with 225 more rows, and 3 more variables: LifeChoiceFreedom <dbl>,
## # Generosity <dbl>, PerceptionCorrupt <dbl>
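Because these draws are random, you will get different rows every time you run the code. If you need a reproducible sample, say for a train/test split, set the seed first. (In dplyr >= 1.0.0, slice_sample() supersedes both sample_n() and sample_frac(), so you may prefer it going forward.)
set.seed(123) #Any integer works; this fixes the random number stream
happiness %>%
  sample_n(size = 10) #Now returns the same 10 rows on every run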
Selecting is similar to filtering, except it operates on columns instead of rows. When you use select(), you are asking R to find certain columns and return them. This is useful when you have a large dataset in which many variables are measured per observation, but you are only interested in a few at a time.
happiness %>%
dplyr::select(Year, Country, Score, PerceptionCorrupt)
## # A tibble: 782 x 4
## Year Country Score PerceptionCorrupt
## <dbl> <chr> <dbl> <dbl>
## 1 2019 Finland 7.77 0.393
## 2 2019 Denmark 7.6 0.41
## 3 2019 Norway 7.55 0.341
## 4 2019 Iceland 7.49 0.118
## 5 2019 Netherlands 7.49 0.298
## 6 2019 Switzerland 7.48 0.343
## 7 2019 Sweden 7.34 0.373
## 8 2019 New Zealand 7.31 0.38
## 9 2019 Canada 7.28 0.308
## 10 2019 Austria 7.25 0.226
## # ... with 772 more rows
Notice I had to prepend dplyr:: to my select command. This is because I have multiple packages loaded that each have a function called select(), so when I call select(), R is confused about which package to take the function from. The prefix dplyr:: tells R that I want the select() function from dplyr. If you ever get weird errors when trying to run a function and you’re sure you didn’t make any syntax errors, check whether a package conflict is the culprit.
I can also select variables whose names contain certain keywords.
happiness %>%
dplyr::select(Year, Country, Score, contains("Life"))
## # A tibble: 782 x 5
## Year Country Score HealthyLifeExp LifeChoiceFreedom
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 2019 Finland 7.77 0.986 0.596
## 2 2019 Denmark 7.6 0.996 0.592
## 3 2019 Norway 7.55 1.03 0.603
## 4 2019 Iceland 7.49 1.03 0.591
## 5 2019 Netherlands 7.49 0.999 0.557
## 6 2019 Switzerland 7.48 1.05 0.572
## 7 2019 Sweden 7.34 1.01 0.574
## 8 2019 New Zealand 7.31 1.03 0.585
## 9 2019 Canada 7.28 1.04 0.584
## 10 2019 Austria 7.25 1.02 0.532
## # ... with 772 more rows
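contains() is just one of several tidyselect helpers; starts_with() and ends_with() work the same way on name prefixes and suffixes.
happiness %>%
  dplyr::select(Year, Country, starts_with("Social")) #Matches SocialSupport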
Very often, we want to perform vectorized operations on our data. The mutate() function is a tool for doing just that. For example, imagine we have a dataset of survey results for some species, with columns for sampling effort and abundance. We could easily use mutate() to calculate an index of relative abundance from these two columns, as sketched below.
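Here is a minimal sketch of that idea; the surveys tibble and its effort and abundance columns are hypothetical, invented purely for illustration (tibble() is re-exported by dplyr):
surveys <- tibble(species = c("A", "B", "C"),
                  effort = c(10, 25, 18), #Hypothetical sampling effort, e.g. trap-nights
                  abundance = c(42, 60, 12)) #Hypothetical counts of individuals
surveys %>%
  mutate(CPUE = abundance / effort) #Catch per unit effort, an index of relative abundance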
But let’s look at an example with the happiness data. Suppose we want a new column, call it Gen_to_GDP, representing the ratio of generosity to GDP per capita. Perhaps this is a more informative measurement than Generosity alone because it indicates how generous the citizens of each country are relative to how much money they make.
happiness %>%
group_by(Country) %>%
mutate(Gen_to_GDP = Generosity / GDPperCapita) %>%
arrange(desc(Gen_to_GDP))
## # A tibble: 782 x 11
## # Groups: Country [168]
## Year Country Region Score GDPperCapita SocialSupport HealthyLifeExp
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 2019 Somalia Africa 4.67 0 0.698 0.268
## 2 2018 Somalia Africa 4.98 0 0.712 0.115
## 3 2017 Centra~ Africa 2.69 0 0 0.0188
## 4 2016 Somalia Africa 5.44 0 0.336 0.115
## 5 2015 Congo ~ Africa 4.52 0 1.00 0.0981
## 6 2015 Malawi Africa 4.29 0.0160 0.411 0.226
## 7 2015 Burundi Africa 2.90 0.0153 0.416 0.224
## 8 2017 Somalia Africa 5.15 0.0226 0.721 0.114
## 9 2018 Centra~ Africa 3.08 0.024 0 0.01
## 10 2019 Centra~ Africa 3.08 0.026 0 0.105
## # ... with 772 more rows, and 4 more variables: LifeChoiceFreedom <dbl>,
## # Generosity <dbl>, PerceptionCorrupt <dbl>, Gen_to_GDP <dbl>
Notice how this time I used pipes %>% to chain two statements together? This is a key feature of pipes: they let you chain multiple operations into a single command, rather than having to save each intermediate result in a new object. Elegant use of pipes can improve the readability of your code and reduce the number of objects you need to keep track of during your R session.
If you’re interested in creating a new tibble entirely out of the mutated data, you can just use transmute().
most_generous <- happiness %>%
group_by(Country) %>%
transmute(Gen_to_GDP = Generosity / GDPperCapita) %>%
arrange(desc(Gen_to_GDP))
most_generous
## # A tibble: 782 x 2
## # Groups: Country [168]
## Country Gen_to_GDP
## <chr> <dbl>
## 1 Somalia Inf
## 2 Somalia Inf
## 3 Central African Republic Inf
## 4 Somalia Inf
## 5 Congo (Kinshasa) Inf
## 6 Malawi 20.7
## 7 Burundi 12.9
## 8 Somalia 12.9
## 9 Central African Republic 9.08
## 10 Central African Republic 9.04
## # ... with 772 more rows
Now I’ve taken the previous operation and stored it in a new object called most_generous, simply by replacing the mutate() function with transmute(). Notice, though, that countries reporting a GDPperCapita of 0 produce a Gen_to_GDP of Inf (division by zero), which you would want to handle before any real analysis.
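If you want to drop those undefined ratios before ranking, base R’s is.finite() makes a handy filter condition:
most_generous %>%
  filter(is.finite(Gen_to_GDP)) #Drops the Inf rows created by zero GDP values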
Handling missing data can be one of the trickiest things about data analysis. If you have NAs in your data, many of the functions you might want to use will not play nicely with the NAs. In the case that you have missing data, you have several options for what to do.
You can try a complete cases analysis by using the na.omit() function. A complete cases analysis will eliminate all rows of your data in which an NA appears in any column. This may be an appropriate way to deal with your NAs, but take caution! Using the na.omit() function on your dataframe may eliminate data that you might actually want to keep. Remember, it will look row by row for NAs and remove those rows where an NA appears in any column.
The first thing we should do though, is to ask R to check for our complete.cases().
complete.cases(happiness)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [12] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [23] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [34] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [45] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [56] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [67] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [78] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [89] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [100] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [111] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [122] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [133] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [144] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [155] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [166] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
## [177] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [188] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [199] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [210] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [221] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [232] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [243] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [254] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [265] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [276] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [287] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [298] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [309] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [320] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [331] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [342] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [353] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [364] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [375] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [386] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [397] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [408] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [419] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [430] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [441] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [452] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [463] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [474] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [485] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [496] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [507] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [518] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [529] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [540] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [551] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [562] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [573] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [584] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [595] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [606] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [617] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [628] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [639] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [650] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [661] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [672] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [683] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [694] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [705] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [716] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [727] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [738] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [749] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [760] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [771] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [782] TRUE
The complete.cases() function returns a logical vector with a TRUE or FALSE for each row: TRUE if the row is complete, FALSE if it contains an NA in any column. It looks like we have an incomplete case at row 176. If we use View(happiness) and scroll down to row 176, we should find our pesky NA in the PerceptionCorrupt column. Remember our warning message from earlier?
View(happiness)
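Rather than scanning that long logical vector by eye, you can ask R directly for the positions of any incomplete rows:
which(!complete.cases(happiness)) #Returns the row indices that contain an NA; here, row 176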
Fortunately, it looks like we only have one NA. We have a couple of options for dealing with it. Our first option is just to get rid of it and pretend like it never happened.
happiness.sans.NA <- na.omit(happiness)
dim(happiness.sans.NA)
## [1] 781 10
dim(happiness) #Using dim() we confirm that the dimensions of our new data frame have changed by only one row.
## [1] 782 10
Generally, blindly getting rid of your NAs is not the best practice. But it probably won’t be the end of the world in most scenarios, especially if there are only a few NAs to deal with.
If you don’t want to get rid of your NAs, a common way to deal with them is to just replace them with the mean or median of the column vector.
happiness_replace <- happiness %>%
mutate(PerceptionCorrupt = replace(PerceptionCorrupt,
is.na(PerceptionCorrupt),
mean(PerceptionCorrupt, na.rm = T))) #Can you guess what to do if you wanted the median instead?
Now if we View(happiness_replace) and scroll down to row 176, we should see the NA has been replaced with the column mean of PerceptionCorrupt.
View(happiness_replace)
Alternatively, you could replace the NAs with a 0, or any other number of your choice. Sometimes, NAs are just replaced with a very small constant, like 0.01.
happiness_replace <- happiness %>%
mutate(PerceptionCorrupt = replace(PerceptionCorrupt,
is.na(PerceptionCorrupt), 0))
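dplyr also ships a helper for exactly this pattern: coalesce() fills the NAs in its first argument with values from its second, elementwise.
happiness %>%
  mutate(PerceptionCorrupt = coalesce(PerceptionCorrupt, 0)) #Same result as the replace() call above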
There is an entire literature dedicated to how to handle missing data. A good overview of some of the theory and fancy methods to deal with missing data can be read here for those interested.
One of the most useful tools in the dplyr package is the summarise() function. This function allows you to quickly generate summary statistics from your data in a simple and streamlined way.
happiness %>%
group_by(Region) %>%
summarise(n = n(), mean = mean(Score), sd = sd(Score)) %>%
arrange(desc(mean))
## # A tibble: 9 x 4
## Region n mean sd
## <chr> <int> <dbl> <dbl>
## 1 Oceania 10 7.29 0.0308
## 2 North America 16 7.03 0.301
## 3 Central America 38 6.18 0.552
## 4 South America 45 6.14 0.521
## 5 Europe 220 6.13 0.885
## 6 Asia 129 5.26 0.664
## 7 Carribbean 20 5.23 0.913
## 8 Middle East 86 5.12 1.12
## 9 Africa 218 4.30 0.687
Now we have generated a summary table showing the average happiness score in each region across all years, along with the standard deviation. The chunk of code n = n() tells R to count how many cases of the data went into each set of summary statistics and display the counts in a column called n. That way we know our sample sizes in addition to the means and standard deviations. We can easily generate a nice figure with just a few extra lines of code.
happiness %>%
group_by(Region) %>%
summarise(mean = mean(Score), sd = sd(Score)) %>%
ggplot() +
geom_col(aes(x = Region, y = mean, fill = Region)) +
geom_errorbar(aes(x = Region, ymin = mean - sd, ymax = mean + sd), width = 0.3) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) + #Adjusts angle of the x-axis labels
ylab("Happiness")
We will go over the ggplot2 package for making plots in more depth later on. But as you can see, ggplot2 makes it pretty easy to visualize your data for the most part. I like using geom_boxplot() along with geom_jitter() to display all of the data at once.
happiness %>%
ggplot() +
geom_boxplot(aes(x = Region, y = Score, col = Region)) +
geom_jitter(aes(x = Region, y = Score, col = Region), alpha = 0.3) + #alpha = 0.3 changes the transparency of the points
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
ylab("Happiness")
That’s all for this post. Next time, we will explore the ggplot2 package in a bit more detail.
Thanks for reading!