Setting Up Our Environment

Installing and Loading R Packages

In R, we distinguish between installing and loading packages:

Installing is like downloading an app (which only needs to be done once)

For today’s session, we’ll focus on two essential packages:

# Only run these once to install the packages
install.packages("tidyverse")  # Core data science tools
install.packages("gapminder")  # Dataset we'll use today

Note: the tidyverse is a collection of packages that share a common syntax and design philosophy. It is one of the most widely used sets of tools in R, spanning data manipulation, visualization, and (through companion packages) modeling.

There is also a free, comprehensive, and well-regarded book that accompanies it:

https://r4ds.hadley.nz/

Loading is like opening an app: it needs to be done in each new session (i.e., every time you close and reopen R) before you can use the package’s features, which are functions. Loading tells R where to find the functions you are about to call.

Now let’s load our packages:

library(tidyverse)   # Core tools for data manipulation
library(gapminder)   # Dataset about global development

If you see red text after running these commands, don’t worry! This is normal - R is just telling you what it’s loading. If you see an error message that includes “there is no package called…”, you’ll need to install the package first using the install.packages() command above.
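If you are unsure whether a package is already installed, one way to check without attaching it to your session is base R’s requireNamespace(). The sketch below is only an illustration; the object names pkg and is_installed are assumptions made for this example.

```r
# Name of the package to check (swap in any package you need)
pkg <- "gapminder"

# requireNamespace() returns TRUE if the package can be found, FALSE otherwise
is_installed <- requireNamespace(pkg, quietly = TRUE)

if (is_installed) {
  message(pkg, " is installed; load it with library(", pkg, ")")
} else {
  message(pkg, " is not installed; run install.packages(\"", pkg, "\") first")
}
```

Either way, the check itself does not make the package's functions available; library() is still needed for that.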

If you want to automate both checking whether the packages you need are installed and loading them, here’s a super useful code block where you only need to replace the package names.

# List of packages
packages <- c("tidyverse", "gapminder") # add any you need here

# Install packages if they aren't installed already
new_packages <- packages[!(packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)

# Load the packages
lapply(packages, library, character.only = TRUE)

## [[1]]
##  [1] "gapminder" "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"    
##  [7] "readr"     "tidyr"     "tibble"    "ggplot2"   "tidyverse" "stats"    
## [13] "graphics"  "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## [[2]]
##  [1] "gapminder" "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"    
##  [7] "readr"     "tidyr"     "tibble"    "ggplot2"   "tidyverse" "stats"    
## [13] "graphics"  "grDevices" "utils"     "datasets"  "methods"   "base"

Understanding R’s Basic Syntax

Before we dive into data analysis, let’s understand some fundamental concepts in R. Think of these as the building blocks we’ll use throughout the course.

Objects and Assignment

In R, we store information in objects using the assignment operator (<-):

age <- 25

This seems simple, but it can be leveraged for all kinds of tasks. For instance, we can now print what is stored in “age”.

The print() function displays the information stored in an object.

print(age)

## [1] 25

It also means we can copy the value into a new object with any name we like, and it will work the same way.

info <- age
print(info)

## [1] 25

Now let’s practice a few other examples:

# Storing text (called a "string" in programming)
name <- "Alex"

# Storing multiple numbers (called a "vector")
ages <- c(23, 45, 67, 89)

The arrow (<-) means “store the right side in the name on the left.” This is powerful because:

We can reuse values without retyping them
We can update values while keeping our code the same
We can build more involved operations step by step

Now suppose you wanted to know what is stored in the objects ‘name’ and ‘ages’. What would you do to check?

print(ages)

## [1] 23 45 67 89

print(name)

## [1] "Alex"
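The benefits of assignment listed earlier (reusing values and updating them while the code stays the same) can be sketched with a small made-up example. The objects price, quantity, and total are hypothetical and used only for illustration.

```r
# Store values once...
price <- 20
quantity <- 3

# ...and reuse them without retyping the numbers
total <- price * quantity
print(total)
## [1] 60

# Update one value; the calculation itself stays the same
price <- 25
total <- price * quantity
print(total)
## [1] 75
```

Note that total does not update by itself when price changes; the calculation has to be run again.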

Functions and Arguments

Functions perform operations on data. They follow this pattern: function_name(argument1, argument2, ...)

For example:

# Rounding a number
round(3.14159, digits = 2)

## [1] 3.14

# Finding the mean of several numbers
mean(c(1, 2, 3, 4, 5))

## [1] 3

# Getting help about a function
?mean  # Try this in your R console!

Understanding this pattern is crucial because:

Nearly everything you do in R follows this basic structure
It helps you read and write code
It makes documentation easier to understand
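As a small sketch of this pattern, arguments can be supplied by position or by name, and the output of one function can be passed directly into another. The vector values below is made up for illustration.

```r
values <- c(2.345, 7.891, 4.5)  # hypothetical numbers

# Arguments matched by position: the first is the number, the second is digits
round(3.14159, 2)
## [1] 3.14

# The same call with every argument named
round(x = 3.14159, digits = 2)
## [1] 3.14

# The output of one function can be the input of another
avg <- round(mean(values), digits = 1)
avg
## [1] 4.9
```

Naming arguments takes a little more typing, but it makes the code easier to read later on.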

Working with Real Social Science Data

Introduction to the Gapminder Dataset

For this session, we’ll work with the Gapminder dataset, which contains information about life expectancy, population, and GDP per capita for countries around the world over time. This dataset is perfect for learning because:

It contains real, meaningful social science data
It has a manageable number of variables
It includes different types of data (numbers, categories, years)
It tells stories about global development

It has also been covered extensively in other tutorials because it is a ‘clean’ dataset, or what many call a ‘learning’ dataset. There are also many resources and videos associated with it, including some found here: https://www.gapminder.org/resources/

Let’s take our first look at the data:

head(gapminder)

## # A tibble: 6 × 6
##   country     continent  year lifeExp      pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan Asia       1952    28.8  8425333      779.
## 2 Afghanistan Asia       1957    30.3  9240934      821.
## 3 Afghanistan Asia       1962    32.0 10267083      853.
## 4 Afghanistan Asia       1967    34.0 11537966      836.
## 5 Afghanistan Asia       1972    36.1 13079460      740.
## 6 Afghanistan Asia       1977    38.4 14880372      786.

Before we start analyzing, let’s understand what we’re looking at:

Each row represents one observation (in this case, one country in a specific year)
The columns (variables) are:
- country: Name of the country
- continent: Which continent the country is in
- year: The year of observation
- lifeExp: Life expectancy in years
- pop: Total population
- gdpPercap: GDP per capita

Understanding Data Structure

Before we analyze data, we need to understand its structure. R provides several useful functions for this:

# Get basic information about the dataset
str(gapminder)

## tibble [1,704 × 6] (S3: tbl_df/tbl/data.frame)
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ year     : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ lifeExp  : num [1:1704] 28.8 30.3 32 34 36.1 ...
##  $ pop      : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
##  $ gdpPercap: num [1:1704] 779 821 853 836 740 ...

This tells us:

How many rows (observations; here, 1,704) and columns (variables; here, 6) we have

# Here's another way to check
ncol(gapminder)

## [1] 6

nrow(gapminder)

## [1] 1704

The type of each variable (e.g., factor, integer, numeric)

# Here's another way to check a type; the result is TRUE or FALSE
is.factor(gapminder$country)

## [1] TRUE

is.numeric(gapminder$country)

## [1] FALSE

The first few values of each variable

# Here's another way to do it
head(gapminder)

## # A tibble: 6 × 6
##   country     continent  year lifeExp      pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan Asia       1952    28.8  8425333      779.
## 2 Afghanistan Asia       1957    30.3  9240934      821.
## 3 Afghanistan Asia       1962    32.0 10267083      853.
## 4 Afghanistan Asia       1967    34.0 11537966      836.
## 5 Afghanistan Asia       1972    36.1 13079460      740.
## 6 Afghanistan Asia       1977    38.4 14880372      786.

tail(gapminder)

## # A tibble: 6 × 6
##   country  continent  year lifeExp      pop gdpPercap
##   <fct>    <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Zimbabwe Africa     1982    60.4  7636524      789.
## 2 Zimbabwe Africa     1987    62.4  9216418      706.
## 3 Zimbabwe Africa     1992    60.4 10704340      693.
## 4 Zimbabwe Africa     1997    46.8 11404948      792.
## 5 Zimbabwe Africa     2002    40.0 11926563      672.
## 6 Zimbabwe Africa     2007    43.5 12311143      470.

# If you're in RStudio, this opens the data in a spreadsheet-style viewer
# View(gapminder)
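As a compact complement to the checks above, a few base R functions report the same structural information in one line each. This is just one possible set of checks, not the only way to do it.

```r
library(gapminder)

# Number of rows and columns in one call
dim(gapminder)
## [1] 1704    6

# Column (variable) names
names(gapminder)
## [1] "country"   "continent" "year"      "lifeExp"   "pop"       "gdpPercap"

# Type of every column at once
sapply(gapminder, class)
##   country continent      year   lifeExp       pop gdpPercap 
##  "factor"  "factor" "integer" "numeric" "integer" "numeric" 
```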

We can also get a quick statistical summary with the summary() function. Note, however, that the output becomes long and hard to read for larger datasets, especially ones with many variables or with variables that have not yet been cleaned (some datasets have thousands of variables).

# Get summary statistics for each variable
summary(gapminder)

##         country        continent        year         lifeExp     
##  Afghanistan:  12   Africa  :624   Min.   :1952   Min.   :23.60  
##  Albania    :  12   Americas:300   1st Qu.:1966   1st Qu.:48.20  
##  Algeria    :  12   Asia    :396   Median :1980   Median :60.71  
##  Angola     :  12   Europe  :360   Mean   :1980   Mean   :59.47  
##  Argentina  :  12   Oceania : 24   3rd Qu.:1993   3rd Qu.:70.85  
##  Australia  :  12                  Max.   :2007   Max.   :82.60  
##  (Other)    :1632                                                
##       pop              gdpPercap       
##  Min.   :6.001e+04   Min.   :   241.2  
##  1st Qu.:2.794e+06   1st Qu.:  1202.1  
##  Median :7.024e+06   Median :  3531.8  
##  Mean   :2.960e+07   Mean   :  7215.3  
##  3rd Qu.:1.959e+07   3rd Qu.:  9325.5  
##  Max.   :1.319e+09   Max.   :113523.1  
##

This shows us:

For numeric variables: minimum, maximum, mean, median, quartiles
For categorical variables: how many times each value appears

Note: we will do a more comprehensive overview of descriptive statistics in the next session.

The $ Operator

The $ operator is a really useful way to pick a single element from a list or a single column from a dataset. The basic structure is dataset$column_name.

For example, with the table function:

table(gapminder$year)

## 
## 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007 
##  142  142  142  142  142  142  142  142  142  142  142  142

You can also combine what we’ve learnt thus far: store the result of an operation in a named object, then print it to check the output.

example <- table(gapminder$continent)
print(example)

## 
##   Africa Americas     Asia   Europe  Oceania 
##      624      300      396      360       24
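A few more uses of $ may help fix the idea. The sketch below passes single columns to functions seen earlier, plus levels(), a base R function not used elsewhere in this session that lists the categories of a factor.

```r
library(gapminder)

# Pull out one column and look at its first three values
head(gapminder$year, 3)
## [1] 1952 1957 1962

# Pass a single column straight into a function
mean(gapminder$lifeExp)
## [1] 59.47444

# List the categories of a factor column
levels(gapminder$continent)
## [1] "Africa"   "Americas" "Asia"     "Europe"   "Oceania" 
```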

Before any analysis, it’s also crucial to check data quality. Here are some key checks:

# Check for missing values
colSums(is.na(gapminder))

##   country continent      year   lifeExp       pop gdpPercap 
##         0         0         0         0         0         0

# Count unique values in categorical variables
n_distinct(gapminder$country)

## [1] 142

n_distinct(gapminder$continent)

## [1] 5

# Check the range of years
range(gapminder$year)

## [1] 1952 2007

Note: this is an already cleaned (or ‘learning’) dataset, meaning there are no missing values and no values to recode or set to NA. With future datasets, we will have to handle these before proceeding with any analysis.

This is crucial because:

Missing data can affect our analysis and must be handled
We should understand the scope of our data
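To see why missing data matters, here is a minimal sketch using a made-up vector (scores is hypothetical; gapminder itself has no missing values). A single NA is enough to turn the mean into NA unless we tell R how to handle it.

```r
scores <- c(10, 20, NA, 40)  # hypothetical values with one missing entry

# Count the missing values
sum(is.na(scores))
## [1] 1

# One NA makes the whole result NA
mean(scores)
## [1] NA

# na.rm = TRUE drops missing values before calculating
mean(scores, na.rm = TRUE)
## [1] 23.33333
```

Whether dropping missing values is appropriate depends on the analysis; this only shows the mechanics.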

In the next section, we’ll learn how to manipulate this data to answer specific questions about global development. But first, let’s practice what we’ve learned.

Core Data Manipulation Skills

The Power of the Pipe: %>%

Before we dive into data manipulation, we need to understand one of the most powerful features in R: the pipe operator (%>%), which becomes available when we load the tidyverse. The pipe takes the output of one operation and feeds it into the next, so that code reads as a sequence of steps.

Let’s compare approaches. Say we want to find the average life expectancy in our dataset:

Without the pipe:

# Traditional approach (the function is wrapped around the data)
mean(gapminder$lifeExp)

## [1] 59.47444

With the pipe:

# Pipe approach
gapminder %>% 
  pull(lifeExp) %>% 
  mean()

## [1] 59.47444

Now you may ask: why use the pipe when the first version was shorter to write? The pipe becomes invaluable when we perform multiple operations. It is also powerful because it:

Makes code read left-to-right, like English
Lets us build analysis step-by-step
Makes longer, more involved operations easier to understand
Reduces the need for intermediate objects
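The difference becomes easier to see once a second step is added. The sketch below computes the same rounded mean twice, once nested and once piped; the object names nested and piped are assumptions for this example.

```r
library(dplyr)      # makes the %>% pipe available
library(gapminder)

# Nested: read from the inside out
nested <- round(mean(gapminder$lifeExp), 1)

# Piped: read from top to bottom, one step per line
piped <- gapminder %>%
  pull(lifeExp) %>%
  mean() %>%
  round(1)

nested
## [1] 59.5
piped
## [1] 59.5
```

Both give the same answer; the piped version simply lists the steps in the order they happen.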

Selecting Variables with select()

Often, we want to focus on specific variables in our dataset. The select() function helps us do this. Think of it as choosing which columns you want to work with.

Basic Selection

Let’s start with the simplest case - selecting a few specific variables:

# Select just country, year, and life expectancy
gapminder %>%
  select(country, year, lifeExp)

## # A tibble: 1,704 × 3
##    country      year lifeExp
##    <fct>       <int>   <dbl>
##  1 Afghanistan  1952    28.8
##  2 Afghanistan  1957    30.3
##  3 Afghanistan  1962    32.0
##  4 Afghanistan  1967    34.0
##  5 Afghanistan  1972    36.1
##  6 Afghanistan  1977    38.4
##  7 Afghanistan  1982    39.9
##  8 Afghanistan  1987    40.8
##  9 Afghanistan  1992    41.7
## 10 Afghanistan  1997    41.8
## # ℹ 1,694 more rows

Notice how:

We start with our dataset (gapminder)
We pipe it into select()
We list the variables we want to keep
The output only shows these three columns

But suppose you are really sure you only wish to work with those three variables. You can store what we call a subsetted dataset with the ‘<-’ operator that you encountered earlier. This is super useful for when you are dealing with large datasets. Here’s how it works in practice:

df <- gapminder %>%
  select(country, year, lifeExp)
df

## # A tibble: 1,704 × 3
##    country      year lifeExp
##    <fct>       <int>   <dbl>
##  1 Afghanistan  1952    28.8
##  2 Afghanistan  1957    30.3
##  3 Afghanistan  1962    32.0
##  4 Afghanistan  1967    34.0
##  5 Afghanistan  1972    36.1
##  6 Afghanistan  1977    38.4
##  7 Afghanistan  1982    39.9
##  8 Afghanistan  1987    40.8
##  9 Afghanistan  1992    41.7
## 10 Afghanistan  1997    41.8
## # ℹ 1,694 more rows

Renaming While Selecting

We can rename variables while selecting them:

gapminder %>%
  select(
    nation = country,      # 'country' becomes 'nation'
    year,
    life_expectancy = lifeExp  # 'lifeExp' becomes 'life_expectancy'
  )

## # A tibble: 1,704 × 3
##    nation       year life_expectancy
##    <fct>       <int>           <dbl>
##  1 Afghanistan  1952            28.8
##  2 Afghanistan  1957            30.3
##  3 Afghanistan  1962            32.0
##  4 Afghanistan  1967            34.0
##  5 Afghanistan  1972            36.1
##  6 Afghanistan  1977            38.4
##  7 Afghanistan  1982            39.9
##  8 Afghanistan  1987            40.8
##  9 Afghanistan  1992            41.7
## 10 Afghanistan  1997            41.8
## # ℹ 1,694 more rows

This is useful when:

You want more readable names
You need to standardize names across datasets

But always check that it worked as intended.
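Two related patterns are worth knowing, shown here as a brief sketch: dplyr’s rename() changes names while keeping every column, and a minus sign inside select() drops columns rather than keeping them. The object names renamed and trimmed are hypothetical.

```r
library(dplyr)
library(gapminder)

# rename() changes names but keeps every column
renamed <- gapminder %>%
  rename(nation = country)
names(renamed)
## [1] "nation"    "continent" "year"      "lifeExp"   "pop"       "gdpPercap"

# A minus sign inside select() drops columns instead of keeping them
trimmed <- gapminder %>%
  select(-pop, -gdpPercap)
names(trimmed)
## [1] "country"   "continent" "year"      "lifeExp"  
```

Calling names() afterwards is a quick way to check that the change worked as intended.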

Filtering Observations with filter()

While select() chooses columns, filter() chooses rows based on conditions. This is how we focus on specific cases we’re interested in.

Basic Filtering

Let’s start with simple conditions. Suppose you only wanted to show data from a specific year.

Well, first, identify the years available:

table(gapminder$year)

## 
## 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007 
##  142  142  142  142  142  142  142  142  142  142  142  142

Then, filter to that specific year only:

gapminder %>%
  filter(year == 2007)

## # A tibble: 142 × 6
##    country     continent  year lifeExp       pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>     <int>     <dbl>
##  1 Afghanistan Asia       2007    43.8  31889923      975.
##  2 Albania     Europe     2007    76.4   3600523     5937.
##  3 Algeria     Africa     2007    72.3  33333216     6223.
##  4 Angola      Africa     2007    42.7  12420476     4797.
##  5 Argentina   Americas   2007    75.3  40301927    12779.
##  6 Australia   Oceania    2007    81.2  20434176    34435.
##  7 Austria     Europe     2007    79.8   8199783    36126.
##  8 Bahrain     Asia       2007    75.6    708573    29796.
##  9 Bangladesh  Asia       2007    64.1 150448339     1391.
## 10 Belgium     Europe     2007    79.4  10392226    33693.
## # ℹ 132 more rows

Note that the output above still shows ‘country’ rather than ‘nation’: the earlier renaming was not retained. Why is that? Simple: we did not store the result (storing is like ‘saving’ to a named object). The operation ran, but its result was not kept anywhere. If you want the renaming to persist in your project, you need to use the assignment operator and give the result a name. For instance:

df <- gapminder %>%
  select(
    nation = country,      # 'country' becomes 'nation'
    year,
    life_expectancy = lifeExp  # 'lifeExp' becomes 'life_expectancy'
  )

We can now compare the dataset stored as ‘df’ (columns renamed) with ‘gapminder’ (original) to see if it worked. Because we did not overwrite ‘gapminder’, we can always go back to the original, as long as we remember its name.

df

## # A tibble: 1,704 × 3
##    nation       year life_expectancy
##    <fct>       <int>           <dbl>
##  1 Afghanistan  1952            28.8
##  2 Afghanistan  1957            30.3
##  3 Afghanistan  1962            32.0
##  4 Afghanistan  1967            34.0
##  5 Afghanistan  1972            36.1
##  6 Afghanistan  1977            38.4
##  7 Afghanistan  1982            39.9
##  8 Afghanistan  1987            40.8
##  9 Afghanistan  1992            41.7
## 10 Afghanistan  1997            41.8
## # ℹ 1,694 more rows

gapminder

## # A tibble: 1,704 × 6
##    country     continent  year lifeExp      pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.
##  2 Afghanistan Asia       1957    30.3  9240934      821.
##  3 Afghanistan Asia       1962    32.0 10267083      853.
##  4 Afghanistan Asia       1967    34.0 11537966      836.
##  5 Afghanistan Asia       1972    36.1 13079460      740.
##  6 Afghanistan Asia       1977    38.4 14880372      786.
##  7 Afghanistan Asia       1982    39.9 12881816      978.
##  8 Afghanistan Asia       1987    40.8 13867957      852.
##  9 Afghanistan Asia       1992    41.7 16317921      649.
## 10 Afghanistan Asia       1997    41.8 22227415      635.
## # ℹ 1,694 more rows

Now as stated above, we can chain operations. Let’s work from the original gapminder:

df <- gapminder %>%
  filter(year == 2007) %>% # the pipe passes the filtered rows on to select()
  select(
    nation = country,     
    year,
    life_expectancy = lifeExp
  )

df

## # A tibble: 142 × 3
##    nation       year life_expectancy
##    <fct>       <int>           <dbl>
##  1 Afghanistan  2007            43.8
##  2 Albania      2007            76.4
##  3 Algeria      2007            72.3
##  4 Angola       2007            42.7
##  5 Argentina    2007            75.3
##  6 Australia    2007            81.2
##  7 Austria      2007            79.8
##  8 Bahrain      2007            75.6
##  9 Bangladesh   2007            64.1
## 10 Belgium      2007            79.4
## # ℹ 132 more rows

Note that here we overwrote the prior ‘df’ since we used the same name. Often, you might want to keep working while retaining naming consistency (e.g., always calling your processed dataset ‘df’). But you might also want different names so that you can backtrack or avoid overwriting. In that case, if you have multiple names for the different operations used to process your dataset, make sure you keep track of them all.

Multiple Conditions

We can also combine multiple conditions:

# Show European countries in 2007 with life expectancy over 75
gapminder %>%
  filter(
    continent == "Europe",
    year == 2007,
    lifeExp > 75
  )

## # A tibble: 22 × 6
##    country        continent  year lifeExp      pop gdpPercap
##    <fct>          <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Albania        Europe     2007    76.4  3600523     5937.
##  2 Austria        Europe     2007    79.8  8199783    36126.
##  3 Belgium        Europe     2007    79.4 10392226    33693.
##  4 Croatia        Europe     2007    75.7  4493312    14619.
##  5 Czech Republic Europe     2007    76.5 10228744    22833.
##  6 Denmark        Europe     2007    78.3  5468120    35278.
##  7 Finland        Europe     2007    79.3  5238460    33207.
##  8 France         Europe     2007    80.7 61083916    30470.
##  9 Germany        Europe     2007    79.4 82400996    32170.
## 10 Greece         Europe     2007    79.5 10706290    27538.
## # ℹ 12 more rows

When you list conditions with commas, R requires ALL conditions to be true (AND logic).

OR Conditions

Sometimes we want rows that meet ANY of our conditions. We use the OR operator (|):

# Show data for either Europe or Asia
gapminder %>%
  filter(continent == "Europe" | continent == "Asia")

## # A tibble: 756 × 6
##    country     continent  year lifeExp      pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.
##  2 Afghanistan Asia       1957    30.3  9240934      821.
##  3 Afghanistan Asia       1962    32.0 10267083      853.
##  4 Afghanistan Asia       1967    34.0 11537966      836.
##  5 Afghanistan Asia       1972    36.1 13079460      740.
##  6 Afghanistan Asia       1977    38.4 14880372      786.
##  7 Afghanistan Asia       1982    39.9 12881816      978.
##  8 Afghanistan Asia       1987    40.8 13867957      852.
##  9 Afghanistan Asia       1992    41.7 16317921      649.
## 10 Afghanistan Asia       1997    41.8 22227415      635.
## # ℹ 746 more rows

Note the use of == for comparison. Here are the main comparison operators:

== equals
!= does not equal
> greater than
< less than
>= greater than or equal to
<= less than or equal to

Special Filtering Functions

Some helpful filtering functions:

# Show countries with population between 1 million and 10 million in 2007
gapminder %>%
  filter(
    year == 2007,
    between(pop, 1000000, 10000000)
  )

## # A tibble: 58 × 6
##    country                  continent  year lifeExp     pop gdpPercap
##    <fct>                    <fct>     <int>   <dbl>   <int>     <dbl>
##  1 Albania                  Europe     2007    76.4 3600523     5937.
##  2 Austria                  Europe     2007    79.8 8199783    36126.
##  3 Benin                    Africa     2007    56.7 8078314     1441.
##  4 Bolivia                  Americas   2007    65.6 9119152     3822.
##  5 Bosnia and Herzegovina   Europe     2007    74.9 4552198     7446.
##  6 Botswana                 Africa     2007    50.7 1639131    12570.
##  7 Bulgaria                 Europe     2007    73.0 7322858    10681.
##  8 Burundi                  Africa     2007    49.6 8390505      430.
##  9 Central African Republic Africa     2007    44.7 4369038      706.
## 10 Congo, Rep.              Africa     2007    55.3 3800610     3633.
## # ℹ 48 more rows

# Show specific countries
gapminder %>%
  filter(country %in% c("Sweden", "Norway", "Denmark"))

## # A tibble: 36 × 6
##    country continent  year lifeExp     pop gdpPercap
##    <fct>   <fct>     <int>   <dbl>   <int>     <dbl>
##  1 Denmark Europe     1952    70.8 4334000     9692.
##  2 Denmark Europe     1957    71.8 4487831    11100.
##  3 Denmark Europe     1962    72.4 4646899    13583.
##  4 Denmark Europe     1967    73.0 4838800    15937.
##  5 Denmark Europe     1972    73.5 4991596    18866.
##  6 Denmark Europe     1977    74.7 5088419    20423.
##  7 Denmark Europe     1982    74.6 5117810    21688.
##  8 Denmark Europe     1987    74.8 5127024    25116.
##  9 Denmark Europe     1992    75.3 5171393    26407.
## 10 Denmark Europe     1997    76.1 5283663    29804.
## # ℹ 26 more rows
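Both helpers are shorthand for conditions we could write with the comparison operators. The sketch below checks this by counting rows each way; the n_ object names are assumptions for the example.

```r
library(dplyr)
library(gapminder)

# between() is shorthand for >= and <= combined (both ends included)
n_between <- gapminder %>%
  filter(year == 2007, between(pop, 1000000, 10000000)) %>%
  nrow()
n_manual <- gapminder %>%
  filter(year == 2007, pop >= 1000000, pop <= 10000000) %>%
  nrow()
c(n_between, n_manual)
## [1] 58 58

# %in% is shorthand for several == conditions joined by |
n_in <- gapminder %>%
  filter(continent %in% c("Europe", "Asia")) %>%
  nrow()
n_or <- gapminder %>%
  filter(continent == "Europe" | continent == "Asia") %>%
  nrow()
c(n_in, n_or)
## [1] 756 756
```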

Common Filtering Mistakes

Here are some common mistakes to avoid:

# WRONG way - using = instead of ==
# gapminder %>% filter(continent = "Europe")

# WRONG way - using AND instead of &
# gapminder %>% filter(continent == "Europe" AND year == 2007)

# WRONG way - forgetting quotes for text
# gapminder %>% filter(continent == Europe)
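For comparison, here are corrected versions of the mistakes above, written as a short sketch. The object names europe and europe_2007 are hypothetical.

```r
library(dplyr)
library(gapminder)

# RIGHT way - use == to test equality, and put text values in quotes
europe <- gapminder %>%
  filter(continent == "Europe")
nrow(europe)
## [1] 360

# RIGHT way - use & (or a comma) to require both conditions
europe_2007 <- gapminder %>%
  filter(continent == "Europe" & year == 2007)
nrow(europe_2007)
## [1] 30
```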

Combining Select and Filter

As stated before, the real power comes from combining (or ‘chaining’ / ‘piping’) these operations:

# Look at life expectancy in large European countries
gapminder %>%
  # First, filter for our cases of interest
  filter(
    continent == "Europe",
    year == 2007,
    pop > 5000000
  ) %>%
  # Then select just the variables we want to see
  select(country, pop, lifeExp) %>%
  # Finally, arrange by life expectancy
  arrange(desc(lifeExp))

## # A tibble: 22 × 3
##    country             pop lifeExp
##    <fct>             <int>   <dbl>
##  1 Switzerland     7554661    81.7
##  2 Spain          40448191    80.9
##  3 Sweden          9031088    80.9
##  4 France         61083916    80.7
##  5 Italy          58147733    80.5
##  6 Austria         8199783    79.8
##  7 Netherlands    16570613    79.8
##  8 Greece         10706290    79.5
##  9 Belgium        10392226    79.4
## 10 United Kingdom 60776238    79.4
## # ℹ 12 more rows

In this example, we:

Start with our full dataset (gapminder)
Filter to specific cases we’re interested in
Select just the variables we need
Arrange the results in a meaningful order

Understanding Your Results

After any data manipulation, it’s crucial to check your results:

Do the numbers make sense?
- Are values in expected ranges?
- Are there any suspicious patterns?
Did you get the cases you expected?
- Are important countries/cases included?
- Are the years what you wanted?
How many cases did you get?
- Does the number of rows make sense?
- Did you filter out too much or too little?

Use these checks to verify your work, notably to see whether you removed observations. One thing that will happen in your R journey is that you will inadvertently remove ALL observations (i.e., 0 obs). So always check to make sure you did not introduce errors or other important issues. As we will see in a later session, at times we want to remove specific rows, for example if they contain all NA values or if they are duplicates. In that case, the number of observations is expected to go down at least slightly. But in the case below, selecting columns after filtering to 2007 should not remove any rows, so we expect 142 (one per country):

gapminder %>%
  filter(year == 2007) %>%
  select(country, lifeExp) %>%
  # How many rows?
  nrow()

## [1] 142

Still 142 obs. If you are working in RStudio, you can also check the number of observations of any object you have stored by looking at the Environment pane (the top right window by default).
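Here is a sketch of how easily all observations can disappear. A small typo in a text value (lowercase "europe") matches nothing, and R returns zero rows without raising an error. The object names oops and fixed are hypothetical.

```r
library(dplyr)
library(gapminder)

# Typo: continent names are capitalised, so "europe" matches nothing
oops <- gapminder %>%
  filter(continent == "europe")
nrow(oops)
## [1] 0

# Corrected spelling: the expected rows come back
fixed <- gapminder %>%
  filter(continent == "Europe")
nrow(fixed)
## [1] 360
```

Because no error is raised, checking the row count is the only way to notice this kind of mistake.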

In the next section, we’ll learn about creating new variables and calculating summaries, building on these fundamental skills.

Creating New Variables with mutate()

Often, we need to create new variables based on existing ones. The mutate() function helps us do this.

Basic Calculations

Let’s start with simple arithmetic:

# Calculate total GDP
d <- gapminder %>%
  mutate(
    gdp_total = pop * gdpPercap,  # Multiply population by GDP per capita
    gdp_billion = gdp_total / 1e9  # Convert to billions
  ) %>%
  select(country, year, gdp_total, gdp_billion)

Note how:

We can create multiple new variables at once
We can use variables we just created (gdp_total)
The new variables are added to the right of the dataset

Let’s check what we did:

d

## # A tibble: 1,704 × 4
##    country      year    gdp_total gdp_billion
##    <fct>       <int>        <dbl>       <dbl>
##  1 Afghanistan  1952  6567086330.        6.57
##  2 Afghanistan  1957  7585448670.        7.59
##  3 Afghanistan  1962  8758855797.        8.76
##  4 Afghanistan  1967  9648014150.        9.65
##  5 Afghanistan  1972  9678553274.        9.68
##  6 Afghanistan  1977 11697659231.       11.7 
##  7 Afghanistan  1982 12598563401.       12.6 
##  8 Afghanistan  1987 11820990309.       11.8 
##  9 Afghanistan  1992 10595901589.       10.6 
## 10 Afghanistan  1997 14121995875.       14.1 
## # ℹ 1,694 more rows
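To confirm where mutate() places new variables, and that the original dataset is untouched, we can compare column names and counts. This is a minimal sketch; pop_thousand and with_new are made-up names for illustration.

```r
library(dplyr)
library(gapminder)

with_new <- gapminder %>%
  mutate(pop_thousand = pop / 1000)  # hypothetical new variable

# The new column is the last one (added to the right)
tail(names(with_new), 1)
## [1] "pop_thousand"

# The original dataset is unchanged because the result was stored elsewhere
ncol(gapminder)
## [1] 6
ncol(with_new)
## [1] 7
```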

Using Functions in mutate()

We can use any R function within mutate():

# Create rounded and logged versions of population
pop <- gapminder %>%
  mutate(
    pop_million = round(pop / 1e6, 1),  # Population in millions, rounded to 1 decimal
    pop_log = log(pop)  # Natural log of population
  ) %>%
  select(country, year, pop, pop_million, pop_log)

pop

## # A tibble: 1,704 × 5
##    country      year      pop pop_million pop_log
##    <fct>       <int>    <int>       <dbl>   <dbl>
##  1 Afghanistan  1952  8425333         8.4    15.9
##  2 Afghanistan  1957  9240934         9.2    16.0
##  3 Afghanistan  1962 10267083        10.3    16.1
##  4 Afghanistan  1967 11537966        11.5    16.3
##  5 Afghanistan  1972 13079460        13.1    16.4
##  6 Afghanistan  1977 14880372        14.9    16.5
##  7 Afghanistan  1982 12881816        12.9    16.4
##  8 Afghanistan  1987 13867957        13.9    16.4
##  9 Afghanistan  1992 16317921        16.3    16.6
## 10 Afghanistan  1997 22227415        22.2    16.9
## # ℹ 1,694 more rows

Common functions used in mutate():

round(): Round numbers
log(): Natural logarithm
sqrt(): Square root
abs(): Absolute value
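The last two functions in this list have not been shown yet, so here is a brief sketch using them. The variable names gap_from_mean and pop_sqrt, and the object life_gap, are assumptions for this example.

```r
library(dplyr)
library(gapminder)

life_gap <- gapminder %>%
  filter(year == 2007) %>%
  mutate(
    gap_from_mean = abs(lifeExp - mean(lifeExp)),  # distance from the 2007 average, never negative
    pop_sqrt = sqrt(pop)                           # square root of population
  ) %>%
  select(country, lifeExp, gap_from_mean, pop_sqrt)

life_gap
```

Here mean(lifeExp) is calculated over the filtered 2007 rows only, since filter() runs before mutate().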

Creating Categories with case_when()

Often, we want to create categories based on values:

gpc <- gapminder %>%
  mutate(
    development_level = case_when(
      gdpPercap < 1000 ~ "Low income",
      gdpPercap < 10000 ~ "Middle income",
      TRUE ~ "High income"  # Default case
    )
  ) %>%
  select(country, year, gdpPercap, development_level)

gpc

## # A tibble: 1,704 × 4
##    country      year gdpPercap development_level
##    <fct>       <int>     <dbl> <chr>            
##  1 Afghanistan  1952      779. Low income       
##  2 Afghanistan  1957      821. Low income       
##  3 Afghanistan  1962      853. Low income       
##  4 Afghanistan  1967      836. Low income       
##  5 Afghanistan  1972      740. Low income       
##  6 Afghanistan  1977      786. Low income       
##  7 Afghanistan  1982      978. Low income       
##  8 Afghanistan  1987      852. Low income       
##  9 Afghanistan  1992      649. Low income       
## 10 Afghanistan  1997      635. Low income       
## # ℹ 1,694 more rows

case_when() is powerful because:

It can handle multiple conditions
Conditions are checked in order
You can set a default with TRUE
It’s more readable than nested if-else
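Because conditions are checked in order, swapping them changes the result. The sketch below uses three made-up values (gdp_values is hypothetical) to show that each value takes the label of the first condition it satisfies.

```r
library(dplyr)

gdp_values <- c(500, 5000, 50000)  # hypothetical GDP per capita values

# Narrowest condition first: each value gets the intended label
right_order <- case_when(
  gdp_values < 1000 ~ "Low income",
  gdp_values < 10000 ~ "Middle income",
  TRUE ~ "High income"
)
right_order
## [1] "Low income"    "Middle income" "High income"  

# Broad condition first: 500 is also below 10000, so it never reaches "Low income"
wrong_order <- case_when(
  gdp_values < 10000 ~ "Middle income",
  gdp_values < 1000 ~ "Low income",
  TRUE ~ "High income"
)
wrong_order
## [1] "Middle income" "Middle income" "High income"  
```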

Basic Data Summaries

While we’ll dive deeper into descriptive statistics next session, let’s start our journey and look at some basic summaries.

Counting Cases

The simplest summary is counting:

# How many observations per continent?
gapminder %>%
  count(continent, sort = TRUE)

## # A tibble: 5 × 2
##   continent     n
##   <fct>     <int>
## 1 Africa      624
## 2 Asia        396
## 3 Europe      360
## 4 Americas    300
## 5 Oceania      24

# How many countries per continent in 2007?
gapminder %>%
  filter(year == 2007) %>%
  count(continent, sort = TRUE)

## # A tibble: 5 × 2
##   continent     n
##   <fct>     <int>
## 1 Africa       52
## 2 Asia         33
## 3 Europe       30
## 4 Americas     25
## 5 Oceania       2

Basic Statistics

We can calculate basic statistics for our variables:

# Summary statistics for life expectancy
gapminder %>%
  summarise(
    mean_life = mean(lifeExp),
    median_life = median(lifeExp),
    min_life = min(lifeExp),
    max_life = max(lifeExp),
    sd_life = sd(lifeExp)
  )

## # A tibble: 1 × 5
##   mean_life median_life min_life max_life sd_life
##       <dbl>       <dbl>    <dbl>    <dbl>   <dbl>
## 1      59.5        60.7     23.6     82.6    12.9

Grouped Summaries

Most often, we want summaries by groups:

# Life expectancy statistics by continent in 2007
gapminder %>%
  filter(year == 2007) %>%
  group_by(continent) %>%
  summarise(
    countries = n(),
    mean_life = mean(lifeExp),
    min_life = min(lifeExp),
    max_life = max(lifeExp)
  ) %>%
  arrange(desc(mean_life))

## # A tibble: 5 × 5
##   continent countries mean_life min_life max_life
##   <fct>         <int>     <dbl>    <dbl>    <dbl>
## 1 Oceania           2      80.7     80.2     81.2
## 2 Europe           30      77.6     71.8     81.8
## 3 Americas         25      73.6     60.9     80.7
## 4 Asia             33      70.7     43.8     82.6
## 5 Africa           52      54.8     39.6     76.4
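As a way to connect the two ideas, count() can be thought of as a shortcut for group_by() followed by summarise() with n(). The sketch below compares the two; the object names by_count and by_summarise are hypothetical.

```r
library(dplyr)
library(gapminder)

# Shortcut: count() groups and counts in one step
by_count <- gapminder %>%
  filter(year == 2007) %>%
  count(continent)

# Long form: group first, then count the rows in each group
by_summarise <- gapminder %>%
  filter(year == 2007) %>%
  group_by(continent) %>%
  summarise(n = n())

by_count$n
## [1] 52 25 33 30  2
by_summarise$n
## [1] 52 25 33 30  2
```

The long form is worth knowing because other statistics, such as mean(), can be added to the same summarise() call.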

Best Practices for Data Analysis

As we conclude this introduction to R, here are key practices to remember:

Start with Questions
- What do you want to know?
- What do you need to explore to answer your question?
Check Your Data
- Look at the structure
- Check for missing values
- Verify variable types
- Examine value ranges
Build Analysis Gradually
- Start with simple operations
- Add complexity step by step
- Check results at each stage
- Keep track of what you’ve done

Practice Exercises (to leverage for weekly diary!)

Here are 5 exercises to help you practice what we covered today (using the same data and skills).

Exercise 1: Basic Objects

Practice creating and printing objects:

Create an object called ‘x’ and store the number 42 in it
Create an object called ‘y’ and store the text “hello” in it
Print both objects
Check if ‘x’ is numeric using is.numeric()
Check if ‘y’ is numeric using is.numeric()

Hint: Remember how we used the arrow (<-) for assignment and the print() function

Exercise 2: Variable Selection

Using select(), create a new dataset that includes:

country
year
population
GDP per capita

Hint: This is just like what we did with life expectancy, but choosing different variables

Exercise 3: Filtering Data

Filter the gapminder dataset to show:

Only data from the year 1997
Only countries from Africa
Only these two variables: country and population

Hint: Remember how we filtered for 2007 and Europe? Just change those values

Exercise 4: Creating New Variables

Create a new dataset that:

Starts with data from 2007
Creates a new column that converts population to millions
Shows only: country, continent, population, and your new population in millions column

Hint: Look at how we converted GDP to billions, but use different math for millions

Exercise 5: Basic Counting

Count how many countries are in each continent in the year 2007.

Hint: Remember how we used count() with continent

For your weekly diary reflection, consider:

Which exercises could you complete successfully?
Where did you get stuck?
What helped you overcome any challenges?
What would you like to practice more?

Remember: The goal is learning! Share your experiences, ask questions, and help others when you can.

Week 2. Session 1: Exploring Data

Sébastien Parker

SOC3320 - Winter 2025