In R, we distinguish between installing and loading packages:
Installing is like downloading an app (which only needs to be done once)
For today’s session, we’ll focus on two essential packages:
# Only run these once to install the packages
install.packages("tidyverse") # Core data science tools
install.packages("gapminder") # Dataset we'll use today
Note: the tidyverse is a collection of packages that share the same general syntax and philosophy. It is one of the most widely used sets of tools in R, covering everything from data manipulation to modeling and visualization.
There is also a free, comprehensive, acclaimed book attached to it:
Loading is like opening an app: it needs to be done in each new session (i.e., every time you close and reopen R) before you can use the package’s features, which are functions. It’s like telling R that you want to run “this command” in “this way”.
Now let’s load our packages:
library(tidyverse) # Core tools for data manipulation
library(gapminder) # Dataset about global development
If you see red text after running these commands, don’t worry! This is normal - R is just telling you what it’s loading. If you see an error message that includes “there is no package called…”, you’ll need to install the package first using the install.packages() command above.
If you want to automate both checking whether the specified packages are installed and loading them, here’s a useful code block; you only need to replace the package names.
# List of packages
packages <- c("tidyverse", "gapminder") # add any you need here
# Install packages if they aren't installed already
new_packages <- packages[!(packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)
# Load the packages
lapply(packages, library, character.only = TRUE)
## [[1]]
## [1] "gapminder" "lubridate" "forcats" "stringr" "dplyr" "purrr"
## [7] "readr" "tidyr" "tibble" "ggplot2" "tidyverse" "stats"
## [13] "graphics" "grDevices" "utils" "datasets" "methods" "base"
##
## [[2]]
## [1] "gapminder" "lubridate" "forcats" "stringr" "dplyr" "purrr"
## [7] "readr" "tidyr" "tibble" "ggplot2" "tidyverse" "stats"
## [13] "graphics" "grDevices" "utils" "datasets" "methods" "base"
Before we dive into data analysis, let’s understand some fundamental concepts in R. Think of these as the building blocks we’ll use throughout the course.
In R, we store information in objects using the assignment operator (<-):
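The example chunk for this step is not shown here. Judging from the text and the printed value that follow, it is presumably an assignment along these lines (the object name "age" and the value 25 are taken from the surrounding text):

```r
# Store the number 25 in an object called "age"
age <- 25
```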
Seems simple, but it can be leveraged for all kinds of tasks. For instance, we can now print what is stored in “age”.
The print() function displays the information stored in an object.
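A minimal sketch of the call that would produce the output shown next, assuming "age" was assigned the value 25 beforehand:

```r
age <- 25     # assumed assignment from the previous step
print(age)    # display what is stored in "age"
```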
## [1] 25
It also means we can give the object any name we like, and it will work the same way.
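For instance, the same value can be stored under a different name. The name "my_age" below is hypothetical and chosen only for illustration:

```r
# "my_age" is a made-up name; any valid object name behaves the same way
my_age <- 25
print(my_age)
```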
## [1] 25
Now let’s practice a few other examples:
# Storing text (called a "string" in programming)
name <- "Alex"
# Storing multiple numbers (called a "vector")
ages <- c(23, 45, 67, 89)
The arrow (<-) means “store the right side in the name on the left.” This is powerful because:
We can reuse values without retyping them
We can update values while keeping our code the same
We can build more involved operations step by step
Now suppose you wanted to know what is stored in the objects ‘name’ and ‘ages’. What would you do to check?
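One way to check, as a sketch that reuses the two objects created above and prints them in the same order as the output that follows:

```r
name <- "Alex"
ages <- c(23, 45, 67, 89)

print(ages)   # the vector of four numbers
print(name)   # the text string
```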
## [1] 23 45 67 89
## [1] "Alex"
Functions perform operations on data. They follow this pattern:
function_name(argument1, argument2, ...)
For example:
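The example chunk is not shown here. The two results that follow are consistent with calls to round(), so the sketch below is an assumption about what was run, not the original code:

```r
# round() takes a number and, optionally, how many decimal places to keep
round(3.14159, digits = 2)   # first argument: the number; second: decimals
round(3.14159)               # digits defaults to 0, so this rounds to a whole number
```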
## [1] 3.14
## [1] 3
Understanding this pattern is crucial because:
All R operations use this basic structure
It helps you read and write code
It makes documentation easier to understand
Before we dive into data manipulation, we need to understand one of the most powerful features in R: the pipe operator (%>%). The pipe takes the output of one operation and feeds it into the next one, so that operations run as a sequence.
Let’s compare approaches. Say we want to find the average life expectancy in our dataset:
Without the pipe:
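A sketch of the non-piped version, assuming the gapminder package is installed; the column is extracted with $ and passed straight to mean():

```r
library(gapminder)

# Average life expectancy across all rows of the dataset
mean(gapminder$lifeExp)
```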
## [1] 59.47444
With the pipe:
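A sketch of the piped version. It uses dplyr's pull() to extract the column as a vector; pull() is not introduced elsewhere in this session, so treat it as one possible way to write this rather than the original code:

```r
library(dplyr)       # provides %>% and pull()
library(gapminder)

piped_mean <- gapminder %>%
  pull(lifeExp) %>%  # take the lifeExp column out as a plain vector
  mean()             # then average it

piped_mean
```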
## [1] 59.47444
Now you may ask: why use the pipe when the first version was shorter to write? The pipe becomes invaluable when we perform multiple operations. It is also powerful because it:
Makes code read left-to-right, like English
Lets us build analysis step-by-step
Makes longer, more involved operations easier to understand
Reduces the need for intermediate objects
Often, we want to focus on specific variables in our dataset. The select() function helps us do this. Think of it as choosing which columns you want to work with.
Let’s start with the simplest case - selecting a few specific variables:
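The chunk for this step is not shown here. Based on the three columns in the output that follows, it is presumably equivalent to this sketch:

```r
library(dplyr)
library(gapminder)

selected <- gapminder %>%
  select(country, year, lifeExp)   # keep only these three columns

selected
```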
## # A tibble: 1,704 × 3
## country year lifeExp
## <fct> <int> <dbl>
## 1 Afghanistan 1952 28.8
## 2 Afghanistan 1957 30.3
## 3 Afghanistan 1962 32.0
## 4 Afghanistan 1967 34.0
## 5 Afghanistan 1972 36.1
## 6 Afghanistan 1977 38.4
## 7 Afghanistan 1982 39.9
## 8 Afghanistan 1987 40.8
## 9 Afghanistan 1992 41.7
## 10 Afghanistan 1997 41.8
## # ℹ 1,694 more rows
Notice how:
We start with our dataset (gapminder)
We pipe it into select()
We list the variables we want to keep
The output only shows these three columns
But suppose you are sure you only wish to work with those three variables. You can store this subset of the dataset in a new object with the ‘<-’ operator that you encountered earlier. This is especially useful when you are dealing with large datasets. Here’s how it works in practice:
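A minimal sketch of storing the subset. The object name "gapminder_subset" is hypothetical; the original chunk and the name it used are not shown here:

```r
library(dplyr)
library(gapminder)

# Store the three selected columns in a new object (hypothetical name)
gapminder_subset <- gapminder %>%
  select(country, year, lifeExp)

# Typing the object's name prints it
gapminder_subset
```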
## # A tibble: 1,704 × 3
## country year lifeExp
## <fct> <int> <dbl>
## 1 Afghanistan 1952 28.8
## 2 Afghanistan 1957 30.3
## 3 Afghanistan 1962 32.0
## 4 Afghanistan 1967 34.0
## 5 Afghanistan 1972 36.1
## 6 Afghanistan 1977 38.4
## 7 Afghanistan 1982 39.9
## 8 Afghanistan 1987 40.8
## 9 Afghanistan 1992 41.7
## 10 Afghanistan 1997 41.8
## # ℹ 1,694 more rows
We can rename variables while selecting them:
gapminder %>%
  select(
    nation = country,           # 'country' becomes 'nation'
    year,
    life_expectancy = lifeExp   # 'lifeExp' becomes 'life_expectancy'
  )
## # A tibble: 1,704 × 3
## nation year life_expectancy
## <fct> <int> <dbl>
## 1 Afghanistan 1952 28.8
## 2 Afghanistan 1957 30.3
## 3 Afghanistan 1962 32.0
## 4 Afghanistan 1967 34.0
## 5 Afghanistan 1972 36.1
## 6 Afghanistan 1977 38.4
## 7 Afghanistan 1982 39.9
## 8 Afghanistan 1987 40.8
## 9 Afghanistan 1992 41.7
## 10 Afghanistan 1997 41.8
## # ℹ 1,694 more rows
This is useful when:
You want more readable names
You need to standardize names across datasets
But always check that it worked as intended.
While select() chooses columns, filter() chooses rows based on conditions. This is how we focus on specific cases we’re interested in.
Let’s start with simple conditions. Suppose you only wanted to show data from a specific year.
Well, first, identify the years available:
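The layout of the output that follows matches base R's table(), so the missing chunk is presumably something like this sketch:

```r
library(gapminder)

# Count how many rows exist for each year in the dataset
year_counts <- table(gapminder$year)
year_counts
```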
##
## 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007
## 142 142 142 142 142 142 142 142 142 142 142 142
Then, filter to that specific year only:
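A sketch of the filter step, assuming 2007 is the year of interest (it is the year shown in the output that follows):

```r
library(dplyr)
library(gapminder)

only_2007 <- gapminder %>%
  filter(year == 2007)   # keep only rows where year equals 2007

only_2007
```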
## # A tibble: 142 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 2007 43.8 31889923 975.
## 2 Albania Europe 2007 76.4 3600523 5937.
## 3 Algeria Africa 2007 72.3 33333216 6223.
## 4 Angola Africa 2007 42.7 12420476 4797.
## 5 Argentina Americas 2007 75.3 40301927 12779.
## 6 Australia Oceania 2007 81.2 20434176 34435.
## 7 Austria Europe 2007 79.8 8199783 36126.
## 8 Bahrain Asia 2007 75.6 708573 29796.
## 9 Bangladesh Asia 2007 64.1 150448339 1391.
## 10 Belgium Europe 2007 79.4 10392226 33693.
## # ℹ 132 more rows
Note that the output above still shows ‘country’ rather than ‘nation’: the renaming was not retained. Why is that? Simple: we did not store the result (storing is like ‘saving’ to a named object). We ran the operation, but its result was not kept anywhere. If you want the renaming to persist in your project, you need to use the assignment operator and give the result a name. For instance:
df <- gapminder %>%
  select(
    nation = country,           # 'country' becomes 'nation'
    year,
    life_expectancy = lifeExp   # 'lifeExp' becomes 'life_expectancy'
  )
We can now compare the dataset stored as ‘df’ (columns renamed) with ‘gapminder’ (original) to see if it worked. It also means we can always go back to the original, as long as we do not overwrite it and we remember its name.
## # A tibble: 1,704 × 3
## nation year life_expectancy
## <fct> <int> <dbl>
## 1 Afghanistan 1952 28.8
## 2 Afghanistan 1957 30.3
## 3 Afghanistan 1962 32.0
## 4 Afghanistan 1967 34.0
## 5 Afghanistan 1972 36.1
## 6 Afghanistan 1977 38.4
## 7 Afghanistan 1982 39.9
## 8 Afghanistan 1987 40.8
## 9 Afghanistan 1992 41.7
## 10 Afghanistan 1997 41.8
## # ℹ 1,694 more rows
## # A tibble: 1,704 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
## 7 Afghanistan Asia 1982 39.9 12881816 978.
## 8 Afghanistan Asia 1987 40.8 13867957 852.
## 9 Afghanistan Asia 1992 41.7 16317921 649.
## 10 Afghanistan Asia 1997 41.8 22227415 635.
## # ℹ 1,694 more rows
As stated above, we can chain operations. Let’s work from the original gapminder:
df <- gapminder %>%
  filter(year == 2007) %>%   # keep only 2007, then pipe the result into select()
  select(
    nation = country,
    year,
    life_expectancy = lifeExp
  )
## # A tibble: 142 × 3
## nation year life_expectancy
## <fct> <int> <dbl>
## 1 Afghanistan 2007 43.8
## 2 Albania 2007 76.4
## 3 Algeria 2007 72.3
## 4 Angola 2007 42.7
## 5 Argentina 2007 75.3
## 6 Australia 2007 81.2
## 7 Austria 2007 79.8
## 8 Bahrain 2007 75.6
## 9 Bangladesh 2007 64.1
## 10 Belgium 2007 79.4
## # ℹ 132 more rows
Note that we overwrote the prior ‘df’ here, since we used the same name. Often, you will want to keep working under a consistent name (e.g., your processed dataset as ‘df’). But you might also want different names so that you can backtrack without overwriting anything. In that case, if you use multiple names for the different steps of processing your dataset, make sure you keep track of them all.
We can also combine multiple conditions:
# Show European countries in 2007 with life expectancy over 75
gapminder %>%
  filter(
    continent == "Europe",
    year == 2007,
    lifeExp > 75
  )
## # A tibble: 22 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Albania Europe 2007 76.4 3600523 5937.
## 2 Austria Europe 2007 79.8 8199783 36126.
## 3 Belgium Europe 2007 79.4 10392226 33693.
## 4 Croatia Europe 2007 75.7 4493312 14619.
## 5 Czech Republic Europe 2007 76.5 10228744 22833.
## 6 Denmark Europe 2007 78.3 5468120 35278.
## 7 Finland Europe 2007 79.3 5238460 33207.
## 8 France Europe 2007 80.7 61083916 30470.
## 9 Germany Europe 2007 79.4 82400996 32170.
## 10 Greece Europe 2007 79.5 10706290 27538.
## # ℹ 12 more rows
When you list conditions with commas, R requires ALL conditions to be true (AND logic).
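Commas inside filter() behave like the AND operator (&). The sketch below, using hypothetical object names, writes the same three conditions both ways and checks that they keep the same rows:

```r
library(dplyr)
library(gapminder)

# Conditions separated by commas
with_commas <- gapminder %>%
  filter(continent == "Europe", year == 2007, lifeExp > 75)

# The same conditions joined explicitly with &
with_ampersand <- gapminder %>%
  filter(continent == "Europe" & year == 2007 & lifeExp > 75)

# Both versions keep the same number of rows and the same countries
nrow(with_commas) == nrow(with_ampersand)
all(with_commas$country == with_ampersand$country)
```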
Sometimes we want rows that meet ANY of our conditions. We use the OR operator (|):
# Show data for either Europe or Asia
gapminder %>%
  filter(continent == "Europe" | continent == "Asia")
## # A tibble: 756 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
## 7 Afghanistan Asia 1982 39.9 12881816 978.
## 8 Afghanistan Asia 1987 40.8 13867957 852.
## 9 Afghanistan Asia 1992 41.7 16317921 649.
## 10 Afghanistan Asia 1997 41.8 22227415 635.
## # ℹ 746 more rows
Note the use of == for comparison. Here are the main comparison operators:
== equals
!= does not equal
> greater than
< less than
>= greater than or equal to
<= less than or equal to
Some helpful filtering functions:
# Show countries with population between 1 million and 10 million in 2007
gapminder %>%
  filter(
    year == 2007,
    between(pop, 1000000, 10000000)
  )
## # A tibble: 58 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Albania Europe 2007 76.4 3600523 5937.
## 2 Austria Europe 2007 79.8 8199783 36126.
## 3 Benin Africa 2007 56.7 8078314 1441.
## 4 Bolivia Americas 2007 65.6 9119152 3822.
## 5 Bosnia and Herzegovina Europe 2007 74.9 4552198 7446.
## 6 Botswana Africa 2007 50.7 1639131 12570.
## 7 Bulgaria Europe 2007 73.0 7322858 10681.
## 8 Burundi Africa 2007 49.6 8390505 430.
## 9 Central African Republic Africa 2007 44.7 4369038 706.
## 10 Congo, Rep. Africa 2007 55.3 3800610 3633.
## # ℹ 48 more rows
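The 36-row output that follows (beginning with Denmark) comes from a chunk that is not shown here. A helper that produces this kind of result is %in%, which keeps rows whose value matches any entry in a list. The sketch below is an assumption: only Denmark is visible in the output, and the other two countries were picked for illustration.

```r
library(dplyr)
library(gapminder)

# Hypothetical selection; only "Denmark" is confirmed by the visible output
chosen <- c("Denmark", "Norway", "Sweden")

chosen_rows <- gapminder %>%
  filter(country %in% chosen)   # keep rows for any country in the list

chosen_rows
```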
## # A tibble: 36 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Denmark Europe 1952 70.8 4334000 9692.
## 2 Denmark Europe 1957 71.8 4487831 11100.
## 3 Denmark Europe 1962 72.4 4646899 13583.
## 4 Denmark Europe 1967 73.0 4838800 15937.
## 5 Denmark Europe 1972 73.5 4991596 18866.
## 6 Denmark Europe 1977 74.7 5088419 20423.
## 7 Denmark Europe 1982 74.6 5117810 21688.
## 8 Denmark Europe 1987 74.8 5127024 25116.
## 9 Denmark Europe 1992 75.3 5171393 26407.
## 10 Denmark Europe 1997 76.1 5283663 29804.
## # ℹ 26 more rows
Here are some common mistakes to avoid:
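The list of mistakes is not shown here. The sketch below illustrates three pitfalls that beginners often run into with filter(); they are offered as likely candidates, not as the original list. Lines that would raise an error are commented out so the block runs as written.

```r
library(dplyr)
library(gapminder)

# Mistake 1: using = (names an argument) instead of == (compares values)
# gapminder %>% filter(year = 2007)         # raises an error
gapminder %>% filter(year == 2007)          # correct

# Mistake 2: forgetting quotes around text values
# gapminder %>% filter(continent == Europe) # error: object 'Europe' not found
gapminder %>% filter(continent == "Europe") # correct

# Mistake 3: wrong capitalisation (R is case-sensitive)
wrong_case <- gapminder %>% filter(continent == "europe")
nrow(wrong_case)                            # runs without error but keeps no rows
```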
As stated before, the real power comes from combining (or ‘chaining’ / ‘piping’) these operations:
# Look at life expectancy in large European countries
gapminder %>%
  # First, filter for our cases of interest
  filter(
    continent == "Europe",
    year == 2007,
    pop > 5000000
  ) %>%
  # Then select just the variables we want to see
  select(country, pop, lifeExp) %>%
  # Finally, arrange by life expectancy
  arrange(desc(lifeExp))
## # A tibble: 22 × 3
## country pop lifeExp
## <fct> <int> <dbl>
## 1 Switzerland 7554661 81.7
## 2 Spain 40448191 80.9
## 3 Sweden 9031088 80.9
## 4 France 61083916 80.7
## 5 Italy 58147733 80.5
## 6 Austria 8199783 79.8
## 7 Netherlands 16570613 79.8
## 8 Greece 10706290 79.5
## 9 Belgium 10392226 79.4
## 10 United Kingdom 60776238 79.4
## # ℹ 12 more rows
In this example, we:
Start with our full dataset (gapminder)
Filter to specific cases we’re interested in
Select just the variables we need
Arrange the results in a meaningful order
After any data manipulation, it’s crucial to check your results:
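The chunk listing the checking functions is not shown here. The sketch below uses a few functions that are commonly used for this purpose; the selection and the object name "check_df" are assumptions:

```r
library(dplyr)
library(gapminder)

check_df <- gapminder %>% filter(year == 2007)   # hypothetical example object

nrow(check_df)      # number of rows (observations)
dim(check_df)       # number of rows and columns
glimpse(check_df)   # column names, types, and first few values
head(check_df)      # first six rows
```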
Use these functions to check your work, notably to see whether you removed observations. One thing that will happen in your R journey is that you will inadvertently remove ALL observations (i.e., 0 obs). So always check to make sure you did not introduce errors or other important issues. As we will see in a later session, at times we want to remove specific rows, say if they contain only NA values or if they are duplicates. In that case, the number of observations is expected to go down at least slightly. But in the case below it should not:
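A sketch of the check, assuming ‘df’ is the filtered and renamed 2007 dataset built earlier. Selecting and renaming columns should leave the number of rows unchanged:

```r
library(dplyr)
library(gapminder)

# Rebuild df as in the chained example: filter to 2007, then select and rename
df <- gapminder %>%
  filter(year == 2007) %>%
  select(nation = country, year, life_expectancy = lifeExp)

nrow(df)   # should match the number of 2007 rows in gapminder
```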
## [1] 142
Still 142 obs. If you are working in RStudio, you can check this directly in the Environment pane (the top-right window by default).
In the next section, we’ll learn about creating new variables and calculating summaries, building on these fundamental skills.
Often, we need to create new variables based on existing ones. The mutate() function helps us do this.
Let’s start with simple arithmetic:
# Calculate total GDP
d <- gapminder %>%
  mutate(
    gdp_total = pop * gdpPercap,    # Multiply population by GDP per capita
    gdp_billion = gdp_total / 1e9   # Convert to billions
  ) %>%
  select(country, year, gdp_total, gdp_billion)
Note how:
We can create multiple new variables at once
We can use variables we just created (gdp_total)
The new variables are added to the right of the dataset
Let’s check what we did:
## # A tibble: 1,704 × 4
## country year gdp_total gdp_billion
## <fct> <int> <dbl> <dbl>
## 1 Afghanistan 1952 6567086330. 6.57
## 2 Afghanistan 1957 7585448670. 7.59
## 3 Afghanistan 1962 8758855797. 8.76
## 4 Afghanistan 1967 9648014150. 9.65
## 5 Afghanistan 1972 9678553274. 9.68
## 6 Afghanistan 1977 11697659231. 11.7
## 7 Afghanistan 1982 12598563401. 12.6
## 8 Afghanistan 1987 11820990309. 11.8
## 9 Afghanistan 1992 10595901589. 10.6
## 10 Afghanistan 1997 14121995875. 14.1
## # ℹ 1,694 more rows
We can use any R function within mutate():
# Create rounded and logged versions of population
pop <- gapminder %>%
  mutate(
    pop_million = round(pop / 1e6, 1),   # Population in millions, rounded to 1 decimal
    pop_log = log(pop)                   # Natural log of population
  ) %>%
  select(country, year, pop, pop_million, pop_log)
pop
## # A tibble: 1,704 × 5
## country year pop pop_million pop_log
## <fct> <int> <int> <dbl> <dbl>
## 1 Afghanistan 1952 8425333 8.4 15.9
## 2 Afghanistan 1957 9240934 9.2 16.0
## 3 Afghanistan 1962 10267083 10.3 16.1
## 4 Afghanistan 1967 11537966 11.5 16.3
## 5 Afghanistan 1972 13079460 13.1 16.4
## 6 Afghanistan 1977 14880372 14.9 16.5
## 7 Afghanistan 1982 12881816 12.9 16.4
## 8 Afghanistan 1987 13867957 13.9 16.4
## 9 Afghanistan 1992 16317921 16.3 16.6
## 10 Afghanistan 1997 22227415 22.2 16.9
## # ℹ 1,694 more rows
Common functions used in mutate():
round(): Round numbers
log(): Natural logarithm
sqrt(): Square root
abs(): Absolute value
Often, we want to create categories based on values:
gpc <- gapminder %>%
  mutate(
    development_level = case_when(
      gdpPercap < 1000 ~ "Low income",
      gdpPercap < 10000 ~ "Middle income",
      TRUE ~ "High income"   # Default case
    )
  ) %>%
  select(country, year, gdpPercap, development_level)
gpc
## # A tibble: 1,704 × 4
## country year gdpPercap development_level
## <fct> <int> <dbl> <chr>
## 1 Afghanistan 1952 779. Low income
## 2 Afghanistan 1957 821. Low income
## 3 Afghanistan 1962 853. Low income
## 4 Afghanistan 1967 836. Low income
## 5 Afghanistan 1972 740. Low income
## 6 Afghanistan 1977 786. Low income
## 7 Afghanistan 1982 978. Low income
## 8 Afghanistan 1987 852. Low income
## 9 Afghanistan 1992 649. Low income
## 10 Afghanistan 1997 635. Low income
## # ℹ 1,694 more rows
case_when() is powerful because:
It can handle multiple conditions
Conditions are checked in order
You can set a default with TRUE
It’s more readable than nested if-else
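Because conditions are checked in order, the first one that is true decides the label. The sketch below uses three made-up GDP values (not taken from the dataset) to show what happens when the order is reversed:

```r
library(dplyr)

gdp <- c(500, 5000, 50000)   # made-up values for illustration

# Narrowest condition first: each value gets the intended label
in_order <- case_when(
  gdp < 1000  ~ "Low income",
  gdp < 10000 ~ "Middle income",
  TRUE        ~ "High income"
)
in_order

# Broader condition first: 500 also satisfies gdp < 10000,
# so it is labelled "Middle income" and "Low income" is never reached
reversed <- case_when(
  gdp < 10000 ~ "Middle income",
  gdp < 1000  ~ "Low income",
  TRUE        ~ "High income"
)
reversed
```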
While we’ll dive deeper into descriptive statistics next session, let’s start our journey and look at some basic summaries.
The simplest summary is counting:
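The chunk for this step is not shown here. The output that follows (rows per continent, largest first) is consistent with count() and sort = TRUE, as in this sketch:

```r
library(dplyr)
library(gapminder)

# How many rows (country-year observations) per continent?
continent_counts <- gapminder %>%
  count(continent, sort = TRUE)

continent_counts
```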
## # A tibble: 5 × 2
## continent n
## <fct> <int>
## 1 Africa 624
## 2 Asia 396
## 3 Europe 360
## 4 Americas 300
## 5 Oceania 24
# How many countries per continent in 2007?
gapminder %>%
  filter(year == 2007) %>%
  count(continent, sort = TRUE)
## # A tibble: 5 × 2
## continent n
## <fct> <int>
## 1 Africa 52
## 2 Asia 33
## 3 Europe 30
## 4 Americas 25
## 5 Oceania 2
We can calculate basic statistics for our variables:
# Summary statistics for life expectancy
gapminder %>%
  summarise(
    mean_life = mean(lifeExp),
    median_life = median(lifeExp),
    min_life = min(lifeExp),
    max_life = max(lifeExp),
    sd_life = sd(lifeExp)
  )
## # A tibble: 1 × 5
## mean_life median_life min_life max_life sd_life
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 59.5 60.7 23.6 82.6 12.9
Most often, we want summaries by groups:
# Life expectancy statistics by continent in 2007
gapminder %>%
  filter(year == 2007) %>%
  group_by(continent) %>%
  summarise(
    countries = n(),
    mean_life = mean(lifeExp),
    min_life = min(lifeExp),
    max_life = max(lifeExp)
  ) %>%
  arrange(desc(mean_life))
## # A tibble: 5 × 5
## continent countries mean_life min_life max_life
## <fct> <int> <dbl> <dbl> <dbl>
## 1 Oceania 2 80.7 80.2 81.2
## 2 Europe 30 77.6 71.8 81.8
## 3 Americas 25 73.6 60.9 80.7
## 4 Asia 33 70.7 43.8 82.6
## 5 Africa 52 54.8 39.6 76.4
As we conclude this introduction to R, here are key practices to remember:
Here are 5 exercises to help you practice what we covered today (using the same data and skills).
Practice creating and printing objects:
Create an object called ‘x’ and store the number 42 in it
Create an object called ‘y’ and store the text “hello” in it
Print both objects
Check if ‘x’ is numeric using is.numeric()
Check if ‘y’ is numeric using is.numeric()
Hint: Remember how we used the arrow (<-) for assignment and print() function
Using select(), create a new dataset that includes:
country
year
population
GDP per capita
Hint: This is just like what we did with life expectancy, but choosing different variables
Filter the gapminder dataset to show:
Only data from the year 1997
Only countries from Africa
Only these two variables: country and population
Hint: Remember how we filtered for 2007 and Europe? Just change those values
Create a new dataset that:
Starts with data from 2007
Creates a new column that converts population to millions
Shows only: country, continent, population, and your new population in millions column
Hint: Look at how we converted GDP to billions, but use different math for millions
Count how many countries are in each continent in the year 2007.
Hint: Remember how we used count() with continent
For your weekly diary reflection, consider:
Which exercises could you complete successfully?
Where did you get stuck?
What helped you overcome any challenges?
What would you like to practice more?
Remember: The goal is learning! Share your experiences, ask questions, and help others when you can.