Hey future data analysts! R is a powerful tool for statistical analysis and data visualization. Let’s start with the basics and set you on the path to becoming an R ninja.
Move your mouse near the Run symbol where you see a plus sign with a C. Click on it, then on R.
You will see a grey bar appear with the following symbol at the top: ```{r}
In there copy the following and run it by clicking the green play button that is at the extreme right of the box:
print('hell yeah this works!')
## [1] "hell yeah this works!"
Everything that is in that box is R code. Everything outside is text.
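You can test a code chunk by clicking the run button as we just did. If it does not work, R will print an error message.
For example, try entering print( on its own. (The chunk below runs (print) instead, which shows the function’s definition rather than an error.)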
(print)
## function (x, ...)
## UseMethod("print")
## <bytecode: 0x000002419cdc2f38>
## <environment: namespace:base>
Want to see what it looks like?
Add some text here.
Winter
Some with two hashtags (for titles)
followed by text
Others just straight typing. Anything you would like.
When you are ready, go to Knit, then click Knit to HTML.
Wait, as it will take some time. If it works, a new window will open; from there you can click Publish and create an account to post HTML documents online for free on the RPubs platform. You will need to become familiar with this to submit homework. For the homework, you will need to show your code and provide text answering the questions (e.g., interpretation of the output).
Variables are like named storage boxes where you can put information to use later.
Example:
my_variable <- 10
This means: “store the number 10 in a box named my_variable”.
Accessing the stored value is as simple as calling the variable’s name (look also in the right window!)
my_variable # This should display 10.
## [1] 10
Now, let’s do some basic math.
2 + 2
## [1] 4
Although this seems very basic for now, R will become a powerful tool to do calculations for us – including when we have thousands of observations!
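To make the “thousands of observations” point concrete, here is a minimal sketch using a handful of made-up numbers. The vector name scores and its values are purely illustrative and are not part of the course data; the same commands work unchanged on a vector with thousands of entries.

```r
# Hypothetical exam scores for five students (made-up values for illustration)
scores <- c(70, 85, 90, 65, 80)

# Arithmetic is applied to every element at once
curved <- scores + 5

# Summaries work on the whole vector, however long it is
average_score <- mean(scores)
n_scores <- length(scores)

print(curved)
print(average_score)
print(n_scores)
```

Storing the results in named variables (curved, average_score) means you can reuse them later without recomputing.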
But R can do way more than that!
Libraries in R are like apps on your phone. They provide additional functionality.
But before we use them, we need to install and then load them.
The following chunk of code lets you list the packages you need for your project; it installs any that are missing and then loads them all. For each tutorial, we will do this as Step 1 and then load our data. When you need another package, just add it to the list (e.g., after “gapminder”) and run the chunk again.
Today, we only need the following three.
# List of packages
packages <- c("tidyverse", "gapminder", "fst") # add any you need here
# Install packages if they aren't installed already
new_packages <- packages[!(packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)
# Load the packages
lapply(packages, library, character.only = TRUE)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## [[1]]
## [1] "lubridate" "forcats" "stringr" "dplyr" "purrr" "readr"
## [7] "tidyr" "tibble" "ggplot2" "tidyverse" "stats" "graphics"
## [13] "grDevices" "utils" "datasets" "methods" "base"
##
## [[2]]
## [1] "gapminder" "lubridate" "forcats" "stringr" "dplyr" "purrr"
## [7] "readr" "tidyr" "tibble" "ggplot2" "tidyverse" "stats"
## [13] "graphics" "grDevices" "utils" "datasets" "methods" "base"
##
## [[3]]
## [1] "fst" "gapminder" "lubridate" "forcats" "stringr" "dplyr"
## [7] "purrr" "readr" "tidyr" "tibble" "ggplot2" "tidyverse"
## [13] "stats" "graphics" "grDevices" "utils" "datasets" "methods"
## [19] "base"
“To err is human”. If you make a mistake, R will tell you. It might say ‘unexpected symbol’ if you forgot a comma or bracket. For instance, try running:
# print("Hello, World!) # uncomment this line to see the error (note: when you do so, you should see a big red X on the left-hand side)
See the error? You forgot the closing double quote.
Here’s a common mistake beginners make: using a function without loading its package. Uncomment the next line and run it.
select(gapminder, country, year) # This should throw an error as the function ‘select()’ is from the dplyr package.
The fix? Make sure the tidyverse package is loaded.
If a package is not loaded, it’s like not having it at all; once it’s loaded, you can use its functions. So if you get an error that says ‘could not find function "X"’, the fix is as simple as finding out which package contains that function and making sure it’s loaded.
library(tidyverse)
The tidyverse is a collection of R packages that share common data representations and API design. This collection is particularly handy for data science tasks.
And always remember: An error is not the end; it’s just feedback. Googling the error message usually helps!
Another really useful trick: type a ? in front of the name of the function that is prompting an error (e.g., ?select) to open its help page.
A VERY COMMON ISSUE is a small mistake in the syntax.
For example, forgetting a “,”, improperly calling a function, or not referring to variables or datasets properly.
SO, always double-check what you wrote after you get an error; do not just re-run it!
For example, all of the following will yield errors (meaning, they won’t run):
select(gapminder country, year) --> forgot a comma after gapminder
selct(gapminder, country, year) --> forgot an e in select
select(gpminder, country, year) --> forgot an a in gapminder
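It helps to recognize what R says for each kind of typo. The sketch below uses only base R; show_error() is a hypothetical helper written for this illustration (it is not a built-in function), and it simply returns the error message as text so the script keeps running. A missing comma is a different kind of mistake: R cannot even read the line, so it cannot be caught this way and is reported as an ‘unexpected symbol’.

```r
# Hypothetical helper (not part of R): run some code and return the error
# message as text instead of stopping the script
show_error <- function(expr) {
  tryCatch(expr, error = function(e) conditionMessage(e))
}

# Misspelled function name: R cannot find a function called "selct"
msg_function <- show_error(selct(1))

# Misspelled object name: R cannot find an object called "gpminder"
msg_object <- show_error(print(gpminder))

print(msg_function)
print(msg_object)
```

A “could not find function” message points to a misspelled function or an unloaded package, while an “object ... not found” message points to a misspelled dataset or variable name.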
First, let’s get an overview of the ‘gapminder’ dataset
head(gapminder) # This shows the first few rows of the dataset
## # A tibble: 6 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
Want to know how big the dataset is?
dim(gapminder) # Shows number of rows and columns
## [1] 1704 6
Or the number of rows
nrow(gapminder)
## [1] 1704
Number of columns
ncol(gapminder)
## [1] 6
Curious about the names of the columns (variables)?
names(gapminder) # warning : not super useful if the dataset has many variables (i.e., columns)
## [1] "country" "continent" "year" "lifeExp" "pop" "gdpPercap"
Now, let’s actually look at it!
view(gapminder)
is.factor(gapminder$country) #Checks if a variable is categorical (factor).
## [1] TRUE
is.numeric(gapminder$country) #Checks if a variable is numerical
## [1] FALSE
levels(gapminder$continent)
## [1] "Africa" "Americas" "Asia" "Europe" "Oceania"
sapply(gapminder, class)
## country continent year lifeExp pop gdpPercap
## "factor" "factor" "integer" "numeric" "integer" "numeric"
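The pipe operator, %>%, allows you to chain operations, making the code more readable and intuitive.
For example, let’s say you want to compute the average life expectancy of Asian countries in 2007. Here it is first with nested function calls, then with the pipe: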
mean(subset(gapminder, continent == "Asia" & year == 2007)$lifeExp, na.rm = TRUE)
## [1] 70.72848
gapminder %>%
filter(year == 2007, continent == "Asia") %>%
summarize(Average = mean(lifeExp, na.rm = TRUE))
## # A tibble: 1 × 1
## Average
## <dbl>
## 1 70.7
As you can see, the version with the pipe operator is more intuitive; it reads almost like a sentence:
e.g., I want to “filter” (the function) the data to the year 2007 and the continent of Asia. Then, I want to “summarize” the column lifeExp by taking its mean and call the result “Average”.
Each step is clearly sequenced, making it easier to understand and modify.
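To see what the pipe does mechanically, here is a small sketch on a toy vector (the numbers are made up for illustration). The pipe takes whatever is on its left and hands it to the function on its right as the first argument, so the two versions below should give the same result.

```r
library(dplyr)  # loading dplyr makes the %>% pipe available

# Nested version: read from the inside out
nested <- sum(sqrt(c(1, 4, 9)))

# Piped version: read from left to right, one step per line
piped <- c(1, 4, 9) %>%
  sqrt() %>%
  sum()

print(nested)
print(piped)
print(identical(nested, piped))
```

With only two steps the difference is small; it grows quickly once you chain four or five operations.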
Now let’s take a tour of important functions and operators in the tidyverse.
Let’s dive deep into some of the most commonly used functions from the “dplyr” package in R’s tidyverse.
The filter() function is used to select rows that meet certain criteria.
Using filter() to select rows where the year is 2007:
gapminder %>%
filter(year == 2007)
## # A tibble: 142 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 2007 43.8 31889923 975.
## 2 Albania Europe 2007 76.4 3600523 5937.
## 3 Algeria Africa 2007 72.3 33333216 6223.
## 4 Angola Africa 2007 42.7 12420476 4797.
## 5 Argentina Americas 2007 75.3 40301927 12779.
## 6 Australia Oceania 2007 81.2 20434176 34435.
## 7 Austria Europe 2007 79.8 8199783 36126.
## 8 Bahrain Asia 2007 75.6 708573 29796.
## 9 Bangladesh Asia 2007 64.1 150448339 1391.
## 10 Belgium Europe 2007 79.4 10392226 33693.
## # ℹ 132 more rows
Using multiple conditions:
Inside filter(), conditions separated by a comma are combined with “AND” (the same as joining them with the ‘&’ operator). This will select rows where the year is 2007 AND the continent is Asia.
gapminder %>%
filter(year == 2007, continent == "Asia")
## # A tibble: 33 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 2007 43.8 31889923 975.
## 2 Bahrain Asia 2007 75.6 708573 29796.
## 3 Bangladesh Asia 2007 64.1 150448339 1391.
## 4 Cambodia Asia 2007 59.7 14131858 1714.
## 5 China Asia 2007 73.0 1318683096 4959.
## 6 Hong Kong, China Asia 2007 82.2 6980412 39725.
## 7 India Asia 2007 64.7 1110396331 2452.
## 8 Indonesia Asia 2007 70.6 223547000 3541.
## 9 Iran Asia 2007 71.0 69453570 11606.
## 10 Iraq Asia 2007 59.5 27499638 4471.
## # ℹ 23 more rows
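As a quick check on how conditions combine in filter(), the sketch below compares a comma with ‘&’ and adds the ‘|’ operator, which means “OR”. The object names (with_comma, with_and, asia_or_oceania) are chosen just for this illustration.

```r
library(dplyr)
library(gapminder)

# A comma between conditions means AND ...
with_comma <- gapminder %>%
  filter(year == 2007, continent == "Asia")

# ... which is the same as writing & explicitly
with_and <- gapminder %>%
  filter(year == 2007 & continent == "Asia")

# | means OR: keep 2007 rows from Asia or from Oceania
asia_or_oceania <- gapminder %>%
  filter(year == 2007, continent == "Asia" | continent == "Oceania")

print(nrow(with_comma))
print(nrow(with_and))
print(nrow(asia_or_oceania))
```

The first two row counts should match, and the third should be larger because it also keeps the Oceania rows.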
The select() function is used to select specific columns.
Selecting the ‘country’ and ‘year’ columns:
gapminder %>%
select(country, year)
## # A tibble: 1,704 × 2
## country year
## <fct> <int>
## 1 Afghanistan 1952
## 2 Afghanistan 1957
## 3 Afghanistan 1962
## 4 Afghanistan 1967
## 5 Afghanistan 1972
## 6 Afghanistan 1977
## 7 Afghanistan 1982
## 8 Afghanistan 1987
## 9 Afghanistan 1992
## 10 Afghanistan 1997
## # ℹ 1,694 more rows
Selecting based on a range of variables:
This selects all columns from ‘country’ through ‘year’ (both included), in the order they appear in the dataset.
gapminder %>%
select(country:year)
## # A tibble: 1,704 × 3
## country continent year
## <fct> <fct> <int>
## 1 Afghanistan Asia 1952
## 2 Afghanistan Asia 1957
## 3 Afghanistan Asia 1962
## 4 Afghanistan Asia 1967
## 5 Afghanistan Asia 1972
## 6 Afghanistan Asia 1977
## 7 Afghanistan Asia 1982
## 8 Afghanistan Asia 1987
## 9 Afghanistan Asia 1992
## 10 Afghanistan Asia 1997
## # ℹ 1,694 more rows
The rename() function renames existing columns, using the form new_name = old_name.
gapminder %>%
rename(population = pop)
## # A tibble: 1,704 × 6
## country continent year lifeExp population gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
## 7 Afghanistan Asia 1982 39.9 12881816 978.
## 8 Afghanistan Asia 1987 40.8 13867957 852.
## 9 Afghanistan Asia 1992 41.7 16317921 649.
## 10 Afghanistan Asia 1997 41.8 22227415 635.
## # ℹ 1,694 more rows
The mutate() function creates or modifies columns based on existing columns.
Creating a new column ‘total_gdp’ that is a product of ‘pop’ and ‘gdpPercap’:
gapminder %>%
mutate(total_gdp = pop * gdpPercap)
## # A tibble: 1,704 × 7
## country continent year lifeExp pop gdpPercap total_gdp
## <fct> <fct> <int> <dbl> <int> <dbl> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779. 6567086330.
## 2 Afghanistan Asia 1957 30.3 9240934 821. 7585448670.
## 3 Afghanistan Asia 1962 32.0 10267083 853. 8758855797.
## 4 Afghanistan Asia 1967 34.0 11537966 836. 9648014150.
## 5 Afghanistan Asia 1972 36.1 13079460 740. 9678553274.
## 6 Afghanistan Asia 1977 38.4 14880372 786. 11697659231.
## 7 Afghanistan Asia 1982 39.9 12881816 978. 12598563401.
## 8 Afghanistan Asia 1987 40.8 13867957 852. 11820990309.
## 9 Afghanistan Asia 1992 41.7 16317921 649. 10595901589.
## 10 Afghanistan Asia 1997 41.8 22227415 635. 14121995875.
## # ℹ 1,694 more rows
The if_else() function returns one value where a condition is true and another where it is false.
Creating a new column ‘population_size’ that labels each row “large” if the population is above one million (1e6) and “small” otherwise:
gapminder %>%
mutate(population_size = if_else(pop > 1e6, "large", "small"))
## # A tibble: 1,704 × 7
## country continent year lifeExp pop gdpPercap population_size
## <fct> <fct> <int> <dbl> <int> <dbl> <chr>
## 1 Afghanistan Asia 1952 28.8 8425333 779. large
## 2 Afghanistan Asia 1957 30.3 9240934 821. large
## 3 Afghanistan Asia 1962 32.0 10267083 853. large
## 4 Afghanistan Asia 1967 34.0 11537966 836. large
## 5 Afghanistan Asia 1972 36.1 13079460 740. large
## 6 Afghanistan Asia 1977 38.4 14880372 786. large
## 7 Afghanistan Asia 1982 39.9 12881816 978. large
## 8 Afghanistan Asia 1987 40.8 13867957 852. large
## 9 Afghanistan Asia 1992 41.7 16317921 649. large
## 10 Afghanistan Asia 1997 41.8 22227415 635. large
## # ℹ 1,694 more rows
The group_by() function divides the data into groups based on one or more variables.
Grouping data by ‘continent’:
gapminder %>%
group_by(continent) %>%
summarize(mean_lifeExp = mean(lifeExp, na.rm = TRUE))
## # A tibble: 5 × 2
## continent mean_lifeExp
## <fct> <dbl>
## 1 Africa 48.9
## 2 Americas 64.7
## 3 Asia 60.1
## 4 Europe 71.9
## 5 Oceania 74.3
The above command groups the data by continent and calculates the average life expectancy for each continent.
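Grouping becomes more useful when you ask for several summaries at once. The sketch below is one possible variation, not a required step: it restricts the data to 2007 and then, for each continent, counts the rows with n() and computes the mean and maximum life expectancy. The column names n_countries, mean_lifeExp, and max_lifeExp are names chosen for this example.

```r
library(dplyr)
library(gapminder)

continent_summary_2007 <- gapminder %>%
  filter(year == 2007) %>%            # keep one row per country
  group_by(continent) %>%             # one group per continent
  summarize(
    n_countries  = n(),                             # rows in each group
    mean_lifeExp = mean(lifeExp, na.rm = TRUE),     # average within group
    max_lifeExp  = max(lifeExp, na.rm = TRUE)       # highest within group
  )

print(continent_summary_2007)
```

Because 2007 has one row per country, n() here gives the number of countries per continent.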
Often, you would want to check the structure of your dataset to understand its columns and types.
str(gapminder)
## tibble [1,704 × 6] (S3: tbl_df/tbl/data.frame)
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ lifeExp : num [1:1704] 28.8 30.3 32 34 36.1 ...
## $ pop : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap: num [1:1704] 779 821 853 836 740 ...
Want to store the dataset under a new name? It’s as simple as using <- (the original gapminder stays available too).
my_data <- gapminder
This is really useful as you do transformations or mutations on your data.
You can name each version after what you changed so you know which is which, for example, which is the “raw_data” and which is “cleaned” or “filtered”. You can also just choose an intuitive name based on what you are looking at, say, if you are subsetting to just a few variables.
We will do so in future tutorials.
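As a small preview, here is a sketch of that naming habit using functions covered above. The object names (raw_data, asia_2007, asia_2007_small) are just suggestions; pick whatever describes your own steps.

```r
library(dplyr)
library(gapminder)

# Keep an untouched copy under a descriptive name
raw_data <- gapminder

# A filtered version: Asian countries in 2007
asia_2007 <- raw_data %>%
  filter(year == 2007, continent == "Asia")

# A narrower version with only the columns we care about
asia_2007_small <- asia_2007 %>%
  select(country, lifeExp)

# Compare the sizes (rows, columns) of each version
print(dim(raw_data))
print(dim(asia_2007))
print(dim(asia_2007_small))
```

Each step creates a new object, so the earlier versions stay intact if you need to go back.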
Before diving deep into plotting, note that we will use ggplot2, which was already loaded as part of the tidyverse.
Scatter plot of GDP per capita vs. Life Expectancy
ggplot(my_data, aes(x = gdpPercap, y = lifeExp)) +
geom_point(aes(color = continent), alpha = 0.5) +
labs(title = "Life Expectancy vs. GDP per Capita",
x = "GDP per Capita",
y = "Life Expectancy",
color = "Continent") +
theme_minimal()
Congrats! You made your first plot in R.
We can certainly do better in terms of visualization, but we will cover this in future tutorials!
For example, a simple log scale can already help us visualize better. We will get to principles of data visualization in future sessions & tutorials.
ggplot(my_data, aes(x = gdpPercap, y = lifeExp)) + # the log10 scale below does the transformation, so x stays unlogged
geom_point(aes(color = continent), alpha = 0.5, size = 2) +
scale_x_continuous(trans='log10', labels = scales::comma) +
labs(title = "Life Expectancy vs. GDP per Capita (Log-scaled)",
x = "GDP per Capita (Log-scaled)",
y = "Life Expectancy",
color = "Continent") +
theme_minimal()
The European Social Survey (ESS) is a biennial multi-country survey covering over 30 nations. It delves into diverse topics such as media use, social and public trust, political interest, subjective well-being, and more. Think of it as a treasure trove of information on European societies!
There are typically multiple ways to load data into R. For example, if we had a .csv file:
data <- read.csv("path_to_file.csv")
However, we need to use a different method since the dataset is quite large and the .csv file would be too large for many laptops.
So we will use the fst package, which stores the data in a compressed file (note: you only need to use the read_fst function).
However, AND THIS IS VERY IMPORTANT, you need to create a folder for the course where you save your markdown file and the class dataset (which you download from the Quercus course website).
To double-check which folder R thinks you are working from, use the following command:
getwd()
## [1] "C:/Users/rayac/Downloads"
Make sure the folder shown when you run the command contains the class dataset (which is “All-ESS-Data.fst”).
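Loading the file could then look like the sketch below. It assumes the file is named exactly “All-ESS-Data.fst” and sits in the working directory; the object name ess is only a suggestion, and the file.exists() check is an optional safeguard added here so that a missing file produces a readable message instead of an error.

```r
library(fst)

# File name taken from the text above; it must be in the working directory
ess_file <- "All-ESS-Data.fst"

if (file.exists(ess_file)) {
  # read_fst() reads the compressed file into a data frame
  ess <- read_fst(ess_file)  # `ess` is a name chosen for this sketch
  message("Loaded ", ess_file)
} else {
  message("Could not find ", ess_file, " in ", getwd())
}
```

If you see the “Could not find” message, move the dataset into the folder reported by getwd() and run the chunk again.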
Voilà, we have our data!
From what we covered today, can you tell us how many variables and observations there are?
To explore further before the next tutorial where we will really start working with the ESS data, check out their website:
https://ess.sikt.no/en/?tab=overview
That’s all for today!