Introduction to the R Environment

=================================

Hey future data analysts! R is a powerful tool for statistical analysis and data visualization. Let’s start with the basics and set you on a journey towards becoming an R ninja.

—————

The Very Basics

—————

Move your mouse near the Run symbol where you see a plus sign with a C. Click on it, then on R.

You will see a grey bar appear with the following symbol at the top: ```{r}

In there copy the following and run it by clicking the green play button that is at the extreme right of the box:

print('hell yeah this works!')
## [1] "hell yeah this works!"

Everything that is in that box is R code. Everything outside is text.

You can test a code chunk by clicking the run button as we just did. If it does not work, R will give you an error message.

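For example, if you enter print( on its own, R cannot run it because the closing parenthesis is missing. Note that the chunk below is different: (print) is valid code, so instead of an error R prints the definition of the print function.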

(print)
## function (x, ...) 
## UseMethod("print")
## <bytecode: 0x000002419cdc2f38>
## <environment: namespace:base>

Let’s practice another skill

Want to see what it looks like?

Add some text here.

Winter

Some with two hashtags (for titles)

followed by text

Snowball

Others just straight typing. Anything you would like.

When you are ready: go to Knit, then click Knit to HTML.

Wait, it will take some time. If it works, it will open a new window. There you can click Publish and create an account to post HTML documents online for free on the RPubs platform. You will need to become familiar with this to submit homework. For the homework, you will need to showcase your code and provide text answering the questions (e.g., interpretation of the output).

Variables in R

—————

Variables are like named storage boxes where you can put information to use later.

Example:

my_variable <- 10

This means: “store the number 10 in a box named my_variable”.

Accessing the stored value is as simple as typing the variable’s name (also look in the window on the right, where the variable is now listed!)

my_variable  # This should display 10.
## [1] 10

Now, let’s do some basic math.

2 + 2 
## [1] 4

Although this seems very basic for now, R will become a powerful tool to do calculations for us – including when we have thousands of observations!
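To make the “thousands of observations” point a bit more concrete, here is a minimal sketch in base R. The variable names (observations, total, average, doubled) are made up for this illustration.

```r
# Store 1,000 numbers (1, 2, ..., 1000) in a single variable
observations <- 1:1000

# One function call works on all the values at once
total   <- sum(observations)    # adds up every value
average <- mean(observations)   # arithmetic mean

# Arithmetic is applied to each element in turn
doubled <- observations * 2

total
average
head(doubled)   # first six doubled values
```

The same idea carries over to real datasets: a column is just a long vector of values, and one call summarizes all of them.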

But R can do way more than that!

——————–

Loading Libraries

——————–

Libraries in R are like apps on your phone. They provide additional functionality.

But before we use them, we need to install and then load them.

We will keep updating the list of packages we need as the course goes on.

The following chunk of code lets you simply list the packages you need for your project. For each tutorial, we will do this as Step 1 and then load our data. When you work, you can just add any package you need to the list (e.g., after “fst”) and run it.

Today, we only need the following three.

# List of packages
packages <- c("tidyverse", "gapminder", "fst") # add any you need here

# Install packages if they aren't installed already
new_packages <- packages[!(packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)

# Load the packages
lapply(packages, library, character.only = TRUE)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## [[1]]
##  [1] "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"     "readr"    
##  [7] "tidyr"     "tibble"    "ggplot2"   "tidyverse" "stats"     "graphics" 
## [13] "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## [[2]]
##  [1] "gapminder" "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"    
##  [7] "readr"     "tidyr"     "tibble"    "ggplot2"   "tidyverse" "stats"    
## [13] "graphics"  "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## [[3]]
##  [1] "fst"       "gapminder" "lubridate" "forcats"   "stringr"   "dplyr"    
##  [7] "purrr"     "readr"     "tidyr"     "tibble"    "ggplot2"   "tidyverse"
## [13] "stats"     "graphics"  "grDevices" "utils"     "datasets"  "methods"  
## [19] "base"

—————————-

Dealing with Errors

—————————-

“To err is human.” If you make a mistake, R will tell you. It might say ‘unexpected symbol’ if you forgot a comma or bracket. For instance, try running the line below. Remove the hashtag first to see the error (note: when you do so, you should see a big red X on the left-hand side).

# print("Hello, World!)

See the error? You forgot the closing double quote.

Here’s a common mistake beginners make: using a function without loading its package. Uncomment the next line and run it.

# select(gapminder, country, year)

If dplyr (part of the tidyverse) has not been loaded, this throws an error, because the function ‘select()’ comes from the dplyr package.

The fix? Make sure the tidyverse package is loaded.

If a package is not loaded, then it’s like not having it at all. If it’s loaded, then you can use its functions. So if you get an error that says ‘could not find function "X"’, the fix is as simple as finding out which package contains that function and making sure it’s loaded.

library(tidyverse)

The tidyverse is a collection of R packages that share common data representations and API design. This collection is particularly handy for data science tasks.

And always remember: An error is not the end; it’s just feedback. Googling the error message usually helps!
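If you would like to see what the ‘could not find function’ message looks like without stopping your document, the sketch below captures it as text. It uses only base R; some_missing_function is a made-up name that deliberately does not exist.

```r
# Try to call a function that does not exist, and capture the error
# message instead of halting. `some_missing_function` is a made-up name.
msg <- tryCatch(
  some_missing_function(1),
  error = function(e) conditionMessage(e)
)

msg   # the text of the error, which names the missing function
```

The message names the function R could not find, which is usually the best search term when you look for the package that provides it.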

Another really useful trick: typing a ? in front of the name of the function that is prompting an error (e.g., ?select) opens its help page.
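As a small sketch of the help trick, the lines below use two base R functions, mean() and sd(), chosen only for illustration. The help calls are wrapped in interactive() so they run only when you are working in a live session.

```r
# Open a help page (two equivalent ways); only in a live R session
if (interactive()) {
  ?mean
  help("mean")
}

# Quick look at the arguments a function accepts
args(sd)

# The argument names, stored as text
arg_names <- names(formals(sd))
arg_names
```

Checking the argument names is often enough to spot a misspelled or missing argument.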

A VERY COMMON ISSUE is a small mistake in the syntax.

For example, forgetting a “,”, misspelling a function name, or not referring to variables or datasets properly.

SO, always double-check what you wrote after you get an error; do not just re-run it!

For example, all of the following will yield errors (meaning the code won’t run):

select(gapminder country, year)   # forgot the comma after gapminder
selct(gapminder, country, year)   # forgot an e in select
select(gpminder, country, year)   # forgot an a in gapminder

————————-

Exploring a first dataset

————————-

First, let’s get an overview of the ‘gapminder’ dataset

head(gapminder)  # This shows the first few rows of the dataset
## # A tibble: 6 × 6
##   country     continent  year lifeExp      pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan Asia       1952    28.8  8425333      779.
## 2 Afghanistan Asia       1957    30.3  9240934      821.
## 3 Afghanistan Asia       1962    32.0 10267083      853.
## 4 Afghanistan Asia       1967    34.0 11537966      836.
## 5 Afghanistan Asia       1972    36.1 13079460      740.
## 6 Afghanistan Asia       1977    38.4 14880372      786.

Want to know how big the dataset is?

dim(gapminder)  # Shows number of rows and columns
## [1] 1704    6

Or the number of rows

nrow(gapminder)
## [1] 1704

Number of columns

ncol(gapminder)
## [1] 6

Curious about the names of the columns (variables)?

names(gapminder)  # warning : not super useful if the dataset has many variables (i.e., columns)
## [1] "country"   "continent" "year"      "lifeExp"   "pop"       "gdpPercap"

Now, let’s actually look at it!

view(gapminder)  # opens the data in a spreadsheet-style viewer; View() with a capital V also works

Checking if variables are categorical or numerical

is.factor(gapminder$country) #Checks if a variable is categorical (factor).
## [1] TRUE
is.numeric(gapminder$country) #Checks if a variable is numerical
## [1] FALSE

For categorical variables, you can list the categories (levels). An ordinal variable has a clear order; a nominal one, like continent, does not.

levels(gapminder$continent)  
## [1] "Africa"   "Americas" "Asia"     "Europe"   "Oceania"

Another way to check the type of every variable at once:

sapply(gapminder, class)
##   country continent      year   lifeExp       pop gdpPercap 
##  "factor"  "factor" "integer" "numeric" "integer" "numeric"
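Coming back to the ordinal versus nominal distinction mentioned above, here is a minimal base R sketch. The variables colours and sizes hold made-up values and are not part of gapminder.

```r
# Nominal: categories with no inherent order
colours <- factor(c("red", "blue", "red"))
levels(colours)        # levels are sorted alphabetically by default

# Ordinal: categories with a meaningful order, declared explicitly
sizes <- factor(c("small", "large", "medium"),
                levels = c("small", "medium", "large"),
                ordered = TRUE)
levels(sizes)

is.ordered(colours)    # FALSE: nominal
is.ordered(sizes)      # TRUE: ordinal

# Comparisons such as "smaller than" only make sense for ordered factors
smaller_than_large <- sizes < "large"
smaller_than_large
```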

The pipe operator (%>%)

———————–

The pipe operator, %>%, allows you to chain operations, making the code more readable and intuitive.

For example, let’s say you want to:

  1. Filter the ‘gapminder’ dataset for the year 2007.
  2. Extract data only for countries in Asia.
  3. Calculate the mean life expectancy.

Without pipe:

mean(subset(gapminder, continent == "Asia" & year == 2007)$lifeExp, na.rm = TRUE)
## [1] 70.72848

With pipe:

gapminder %>%
  filter(year == 2007, continent == "Asia") %>%
  summarize(Average = mean(lifeExp, na.rm = TRUE))
## # A tibble: 1 × 1
##   Average
##     <dbl>
## 1    70.7

As you can see, the version with the pipe operator is more intuitive and reads almost like a sentence:

e.g., I want to “filter” (the function) to the year 2007 and the continent of Asia. Then, I want to “summarize” the column lifeExp by taking its mean, and call the result “Average”.

Each step is clearly sequenced, making it easier to understand and modify.
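One way to convince yourself that the pipe only changes how the code reads, not what it computes, is to run the same steps both ways. This is a small sketch assuming the dplyr and gapminder packages are installed; the names nested and piped are made up.

```r
library(dplyr)
library(gapminder)

# Nested call: you have to read it from the inside out
nested <- nrow(filter(gapminder, year == 2007))

# Piped call: you read it from top to bottom
piped <- gapminder %>%
  filter(year == 2007) %>%
  nrow()

identical(nested, piped)   # both approaches count the same rows
```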

Now let’s take a tour of important functions and operators in the tidyverse.

— R Functions and Operators Overview —

Let’s dive deep into some of the most commonly used functions from the “dplyr” package in R’s tidyverse.

—- 1. filter() Function —-

The filter() function is used to select rows that meet certain criteria.

Using filter() to select rows where the year is 2007:

gapminder %>%
  filter(year == 2007)
## # A tibble: 142 × 6
##    country     continent  year lifeExp       pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>     <int>     <dbl>
##  1 Afghanistan Asia       2007    43.8  31889923      975.
##  2 Albania     Europe     2007    76.4   3600523     5937.
##  3 Algeria     Africa     2007    72.3  33333216     6223.
##  4 Angola      Africa     2007    42.7  12420476     4797.
##  5 Argentina   Americas   2007    75.3  40301927    12779.
##  6 Australia   Oceania    2007    81.2  20434176    34435.
##  7 Austria     Europe     2007    79.8   8199783    36126.
##  8 Bahrain     Asia       2007    75.6    708573    29796.
##  9 Bangladesh  Asia       2007    64.1 150448339     1391.
## 10 Belgium     Europe     2007    79.4  10392226    33693.
## # ℹ 132 more rows

Using multiple conditions:

Here, the comma between the two conditions means “AND” (writing & instead gives the same result). This will select rows where the year is 2007 AND the continent is Asia.

gapminder %>%
  filter(year == 2007, continent == "Asia")
## # A tibble: 33 × 6
##    country          continent  year lifeExp        pop gdpPercap
##    <fct>            <fct>     <int>   <dbl>      <int>     <dbl>
##  1 Afghanistan      Asia       2007    43.8   31889923      975.
##  2 Bahrain          Asia       2007    75.6     708573    29796.
##  3 Bangladesh       Asia       2007    64.1  150448339     1391.
##  4 Cambodia         Asia       2007    59.7   14131858     1714.
##  5 China            Asia       2007    73.0 1318683096     4959.
##  6 Hong Kong, China Asia       2007    82.2    6980412    39725.
##  7 India            Asia       2007    64.7 1110396331     2452.
##  8 Indonesia        Asia       2007    70.6  223547000     3541.
##  9 Iran             Asia       2007    71.0   69453570    11606.
## 10 Iraq             Asia       2007    59.5   27499638     4471.
## # ℹ 23 more rows
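Conditions can also be combined with “OR”. The sketch below, which assumes dplyr and gapminder are installed, keeps the 2007 rows for either of two continents. The object names either and either_in are made up.

```r
library(dplyr)
library(gapminder)

# "|" means OR: keep 2007 rows where the continent is Asia or Europe
either <- gapminder %>%
  filter(year == 2007, continent == "Asia" | continent == "Europe")

# %in% is a shorter way to write the same OR condition
either_in <- gapminder %>%
  filter(year == 2007, continent %in% c("Asia", "Europe"))

nrow(either) == nrow(either_in)   # both versions keep the same number of rows
```

The %in% form becomes much easier to read once you have more than two values to match.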

—- 2. select() Function —-

The select() function is used to select specific columns.

Selecting the ‘country’ and ‘year’ columns:

gapminder %>%
  select(country, year)
## # A tibble: 1,704 × 2
##    country      year
##    <fct>       <int>
##  1 Afghanistan  1952
##  2 Afghanistan  1957
##  3 Afghanistan  1962
##  4 Afghanistan  1967
##  5 Afghanistan  1972
##  6 Afghanistan  1977
##  7 Afghanistan  1982
##  8 Afghanistan  1987
##  9 Afghanistan  1992
## 10 Afghanistan  1997
## # ℹ 1,694 more rows

Selecting based on a range of variables:

This selects all columns from ‘country’ through ‘year’, including both ends.

gapminder %>%
  select(country:year)
## # A tibble: 1,704 × 3
##    country     continent  year
##    <fct>       <fct>     <int>
##  1 Afghanistan Asia       1952
##  2 Afghanistan Asia       1957
##  3 Afghanistan Asia       1962
##  4 Afghanistan Asia       1967
##  5 Afghanistan Asia       1972
##  6 Afghanistan Asia       1977
##  7 Afghanistan Asia       1982
##  8 Afghanistan Asia       1987
##  9 Afghanistan Asia       1992
## 10 Afghanistan Asia       1997
## # ℹ 1,694 more rows
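select() can also drop columns instead of keeping them. Here is a brief sketch, assuming dplyr and gapminder are installed; without_pop is a made-up object name.

```r
library(dplyr)
library(gapminder)

# A minus sign in front of a column name removes that column
without_pop <- gapminder %>%
  select(-pop)

names(without_pop)   # every column except pop
```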

—- 3. rename() Function —-

The rename() function renames existing columns.

gapminder %>%
  rename(population = pop)
## # A tibble: 1,704 × 6
##    country     continent  year lifeExp population gdpPercap
##    <fct>       <fct>     <int>   <dbl>      <int>     <dbl>
##  1 Afghanistan Asia       1952    28.8    8425333      779.
##  2 Afghanistan Asia       1957    30.3    9240934      821.
##  3 Afghanistan Asia       1962    32.0   10267083      853.
##  4 Afghanistan Asia       1967    34.0   11537966      836.
##  5 Afghanistan Asia       1972    36.1   13079460      740.
##  6 Afghanistan Asia       1977    38.4   14880372      786.
##  7 Afghanistan Asia       1982    39.9   12881816      978.
##  8 Afghanistan Asia       1987    40.8   13867957      852.
##  9 Afghanistan Asia       1992    41.7   16317921      649.
## 10 Afghanistan Asia       1997    41.8   22227415      635.
## # ℹ 1,694 more rows

—- 4. mutate() Function —-

The mutate() function creates or modifies columns based on existing columns.

Creating a new column ‘total_gdp’ that is a product of ‘pop’ and ‘gdpPercap’:

gapminder %>%
  mutate(total_gdp = pop * gdpPercap)
## # A tibble: 1,704 × 7
##    country     continent  year lifeExp      pop gdpPercap    total_gdp
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>        <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.  6567086330.
##  2 Afghanistan Asia       1957    30.3  9240934      821.  7585448670.
##  3 Afghanistan Asia       1962    32.0 10267083      853.  8758855797.
##  4 Afghanistan Asia       1967    34.0 11537966      836.  9648014150.
##  5 Afghanistan Asia       1972    36.1 13079460      740.  9678553274.
##  6 Afghanistan Asia       1977    38.4 14880372      786. 11697659231.
##  7 Afghanistan Asia       1982    39.9 12881816      978. 12598563401.
##  8 Afghanistan Asia       1987    40.8 13867957      852. 11820990309.
##  9 Afghanistan Asia       1992    41.7 16317921      649. 10595901589.
## 10 Afghanistan Asia       1997    41.8 22227415      635. 14121995875.
## # ℹ 1,694 more rows
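Note that mutate() does not change gapminder itself: to keep the new column, store the result under a name. The sketch below assumes dplyr and gapminder are installed; with_millions and pop_millions are made-up names.

```r
library(dplyr)
library(gapminder)

with_millions <- gapminder %>%
  mutate(pop_millions = pop / 1e6,         # create a new column
         gdpPercap = round(gdpPercap))     # overwrite an existing column

names(with_millions)   # the stored result has the extra column
names(gapminder)       # the original dataset is unchanged
```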

—- 5. if_else() Function —-

The if_else() function returns a value based on a given condition.

Creating a new column ‘population_size’ that categorizes countries based on their population:

gapminder %>%
  mutate(population_size = if_else(pop > 1e6, "large", "small"))
## # A tibble: 1,704 × 7
##    country     continent  year lifeExp      pop gdpPercap population_size
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl> <chr>          
##  1 Afghanistan Asia       1952    28.8  8425333      779. large          
##  2 Afghanistan Asia       1957    30.3  9240934      821. large          
##  3 Afghanistan Asia       1962    32.0 10267083      853. large          
##  4 Afghanistan Asia       1967    34.0 11537966      836. large          
##  5 Afghanistan Asia       1972    36.1 13079460      740. large          
##  6 Afghanistan Asia       1977    38.4 14880372      786. large          
##  7 Afghanistan Asia       1982    39.9 12881816      978. large          
##  8 Afghanistan Asia       1987    40.8 13867957      852. large          
##  9 Afghanistan Asia       1992    41.7 16317921      649. large          
## 10 Afghanistan Asia       1997    41.8 22227415      635. large          
## # ℹ 1,694 more rows
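The first ten rows all say “large”, so it is worth checking how many rows fall into each category. One possible check, assuming dplyr and gapminder are installed, uses dplyr’s count(); size_counts is a made-up name.

```r
library(dplyr)
library(gapminder)

size_counts <- gapminder %>%
  mutate(population_size = if_else(pop > 1e6, "large", "small")) %>%
  count(population_size)   # number of rows in each category

size_counts
```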

—- 6. group_by() Function —-

The group_by() function divides the data into groups based on one or more variables.

Grouping data by ‘continent’:

gapminder %>%
  group_by(continent) %>%
  summarize(mean_lifeExp = mean(lifeExp, na.rm = TRUE))
## # A tibble: 5 × 2
##   continent mean_lifeExp
##   <fct>            <dbl>
## 1 Africa            48.9
## 2 Americas          64.7
## 3 Asia              60.1
## 4 Europe            71.9
## 5 Oceania           74.3

The above command groups the data by continent and calculates the average life expectancy for each continent.
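Grouping also works with more than one variable. As a sketch, again assuming dplyr and gapminder are installed, the code below computes one average per continent and year. The names by_continent_year and n_countries are made up.

```r
library(dplyr)
library(gapminder)

by_continent_year <- gapminder %>%
  group_by(continent, year) %>%
  summarize(mean_lifeExp = mean(lifeExp, na.rm = TRUE),
            n_countries  = n(),        # rows (countries) in each group
            .groups = "drop")          # return an ungrouped tibble

head(by_continent_year)
```

There is one row per combination of continent and year, so the result is longer than the five-row summary above.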

Checking data structure

———————–

Often, you would want to check the structure of your dataset to understand its columns and types.

str(gapminder)
## tibble [1,704 × 6] (S3: tbl_df/tbl/data.frame)
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ year     : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ lifeExp  : num [1:1704] 28.8 30.3 32 34 36.1 ...
##  $ pop      : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
##  $ gdpPercap: num [1:1704] 779 821 853 836 740 ...

———————-

Renaming dataset

———————-

Want to store the dataset under a new name? It is as simple as using <- (this makes a copy under the new name; gapminder itself stays available).

my_data <- gapminder

This is really useful as you do transformations or mutations on your data.

You can name each version after what you changed, so you know which one is the “raw_data” and which one is “cleaned” or “filtered”. You can also just pick an intuitive name based on what you are looking at, say if you are subsetting to just a few variables.

We will do so in future tutorials.
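As a small preview, here is what that naming habit might look like. This sketch assumes dplyr and gapminder are installed; raw_data, data_2007 and data_small are made-up names.

```r
library(dplyr)
library(gapminder)

raw_data <- gapminder                          # untouched copy

data_2007 <- raw_data %>%
  filter(year == 2007)                         # only the 2007 rows

data_small <- data_2007 %>%
  select(country, continent, lifeExp)          # only a few variables

dim(raw_data)
dim(data_2007)
dim(data_small)
```

Each name tells you which step produced the object, so you can always go back to an earlier version.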

—————–

Basic Plotting

—————–

Before diving deep into plotting, note that we will use the ggplot2 package, which was already loaded as part of the tidyverse.

Scatter plot of GDP per capita vs. Life Expectancy

ggplot(my_data, aes(x = gdpPercap, y = lifeExp)) + 
  geom_point(aes(color = continent), alpha = 0.5) + 
  labs(title = "Life Expectancy vs. GDP per Capita", 
       x = "GDP per Capita", 
       y = "Life Expectancy", 
       color = "Continent") + 
  theme_minimal()

Congrats! You made your first plot in R.

We can certainly do better in terms of visualization, but we will cover this in future tutorials!

For example, a simple log scale on the x-axis already makes the pattern easier to see. We will get to principles of data visualization in future sessions & tutorials.

# Use the raw gdpPercap here: the log transform is applied once, by scale_x_continuous()
ggplot(my_data, aes(x = gdpPercap, y = lifeExp)) + 
  geom_point(aes(color = continent), alpha = 0.5, size = 2) + 
  scale_x_continuous(trans='log10', labels = scales::comma) +
  labs(title = "Life Expectancy vs. GDP per Capita (Log-scaled)", 
       x = "GDP per Capita (Log-scaled)", 
       y = "Life Expectancy", 
       color = "Continent") + 
  theme_minimal()

—————————————

Now let’s turn to our class dataset

—————————————

—————————————————————————–

Introduction to the European Social Survey (ESS) dataset:

—————————————————————————–

The European Social Survey (ESS) is a biennial multi-country survey covering over 30 nations. It delves into diverse topics such as media use, social and public trust, political interest, subjective well-being, and more. Think of it as a treasure trove of information on European societies!

—————————————————————————–

Loading Data into R:

—————————————————————————–

There are typically multiple ways to load data into R. For example, if we had a .csv file:

data <- read.csv("path_to_file.csv")
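If you want to try read.csv() without having a file at hand, the sketch below writes a tiny made-up data frame to a temporary .csv file and reads it back. It uses only base R; toy, csv_path and toy_back are hypothetical names.

```r
# Build a tiny example data frame (made-up values)
toy <- data.frame(id = 1:3, score = c(2.5, 3.0, 4.5))

# Write it to a temporary .csv file, then read it back in
csv_path <- tempfile(fileext = ".csv")
write.csv(toy, csv_path, row.names = FALSE)
toy_back <- read.csv(csv_path)

toy_back   # same columns and values as toy
```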

However, we need to use a different method, since the dataset is quite large and the .csv file would be too large for many laptops.

So we will use the fst package, which stores the data in a compressed format (note: you only need its read_fst() function to load the file).

However, AND THIS IS VERY IMPORTANT, you need to create a folder for the course where you save your markdown file and the class dataset (which you download from the Quercus course website).

To double-check which folder R thinks you are working in, use the following command:

getwd()
## [1] "C:/Users/rayac/Downloads"

Make sure the folder shown when you run the command contains the class dataset (the file “All-ESS-Data.fst”).

Now, let’s load the course dataset
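A minimal sketch of the loading step is given here, assuming the file name mentioned above and that the fst package is installed. The object name ess_data is made up, and your own code may differ.

```r
# File name taken from the text above; it must sit in your working directory
ess_path <- "All-ESS-Data.fst"

if (file.exists(ess_path)) {
  # read_fst() comes from the fst package; `ess_data` is a made-up object name
  ess_data <- fst::read_fst(ess_path)
} else {
  message("File not found in ", getwd(), ". Check your working directory.")
}
```

The file.exists() check gives a readable message when the dataset has not yet been moved into the course folder.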

Voilà we have our data!

From what we covered today, can you tell us how many variables and observations there are?

To explore further before the next tutorial where we will really start working with the ESS data, check out their website:

https://ess.sikt.no/en/?tab=overview

That’s all for today!