Getting Started with R

Milkias Z. SEMEREAB

GEOL0097-Geostatistics, University of Liège

September 29, 2022

Arithmetic with R

In its most basic form, R can be used as a simple calculator. R has the following arithmetic operators:

Addition: +
Subtraction: -
Multiplication: *
Division: /
Exponentiation: ^
Modulo: %%

The last two might need some explaining: The ^ operator raises the number to its left to the power of the number to its right: for example 3^2 is 9. The modulo returns the remainder of the division of the number to the left by the number on its right, for example 5 modulo 3 or 5 %% 3 is 2.

# An addition
5 + 4

## [1] 9

# A subtraction
5 - 5

## [1] 0

# A multiplication
3 * 5

## [1] 15

 # A division
(5 + 5) / 2

## [1] 5

# Exponentiation
2 ^ 5

## [1] 32

# Modulo
28 %% 6

## [1] 4

Variable Assignment

A basic concept in (statistical) programming is called a variable.

A variable allows you to store a value (e.g. 4) or an object (e.g. a function description) in R. You can then later use this variable’s name to easily access the value or the object that is stored within this variable.

You can assign a value 4 to a variable my_var with the command:

# Assign the value 42 to x
x <- 42

# Print out the value of the variable x
print (x)

## [1] 42

# "=" can also be used as assignment operator
y = 54 

print(y)

## [1] 54

Basic data types in R

R works with numerous data types. Some of the most basic types to get started are:

Decimal values like 4.5 are called numerics.
Whole numbers like 4 are called integers. Integers are also numerics.
Boolean values (TRUE or FALSE) are called logical.
Text (or string) values are called characters.

# create a numeric called  my_numeric with value 42
my_numeric = 42
print(my_numeric)

## [1] 42

# create a string called my_character and assign it to be "universe"
my_character = "universe"
print(my_character)

## [1] "universe"

# create a logical variable called my_logical that is FALSE
my_logical = FALSE
print(my_logical)

## [1] FALSE

Note: how the quotation marks in the editor indicate that “some text” is a string.

What’s that data type?

If we add 5 + “six”, we get an error due to a mismatch in data types. You can avoid such embarrassing situations by checking the data type of a variable beforehand. You can do this with the class() function

# Check class of my_numeric
class(my_numeric)

## [1] "numeric"

# Check class of my_character
class(my_character)

## [1] "character"

# Check class of my_logical
class(my_logical)

## [1] "logical"

Vectors

Vectors are one-dimension arrays that can hold numeric data, character data, or logical data. In other words, a vector is a simple tool to store data.
In R, you create a vector with the combine function c(). You place the vector elements separated by a comma between the parentheses. For example:

numeric_vector <- c(1, 10, 49)
print(numeric_vector)

## [1]  1 10 49

character_vector <- c("a", "b", "c")
print(character_vector)

## [1] "a" "b" "c"

boolean_vector <- c(TRUE, FALSE, TRUE)
print(boolean_vector)

## [1]  TRUE FALSE  TRUE

You can give a name to the elements of a vector with the names() function. Have a look at this example:

messi_vector = c(170, 67, 35, 3)
names(messi_vector) <- c("height_cm", "weight_kg", "age", "children")
print(messi_vector)

## height_cm weight_kg       age  children 
##       170        67        35         3

To select specific elements of the vector (and later matrices, data frames, …), you can use square brackets. Between the square brackets, you indicate what elements to select. For example, to select the first and third elements of the messi_vector:

element_1 = messi_vector[1]
element_3 = messi_vector[3]
print(element_1)

## height_cm 
##       170

print(element_3)

## age 
##  35

# For vectors with names for its elements, you can also alterantively select specific elements by using name index instead of number index
messi_weight = messi_vector["weight_kg"]
messi_children = messi_vector["children"]
print(messi_weight)

## weight_kg 
##        67

print(messi_children)

## children 
##        3

# To select multiple elements from a vector, you indicate between the brackets what elements should be selected. 
messi_1_2 = messi_vector[c(1,2)]
print(messi_1_2)

## height_cm weight_kg 
##       170        67

# or also 
messi_weig_heig = messi_vector[c("height_cm", "weight_kg")]
print(messi_weig_heig)

## height_cm weight_kg 
##       170        67

# Selecting multiple elements with c(2, 3, 4) is not very convenient. Many statisticians are lazy people by nature, so they created an easier way to do this: c(2, 3, 4) can be abbreviated to 2:4, which generates a vector with all natural numbers from 2 up to 4.

messi_bio = messi_vector[1:3]
print(messi_bio)

## height_cm weight_kg       age 
##       170        67        35

Notice that the first element in a vector has index 1, not 0 as in many other programming languages.

Once you have create vectors in R, you can use them to do calculations using R built-in functions as shown in the example below:

# Create vectors to store grades of every course for Thomas and Victoria
grades_thomas <- c(14, 15, 16, 7, 14, 9)
grades_victoria <- c(12, 8, 17, 15, 15, 13)
names(grades_thomas) = c("math", "english", "physics", "chemistry", "IT", "biology")
names(grades_victoria) = c("math", "english", "physics", "chemistry", "IT", "biology")

# display the two grade vectors 
print(grades_thomas)

##      math   english   physics chemistry        IT   biology 
##        14        15        16         7        14         9

print(grades_victoria)

##      math   english   physics chemistry        IT   biology 
##        12         8        17        15        15        13

# Take the sum of the two vectors (element-wise addition) and assign it to new variables
total_grade <- grades_thomas + grades_victoria
print (total_grade)

##      math   english   physics chemistry        IT   biology 
##        26        23        33        22        29        22

# calculate the sum of  grades for thomas using the sum() function 
total_thomas = sum(grades_thomas)
print(total_thomas)

## [1] 75

# calculate the sum of  grades for victoria using the sum() function
total_victoria = sum(grades_victoria)
print(total_victoria)

## [1] 80

# Find  the maximum grade earned by Thomas using the max() function
thomas_max = max(grades_thomas)
print(thomas_max)

## [1] 16

# Find  the minimum grade earned by Thomas using the min() function
thomas_min = min(grades_thomas)
print(thomas_min)

## [1] 7

## Find  the average grade earned by Thomas using the mean() function
thomas_avg = mean(grades_thomas)
print(thomas_avg)

## [1] 12.5

Selection by comparison

The (logical) comparison operators known to R are:

< for less than
> for greater than
<= for less than or equal to
>= for greater than or equal to
== for equal to each other
!= not equal to each other

As seen in the previous chapter, stating 6 > 5 returns TRUE. The nice thing about R is that you can use these comparison operators also on vectors. For example:

# Which courses have Thomas and Victoria failed ? (passing grade = 10)
grades_thomas >= 10

##      math   english   physics chemistry        IT   biology 
##      TRUE      TRUE      TRUE     FALSE      TRUE     FALSE

grades_victoria >= 10

##      math   english   physics chemistry        IT   biology 
##      TRUE     FALSE      TRUE      TRUE      TRUE      TRUE

# Which courses have Thomas and Victoria got excellent ? (excellent > = 17)

grades_thomas >= 17

##      math   english   physics chemistry        IT   biology 
##     FALSE     FALSE     FALSE     FALSE     FALSE     FALSE

grades_victoria >= 17

##      math   english   physics chemistry        IT   biology 
##     FALSE     FALSE      TRUE     FALSE     FALSE     FALSE

# You can use the output of the comparison operators and put it inside [] to select the elements which return TRUE. Example: Which courses have Thomas and Victoria failed ?  
grades_thomas[grades_thomas < 10]

## chemistry   biology 
##         7         9

grades_victoria[grades_victoria < 10]

## english 
##       8

Excercise: After one week in Las Vegas and played poker and roulette, you decide that it is time to start using your data analytical superpowers.

Before doing a first analysis, you decide to first collect all the winnings and losses for the last week:

For poker_vector:

On Monday you won $140
Tuesday you lost $50
Wednesday you won $20
Thursday you lost $120
Friday you won $240

For roulette_vector:

On Monday you lost $24
Tuesday you lost $50
Wednesday you won $100
Thursday you lost $350
Friday you won $10

Task 1: To be able to use this data in R, create vectors for poker and roulette and store them into variables named poker_vector and roulette_vector from Monday to Friday.

Task 2: Assign days (Monday, Tuesday, Wednesday, Thursday & Friday) as names of elements for poker_vector and roulette_vector. Then print out the two vectors.

Task 3: How much you won/lost on each day? Assign the value to variable called total_daily.

Task 4: Calculate the total amount of money that you have won/lost with poker and assign to the variable total_poker. Do the same for roulette and assign it to total_roulette. Print out both variables.

Task 5: Now that you have the totals for roulette and poker, you can easily calculate the sum of all gains and losses of the week and assign it as total_week. Print out total_week.

Task 6: Assign the poker results of Wednesday to the variable poker_wednesday.

Task 7: Assign the poker results of Tuesday, Wednesday and Thursday to the variable poker_midweek.

Task 8: Assign to roulette_selection_vector the roulette results from Tuesday up to Friday. Make use of colon (:) for vector subsetting.

Task 9: Select the first three elements in poker_vector by using their names: “Monday”, “Tuesday” and “Wednesday”. Assign the result of the selection to poker_start. Then calculate the average of the values in poker_start with the. Print out the result so you can inspect it.

Task 10: Check which elements in poker_vector are positive (i.e. > 0) and assign this to selection_vector.Print out selection_vector so you can inspect it. The printout tells you whether you won (TRUE) or lost (FALSE) any money for each day. Then Use selection_vector in square brackets to assign the amounts that you won on the profitable days to the variable poker_winning_days.

Matrix

In R, a matrix is a collection of elements of the same data type (numeric, character, or logical) arranged into a fixed number of rows and columns. Since you are only working with rows and columns, a matrix is called two-dimensional.

You can construct a matrix in R with the matrix() function. Consider the following example:

matrix(1:9, byrow = TRUE, nrow = 3)

##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
## [3,]    7    8    9

In the matrix() function:

The first argument is the collection of elements that R will arrange into the rows and columns of the matrix. Here, we use 1:9 which is a shortcut for c(1, 2, 3, 4, 5, 6, 7, 8, 9).
The argument byrow indicates that the matrix is filled by the rows. If we want the matrix to be filled by the columns, we just place byrow = FALSE.
The third argument nrow indicates that the matrix should have three rows.

In the following exercises you will analyze the box office numbers of the Star Wars franchise. Below three vectors are defined. Each one represents the box office numbers from the first three Star Wars movies. The first element of each vector indicates the US box office revenue, the second element refers to the Non-US box office (source: Wikipedia).

# Box office Star Wars (in millions!)
new_hope <- c(460.998, 314.4)
empire_strikes <- c(290.475, 247.900)
return_jedi <- c(309.306, 165.8)

# Use c() function to combine the three vectors into one vector. Call this vector box_office.
box_office <- c(new_hope, empire_strikes, return_jedi)

# Construct star_wars_matrix
star_wars_matrix <- matrix(box_office, byrow=TRUE, nrow=3)
print(star_wars_matrix)

##         [,1]  [,2]
## [1,] 460.998 314.4
## [2,] 290.475 247.9
## [3,] 309.306 165.8

To help you remember what is stored in star_wars_matrix, you would like to add the names of the movies for the rows. Not only does this help you to read the data, but it is also useful to select certain elements from the matrix.

Similar to vectors, you can add names for the rows and the columns of a matrix using the functions colnames() and rownames().

# Vectors region and titles.You will need these vectors to name the columns and rows of star_wars_matrix respectively. 
region <- c("US", "non-US")
titles <- c("A New Hope", "The Empire Strikes Back", "Return of the Jedi")

# Name the columns with region
colnames(star_wars_matrix) = region

# Name the rows with titles
rownames(star_wars_matrix) = titles

# Print out star_wars_matrix
print(star_wars_matrix)

##                              US non-US
## A New Hope              460.998  314.4
## The Empire Strikes Back 290.475  247.9
## Return of the Jedi      309.306  165.8

To calculate the total box office revenue for the three Star Wars movies, you have to take the sum of the US revenue column and the non-US revenue column.

In R, the function rowSums() conveniently calculates the totals for each row of a matrix. This function creates a new vector:

# Calculate worldwide box office figures
worldwide_vector <- rowSums(star_wars_matrix)
print(worldwide_vector)

##              A New Hope The Empire Strikes Back      Return of the Jedi 
##                 775.398                 538.375                 475.106

Just like rowSums(), we have another built-in R function that calculates the sum of a column. The colSums() function takes a matrix and returns the sum of the columns.

# Total revenue for US and non-US
total_revenue_vector <- colSums(star_wars_matrix)

# Print out total_revenue_vector
print(total_revenue_vector)

##       US   non-US 
## 1060.779  728.100

Selection of matrix elements: Similar to vectors, you can use the square brackets [ ] to select one or multiple elements from a matrix. Whereas vectors have one dimension, matrices have two dimensions. You should therefore use a comma to separate the rows you want to select from the columns. For example:

my_matrix[1,2] selects the element at the first row and second column.
my_matrix[1:3,2:4] results in a matrix with the data on the rows 1, 2, 3 and columns 2, 3, 4.

If you want to select all elements of a row or a column, no number is needed before or after the comma, respectively:

my_matrix[,1] selects all elements of the first column.
my_matrix[1,] selects all elements of the first row.

# Select the non-US revenue for all movies
non_us_all <- star_wars_matrix[,2]
print(non_us_all)

##              A New Hope The Empire Strikes Back      Return of the Jedi 
##                   314.4                   247.9                   165.8

# Average non-US revenue
mean(non_us_all)

## [1] 242.7

# Select the non-US revenue for first two movies
non_us_some <- star_wars_matrix[1:2,2]
print(non_us_some)

##              A New Hope The Empire Strikes Back 
##                   314.4                   247.9

# Average non-US revenue for first two movies
mean(non_us_some)

## [1] 281.15

An arithmetic with matrices: Similar to what we have learned with vectors, the standard operators like +, -, /, *, etc. work in an element-wise way on matrices in R.

For example, 2 * my_matrix multiplies each element of my_matrix by two.

Assume that the price of a ticket was 5 dollars. Simply dividing the box office numbers by this ticket price gives you the number of visitors.

# Estimate the visitors
visitors <- star_wars_matrix / 5
  
# Print the estimate to the console
print(visitors)

##                              US non-US
## A New Hope              92.1996  62.88
## The Empire Strikes Back 58.0950  49.58
## Return of the Jedi      61.8612  33.16

# From the visitors matrix, select the entire first column, representing the number of visitors in the US. Store this selection as us_visitors.
us_visitors <- visitors[,1]
print(us_visitors)

##              A New Hope The Empire Strikes Back      Return of the Jedi 
##                 92.1996                 58.0950                 61.8612

# Calculate the average number of US visitors.;
mean(us_visitors)

## [1] 70.7186

Matrix

Most datasets you will be working with will be stored as data frames. Remember that all the elements that you put in a matrix should be of the same type. Back then, your dataset on Star Wars only contained numeric elements.

When doing a market research survey, however, you often have questions such as:

‘Are you married?’ or ‘yes/no’ questions (logical)
‘How old are you?’ (numeric)
‘What is your opinion on this product?’ or other ‘open-ended’ questions (character)
etc…

The output, namely the respondents’ answers to the questions formulated above, is a dataset of different data types. You will often find yourself working with datasets that contain different data types instead of only one.

A data frame has the variables of a dataset as columns and the observations as rows.

The mtcars dataset is a built-in dataset in R that contains measurements on 11 different attributes for 32 different cars. We will analyze the dataset below:

# What is the type of the data type ? use the class() function
class(mtcars)

## [1] "data.frame"

# Print out built-in R data frame "mtcars" 
mtcars

Wow, that is a lot of cars!

Working with large datasets is very common in data analysis. When you work with large datasets and data frames, your first task as a statistician is to develop a clear understanding of its structure and main elements. Therefore, it is often useful to show only a small part of the entire dataset.

The function head() enables you to show the first observations of a data frame.

# Call head() on mtcars
head(mtcars)

Another method that is often used to get a rapid overview of your data is the function str(). The function str() shows you the structure of your dataset:

# Investigate the structure of mtcars
str(mtcars)

## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

Selection of data frame elements: Similar to vectors and matrices, you select elements from a data frame with the help of square brackets [ ]. By using a comma, you can indicate what to select from the rows and the columns respectively. For example:

my_df[1,2] selects the value at the first row and second column in my_df.
my_df[1:3,2:4] selects rows 1, 2, 3 and columns 2, 3, 4 in my_df.

Sometimes you want to select all elements of a row or column. For example, my_df[1, ] selects all elements of the first row.

# Print out horse power (hp) of the Mazda RX4 (row 1, column 4)
mtcars[1,3]

## [1] 160

# Print out data for Datsun 710 (entire third row)
mtcars[3,]

Instead of using numerics to select elements of a data frame, you can also use the variable names to select columns of a data frame.

# Select the number of cylinders (cyl) for the first 8 cars. 
mtcars[1:8, "cyl"]

## [1] 6 6 4 6 8 6 8 4

You will often want to select an entire column, namely one specific variable from a data frame. If you want to select all elements of the variable mpg (miles per gallon, both of these will do the trick:

mtcars[, 1]
mtcars[, “mpg”]

However, there is a short-cut. If your columns have names, you can use the $ sign:

mtcars$mpg

##  [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
## [16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
## [31] 15.0 21.4

Summarizing dataframe: We can use the summary() function to quickly summarize each variable in the dataset:

#summarize mtcars dataset
summary(mtcars)

##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000

Some Useful R Functions

help(): provides access to the documentation pages for R functions, data sets, and other objects
length(object): number of elements or components
str(object): structure of an object
class(object): class or type of an object
ls(): list current objects
rm(object): delete an object;