Milkias Z. SEMEREAB
GEOL0097-Geostatistics, University of Liège
September 29, 2022
In its most basic form, R can be used as a simple calculator. R has the following arithmetic operators: addition (+), subtraction (-), multiplication (*), division (/), exponentiation (^), and modulo (%%).
The last two might need some explaining. The ^ operator raises the number to its left to the power of the number to its right: for example, 3^2 is 9. The modulo operator returns the remainder of the division of the number to its left by the number to its right: for example, 5 modulo 3, or 5 %% 3, is 2.
# An addition
5 + 4
## [1] 9
# A subtraction
5 - 5
## [1] 0
# A multiplication
3 * 5
## [1] 15
# A division
(5 + 5) / 2
## [1] 5
# Exponentiation
2 ^ 5
## [1] 32
# Modulo
28 %% 6
## [1] 4
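As a small additional check, the sketch below reruns the two examples from the explanation above and shows how parentheses change the order of evaluation. The specific numbers in the last two parts are illustrative choices, not part of the course material.

```r
# The two examples from the text
3^2        # exponentiation: 3 to the power of 2
5 %% 3     # modulo: remainder of 5 divided by 3

# Parentheses control the order of evaluation
5 + 5 / 2    # the division is evaluated first
(5 + 5) / 2  # the addition is evaluated first

# A remainder of 0 means the left number is divisible by the right one
28 %% 7
```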
A basic concept in (statistical) programming is called a variable.
A variable allows you to store a value (e.g. 4) or an object (e.g. a function description) in R. You can then later use this variable’s name to easily access the value or the object that is stored within this variable.
You can assign the value 42 to a variable x with the command:
# Assign the value 42 to x
x <- 42
# Print out the value of the variable x
print(x)
## [1] 42
# "=" can also be used as assignment operator
y = 54
print(y)
## [1] 54
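Once values are stored in variables, the variable names can be used in calculations just like the values themselves. A minimal sketch, in which the names my_apples, my_oranges and my_fruit and their values are purely hypothetical:

```r
# Store two values in variables (hypothetical names and values)
my_apples <- 5
my_oranges <- 6

# Use the variable names in a calculation and store the result
my_fruit <- my_apples + my_oranges
print(my_fruit)
```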
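R works with numerous data types. Some of the most basic types to get started are numerics (numbers such as 42 or 4.5), characters (text or string values such as "universe"), and logicals (the Boolean values TRUE and FALSE):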
# create a numeric called my_numeric with value 42
my_numeric = 42
print(my_numeric)
## [1] 42
# create a string called my_character and assign it to be "universe"
my_character = "universe"
print(my_character)
## [1] "universe"
# create a logical variable called my_logical that is FALSE
my_logical = FALSE
print(my_logical)
## [1] FALSE
Note how the quotation marks indicate that “some text” is a string.
What’s that data type?
If we try to add 5 + “six”, we get an error due to a mismatch in data types. You can avoid such embarrassing situations by checking the data type of a variable beforehand. You can do this with the class() function:
# Check class of my_numeric
class(my_numeric)
## [1] "numeric"
# Check class of my_character
class(my_character)
## [1] "character"
# Check class of my_logical
class(my_logical)
## [1] "logical"
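The type mismatch mentioned above can be reproduced safely. The sketch below is one possible way to do it, not the course's own method: tryCatch() captures the error message instead of stopping the script, and is.numeric() checks the types before adding. The variable values are hypothetical.

```r
# Hypothetical values for illustration
my_numeric <- 5
my_character <- "six"

# Adding a number and a string fails; tryCatch() returns the error
# message as a string instead of stopping the script
result <- tryCatch(
  my_numeric + my_character,
  error = function(e) conditionMessage(e)
)
print(result)

# Check the types first and add only when both values are numeric
if (is.numeric(my_numeric) && is.numeric(my_character)) {
  print(my_numeric + my_character)
} else {
  print("Cannot add: both values must be numeric")
}
```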
Vectors are one-dimensional arrays that can hold numeric data, character data, or logical data. In other words, a vector is a simple tool to store data.
In R, you create a vector with the combine function c(). You place the vector elements, separated by commas, between the parentheses. For example:
numeric_vector <- c(1, 10, 49)
print(numeric_vector)
## [1] 1 10 49
character_vector <- c("a", "b", "c")
print(character_vector)
## [1] "a" "b" "c"
boolean_vector <- c(TRUE, FALSE, TRUE)
print(boolean_vector)
## [1] TRUE FALSE TRUE
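A side note that is not part of the course text: a vector holds a single data type, so c() converts mixed elements to a common type. A minimal sketch, where mixed_vector is an illustrative name:

```r
# Mixing a number, a string, and a logical in c() yields a character vector
mixed_vector <- c(1, "a", TRUE)
print(mixed_vector)
class(mixed_vector)

# length() returns the number of elements in a vector
length(mixed_vector)
```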
You can give a name to the elements of a vector with the names() function. Have a look at this example:
messi_vector = c(170, 67, 35, 3)
names(messi_vector) <- c("height_cm", "weight_kg", "age", "children")
print(messi_vector)
## height_cm weight_kg age children
## 170 67 35 3
To select specific elements of the vector (and later matrices, data frames, …), you can use square brackets. Between the square brackets, you indicate what elements to select. For example, to select the first and third elements of the messi_vector:
element_1 = messi_vector[1]
element_3 = messi_vector[3]
print(element_1)
## height_cm
## 170
print(element_3)
## age
## 35
# For vectors with named elements, you can alternatively select specific elements by name instead of by position
messi_weight = messi_vector["weight_kg"]
messi_children = messi_vector["children"]
print(messi_weight)
## weight_kg
## 67
print(messi_children)
## children
## 3
# To select multiple elements from a vector, you indicate between the brackets what elements should be selected.
messi_1_2 = messi_vector[c(1,2)]
print(messi_1_2)
## height_cm weight_kg
## 170 67
# or also
messi_weig_heig = messi_vector[c("height_cm", "weight_kg")]
print(messi_weig_heig)
## height_cm weight_kg
## 170 67
# Selecting multiple elements with c(1, 2, 3) is not very convenient. Many statisticians are lazy people by nature, so they created an easier way to do this: c(1, 2, 3) can be abbreviated to 1:3, which generates a vector with all natural numbers from 1 up to 3.
messi_bio = messi_vector[1:3]
print(messi_bio)
## height_cm weight_kg age
## 170 67 35
Notice that the first element in a vector has index 1, not 0 as in many other programming languages.
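The 1-based indexing can be checked directly. The sketch below rebuilds messi_vector from the example above; the names first_element and last_element are illustrative.

```r
# Same vector as in the text
messi_vector <- c(170, 67, 35, 3)
names(messi_vector) <- c("height_cm", "weight_kg", "age", "children")

# The first element is at index 1
first_element <- messi_vector[1]
print(first_element)

# The last element is at index length(messi_vector)
last_element <- messi_vector[length(messi_vector)]
print(last_element)

# Index 0 does not refer to any element: the result is an empty vector
length(messi_vector[0])
```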
Once you have created vectors in R, you can use them in calculations with R's built-in functions, as shown in the example below:
# Create vectors to store grades of every course for Thomas and Victoria
grades_thomas <- c(14, 15, 16, 7, 14, 9)
grades_victoria <- c(12, 8, 17, 15, 15, 13)
names(grades_thomas) = c("math", "english", "physics", "chemistry", "IT", "biology")
names(grades_victoria) = c("math", "english", "physics", "chemistry", "IT", "biology")
# display the two grade vectors
print(grades_thomas)
## math english physics chemistry IT biology
## 14 15 16 7 14 9
print(grades_victoria)
## math english physics chemistry IT biology
## 12 8 17 15 15 13
# Take the sum of the two vectors (element-wise addition) and assign it to a new variable
total_grade <- grades_thomas + grades_victoria
print(total_grade)
## math english physics chemistry IT biology
## 26 23 33 22 29 22
# calculate the sum of grades for thomas using the sum() function
total_thomas = sum(grades_thomas)
print(total_thomas)
## [1] 75
# calculate the sum of grades for victoria using the sum() function
total_victoria = sum(grades_victoria)
print(total_victoria)
## [1] 80
# Find the maximum grade earned by Thomas using the max() function
thomas_max = max(grades_thomas)
print(thomas_max)
## [1] 16
# Find the minimum grade earned by Thomas using the min() function
thomas_min = min(grades_thomas)
print(thomas_min)
## [1] 7
# Find the average grade earned by Thomas using the mean() function
thomas_avg = mean(grades_thomas)
print(thomas_avg)
## [1] 12.5
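Beyond the functions used above, base R also provides which.max() and which.min(), which return the position of the largest and smallest element. As a possible extension (not part of the course example), they can be combined with names() to find the corresponding courses; best_course and worst_course are illustrative names.

```r
# Same grades as in the text
grades_thomas <- c(14, 15, 16, 7, 14, 9)
names(grades_thomas) <- c("math", "english", "physics", "chemistry", "IT", "biology")

# which.max() returns the position (with its name) of the largest element
best_course <- names(which.max(grades_thomas))
print(best_course)

# which.min() does the same for the smallest element
worst_course <- names(which.min(grades_thomas))
print(worst_course)
```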
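The (logical) comparison operators known to R are: < (less than), > (greater than), <= (less than or equal to), >= (greater than or equal to), == (equal to), and != (not equal to).
Stating 6 > 5, for instance, returns TRUE. The nice thing about R is that you can also use these comparison operators on vectors. For example: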
# Which courses have Thomas and Victoria passed? (passing grade = 10)
grades_thomas >= 10
## math english physics chemistry IT biology
## TRUE TRUE TRUE FALSE TRUE FALSE
grades_victoria >= 10
## math english physics chemistry IT biology
## TRUE FALSE TRUE TRUE TRUE TRUE
# In which courses did Thomas and Victoria get an excellent grade? (excellent: >= 17)
grades_thomas >= 17
## math english physics chemistry IT biology
## FALSE FALSE FALSE FALSE FALSE FALSE
grades_victoria >= 17
## math english physics chemistry IT biology
## FALSE FALSE TRUE FALSE FALSE FALSE
# You can put the output of a comparison inside [] to select the elements for which it returns TRUE. Example: which courses have Thomas and Victoria failed?
grades_thomas[grades_thomas < 10]
## chemistry biology
## 7 9
grades_victoria[grades_victoria < 10]
## english
## 8
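Two further uses of logical vectors, sketched here as an optional extension: sum() counts the TRUE values, and the & operator combines two comparisons element by element. The names passed_count and mid_grades and the range 10 to 14 are illustrative choices.

```r
# Same grades as in the text
grades_thomas <- c(14, 15, 16, 7, 14, 9)
names(grades_thomas) <- c("math", "english", "physics", "chemistry", "IT", "biology")

# TRUE counts as 1 and FALSE as 0, so sum() counts the passed courses
passed_count <- sum(grades_thomas >= 10)
print(passed_count)

# Combine two comparisons with & to select the grades from 10 up to 14
mid_grades <- grades_thomas[grades_thomas >= 10 & grades_thomas <= 14]
print(mid_grades)
```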
Exercise: After one week in Las Vegas playing poker and roulette, you decide that it is time to start using your data analytical superpowers.
Before doing a first analysis, you decide to collect all the winnings and losses for the last week:
For poker_vector:
For roulette_vector:
Task 1: To be able to use this data in R, create vectors for poker and roulette and store them into variables named poker_vector and roulette_vector from Monday to Friday.
Task 2: Assign days (Monday, Tuesday, Wednesday, Thursday & Friday) as names of elements for poker_vector and roulette_vector. Then print out the two vectors.
Task 3: How much did you win or lose on each day? Assign the result to a variable called total_daily.
Task 4: Calculate the total amount of money that you have won or lost with poker and assign it to the variable total_poker. Do the same for roulette and assign it to total_roulette. Print out both variables.
Task 5: Now that you have the totals for roulette and poker, you can easily calculate the sum of all gains and losses of the week and assign it as total_week. Print out total_week.
Task 6: Assign the poker results of Wednesday to the variable poker_wednesday.
Task 7: Assign the poker results of Tuesday, Wednesday and Thursday to the variable poker_midweek.
Task 8: Assign to roulette_selection_vector the roulette results from Tuesday up to Friday. Make use of colon (:) for vector subsetting.
Task 9: Select the first three elements in poker_vector by using their names: “Monday”, “Tuesday” and “Wednesday”. Assign the result of the selection to poker_start. Then calculate the average of the values in poker_start with the mean() function. Print out the result so you can inspect it.
Task 10: Check which elements in poker_vector are positive (i.e. > 0) and assign this to selection_vector. Print out selection_vector so you can inspect it. The printout tells you whether you won (TRUE) or lost (FALSE) any money for each day. Then use selection_vector in square brackets to assign the amounts that you won on the profitable days to the variable poker_winning_days.
In R, a matrix is a collection of elements of the same data type (numeric, character, or logical) arranged into a fixed number of rows and columns. Since you are only working with rows and columns, a matrix is called two-dimensional.
You can construct a matrix in R with the matrix() function. Consider the following example:
matrix(1:9, byrow = TRUE, nrow = 3)
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
## [3,] 7 8 9
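In the matrix() function, the first argument is the collection of elements that R arranges into the rows and columns of the matrix; here it is 1:9, a shortcut for c(1, 2, 3, 4, 5, 6, 7, 8, 9). The argument byrow = TRUE indicates that the matrix is filled row by row; with byrow = FALSE it is filled column by column. The argument nrow = 3 indicates that the matrix should have three rows.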
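To see what the byrow argument does, the sketch below builds the same 1:9 matrix twice, once filled by rows and once by columns. The names by_row and by_col are illustrative.

```r
# Fill row by row, as in the example from the text
by_row <- matrix(1:9, byrow = TRUE, nrow = 3)
print(by_row)

# Fill column by column (byrow = FALSE is the default of matrix())
by_col <- matrix(1:9, byrow = FALSE, nrow = 3)
print(by_col)

# The two matrices are transposes of each other
identical(by_row, t(by_col))
```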
In the following exercises you will analyze the box office numbers of the Star Wars franchise. Below three vectors are defined. Each one represents the box office numbers from the first three Star Wars movies. The first element of each vector indicates the US box office revenue, the second element refers to the Non-US box office (source: Wikipedia).
# Box office Star Wars (in millions!)
new_hope <- c(460.998, 314.4)
empire_strikes <- c(290.475, 247.900)
return_jedi <- c(309.306, 165.8)
# Use the c() function to combine the three vectors into one vector. Call this vector box_office.
box_office <- c(new_hope, empire_strikes, return_jedi)
# Construct star_wars_matrix
star_wars_matrix <- matrix(box_office, byrow=TRUE, nrow=3)
print(star_wars_matrix)
## [,1] [,2]
## [1,] 460.998 314.4
## [2,] 290.475 247.9
## [3,] 309.306 165.8
To help you remember what is stored in star_wars_matrix, you would like to add the names of the movies for the rows. Not only does this help you to read the data, but it is also useful to select certain elements from the matrix.
Similar to vectors, you can add names for the rows and the columns of a matrix using the functions rownames() and colnames().
# Vectors region and titles. You will need these vectors to name the columns and rows of star_wars_matrix respectively.
region <- c("US", "non-US")
titles <- c("A New Hope", "The Empire Strikes Back", "Return of the Jedi")
# Name the columns with region
colnames(star_wars_matrix) = region
# Name the rows with titles
rownames(star_wars_matrix) = titles
# Print out star_wars_matrix
print(star_wars_matrix)
## US non-US
## A New Hope 460.998 314.4
## The Empire Strikes Back 290.475 247.9
## Return of the Jedi 309.306 165.8
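As an alternative to calling rownames() and colnames() afterwards, matrix() also accepts a dimnames argument, so the names can be set when the matrix is created. This is a sketch of that alternative, reusing the data from the example above, and not the approach used in the course text.

```r
# Same data and name vectors as in the text
box_office <- c(460.998, 314.4, 290.475, 247.900, 309.306, 165.8)
region <- c("US", "non-US")
titles <- c("A New Hope", "The Empire Strikes Back", "Return of the Jedi")

# dimnames takes a list: row names first, then column names
star_wars_matrix <- matrix(box_office, byrow = TRUE, nrow = 3,
                           dimnames = list(titles, region))
print(star_wars_matrix)
```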
To calculate the total box office revenue for the three Star Wars movies, you have to take the sum of the US revenue column and the non-US revenue column.
In R, the function rowSums() conveniently calculates the totals for each row of a matrix. This function creates a new vector:
# Calculate worldwide box office figures
worldwide_vector <- rowSums(star_wars_matrix)
print(worldwide_vector)
## A New Hope The Empire Strikes Back Return of the Jedi
## 775.398 538.375 475.106
Just like rowSums(), we have another built-in R function that calculates the sum of a column. The colSums() function takes a matrix and returns the sum of the columns.
# Total revenue for US and non-US
total_revenue_vector <- colSums(star_wars_matrix)
# Print out total_revenue_vector
print(total_revenue_vector)
## US non-US
## 1060.779 728.100
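The row totals can also be attached to the matrix itself. The sketch below uses the base R function cbind() to append them as a third column; the names all_wars_matrix and worldwide are illustrative choices, and the matrix is rebuilt here so that the block runs on its own.

```r
# Rebuild the matrix from the text
box_office <- c(460.998, 314.4, 290.475, 247.900, 309.306, 165.8)
star_wars_matrix <- matrix(
  box_office, byrow = TRUE, nrow = 3,
  dimnames = list(
    c("A New Hope", "The Empire Strikes Back", "Return of the Jedi"),
    c("US", "non-US")
  )
)

# Row totals, as in the text
worldwide_vector <- rowSums(star_wars_matrix)

# cbind() appends the totals as a third column named "worldwide"
all_wars_matrix <- cbind(star_wars_matrix, worldwide = worldwide_vector)
print(all_wars_matrix)
```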
Selection of matrix elements: Similar to vectors, you can use square brackets [ ] to select one or multiple elements from a matrix. Whereas vectors have one dimension, matrices have two dimensions. You should therefore use a comma to separate the rows you want to select from the columns. For example, my_matrix[1, 2] selects the element in the first row and second column.
If you want to select all elements of a row or a column, leave the other position empty: my_matrix[1, ] selects all elements of the first row, and my_matrix[, 1] selects all elements of the first column. For example:
# Select the non-US revenue for all movies
non_us_all <- star_wars_matrix[,2]
print(non_us_all)
## A New Hope The Empire Strikes Back Return of the Jedi
## 314.4 247.9 165.8
# Average non-US revenue
mean(non_us_all)
## [1] 242.7
# Select the non-US revenue for first two movies
non_us_some <- star_wars_matrix[1:2,2]
print(non_us_some)
## A New Hope The Empire Strikes Back
## 314.4 247.9
# Average non-US revenue for first two movies
mean(non_us_some)
## [1] 281.15
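Because the rows and columns of star_wars_matrix are named, the names can be used inside the square brackets instead of numbers, just as with named vectors. A minimal sketch, with new_hope_us and jedi_row as illustrative variable names:

```r
# Rebuild the named matrix from the text
box_office <- c(460.998, 314.4, 290.475, 247.900, 309.306, 165.8)
star_wars_matrix <- matrix(
  box_office, byrow = TRUE, nrow = 3,
  dimnames = list(
    c("A New Hope", "The Empire Strikes Back", "Return of the Jedi"),
    c("US", "non-US")
  )
)

# Select a single element by row name and column name
new_hope_us <- star_wars_matrix["A New Hope", "US"]
print(new_hope_us)

# Select an entire row by name (nothing after the comma)
jedi_row <- star_wars_matrix["Return of the Jedi", ]
print(jedi_row)
```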
Arithmetic with matrices: Similar to what we have learned with vectors, the standard operators like +, -, /, *, etc. work in an element-wise way on matrices in R.
For example, 2 * my_matrix multiplies each element of my_matrix by two.
Assume that the price of a ticket was 5 dollars. Simply dividing the box office numbers by this ticket price gives you the number of visitors.
# Estimate the visitors
visitors <- star_wars_matrix / 5
# Print the estimate to the console
print(visitors)
## US non-US
## A New Hope 92.1996 62.88
## The Empire Strikes Back 58.0950 49.58
## Return of the Jedi 61.8612 33.16
# From the visitors matrix, select the entire first column, representing the number of visitors in the US. Store this selection as us_visitors.
us_visitors <- visitors[,1]
print(us_visitors)
## A New Hope The Empire Strikes Back Return of the Jedi
## 92.1996 58.0950 61.8612
# Calculate the average number of US visitors
mean(us_visitors)
## [1] 70.7186
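The element-wise behaviour also applies when both operands are matrices. The sketch below uses a small hypothetical matrix called my_matrix to contrast the element-wise product (*) with true matrix multiplication, for which R provides the separate %*% operator.

```r
# Hypothetical 2 x 2 matrix, filled column by column
my_matrix <- matrix(1:4, nrow = 2)
print(my_matrix)

# Multiply every element by two
doubled <- 2 * my_matrix
print(doubled)

# Element-wise product: each element is multiplied by itself
squared <- my_matrix * my_matrix
print(squared)

# Matrix multiplication in the linear-algebra sense uses %*%
matrix_product <- my_matrix %*% my_matrix
print(matrix_product)
```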
Most datasets you will be working with will be stored as data frames. Remember that all the elements that you put in a matrix should be of the same type. Back then, your dataset on Star Wars only contained numeric elements.
When doing a market research survey, however, you often have questions such as:
The output, namely the respondents’ answers to the questions formulated above, is a dataset of different data types. You will often find yourself working with datasets that contain different data types instead of only one.
A data frame has the variables of a dataset as columns and the observations as rows.
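A small data frame with mixed column types can be built with the data.frame() function. The sketch below uses entirely hypothetical survey answers (respondent, age, married) to show that each column keeps its own data type.

```r
# Hypothetical survey answers: one vector per variable (column)
respondent <- c("A", "B", "C")
age <- c(34, 27, 45)
married <- c(TRUE, FALSE, TRUE)

# Combine the vectors into a data frame; each row is one observation
survey_df <- data.frame(respondent, age, married, stringsAsFactors = FALSE)
print(survey_df)

# Each column keeps its own data type
sapply(survey_df, class)
```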
The mtcars dataset is a built-in dataset in R that contains measurements on 11 different attributes for 32 different cars. We will analyze the dataset below:
# What is the class of mtcars? Use the class() function
class(mtcars)
## [1] "data.frame"
# Print out built-in R data frame "mtcars"
mtcars
Wow, that is a lot of cars!
Working with large datasets is very common in data analysis. When you work with large datasets and data frames, your first task as a statistician is to develop a clear understanding of their structure and main elements. Therefore, it is often useful to show only a small part of the entire dataset.
The function head() enables you to show the first observations of a data frame (the first six by default).
# Call head() on mtcars
head(mtcars)
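A few related base R functions are useful for the same purpose. As a brief sketch: head() takes an optional n argument, tail() shows the last observations, and dim() returns the number of rows and columns. The names first_three and last_three are illustrative.

```r
# Show only the first three observations
first_three <- head(mtcars, n = 3)
print(first_three)

# tail() shows the last observations instead
last_three <- tail(mtcars, n = 3)
print(last_three)

# dim() returns the number of rows and columns
dim(mtcars)
```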
Another method that is often used to get a rapid overview of your data is the function str(). The function str() shows you the structure of your dataset:
# Investigate the structure of mtcars
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
Selection of data frame elements: Similar to vectors and matrices, you select elements from a data frame with the help of square brackets [ ]. By using a comma, you can indicate what to select from the rows and the columns respectively. For example, my_df[1, 2] selects the value in the first row and second column of my_df.
Sometimes you want to select all elements of a row or column. For example, my_df[1, ] selects all elements of the first row.
# Print out the horsepower (hp) of the Mazda RX4 (row 1, column 4)
mtcars[1, 4]
## [1] 110
# Print out data for Datsun 710 (entire third row)
mtcars[3,]
Instead of using numerics to select elements of a data frame, you can also use the variable names to select columns of a data frame.
# Select the number of cylinders (cyl) for the first 8 cars.
mtcars[1:8, "cyl"]
## [1] 6 6 4 6 8 6 8 4
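Names work for several columns at once, and for rows as well, since the rows of mtcars are named after the cars. A minimal sketch, where subset_df and datsun_hp are illustrative names:

```r
# Select two columns by name for the first three cars
subset_df <- mtcars[1:3, c("mpg", "hp")]
print(subset_df)

# Select a single value by row name and column name
datsun_hp <- mtcars["Datsun 710", "hp"]
print(datsun_hp)
```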
You will often want to select an entire column, namely one specific variable from a data frame. If you want to select all elements of the variable mpg (miles per gallon), both mtcars[, 1] and mtcars[, "mpg"] will do the trick.
However, there is a short-cut. If your columns have names, you can use the $ sign:
mtcars$mpg
## [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
## [16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
## [31] 15.0 21.4
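A column selected with $ is an ordinary vector, so the vector functions and comparisons introduced earlier apply to it. The sketch below, offered as one possible use, computes the average mpg and keeps only the four-cylinder cars; mean_mpg and four_cyl are illustrative names.

```r
# Average miles per gallon over all cars
mean_mpg <- mean(mtcars$mpg)
print(mean_mpg)

# Use a comparison on a column to select rows: cars with 4 cylinders
four_cyl <- mtcars[mtcars$cyl == 4, ]
nrow(four_cyl)
```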
Summarizing a data frame: We can use the summary() function to quickly summarize each variable in the dataset:
# Summarize the mtcars dataset
summary(mtcars)
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
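The summary() function can also be applied to a single column, and individual statistics can be pulled out of the result by name. The sketch below additionally uses sapply() to compute the mean of every column in one call; mpg_summary and column_means are illustrative names.

```r
# Summarize a single variable
mpg_summary <- summary(mtcars$mpg)
print(mpg_summary)

# Extract one statistic from the summary by its name
mpg_summary[["Median"]]

# Apply mean() to every column of the data frame
column_means <- sapply(mtcars, mean)
print(round(column_means, 2))
```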