Module 1: Introduction to R

1: Intro to basics

How it works

In the editor on the right you should type R code to solve the exercises. When you hit the ‘Submit Answer’ button, every line of code is interpreted and executed by R and you get a message telling you whether or not your code was correct. The output of your R code is shown in the console in the lower right corner.

R makes use of the # sign to add comments, so that you and others can understand what the R code is about. Just like Twitter! Comments are not run as R code, so they will not influence your result. For example, # Calculate 3 + 4 in the editor on the right is a comment.

You can also execute R commands straight in the console. This is a good way to experiment with R code, as your submission is not checked for correctness.

Calculate 3 + 4

3 + 4

## [1] 7

Calculate 6 + 12

6 + 12

## [1] 18

See how the console shows the result of the R code you submitted? Now that you’re familiar with the interface, let’s get down to R business!

Arithmetic with R

In its most basic form, R can be used as a simple calculator. Consider the following arithmetic operators:

Addition: +
Subtraction: -
Multiplication: *
Division: /
Exponentiation: ^
Modulo: %%

The last two might need some explaining:

The ^ operator raises the number to its left to the power of the number to its right: for example 3^2 is 9. The modulo returns the remainder of the division of the number to the left by the number on its right, for example 5 modulo 3 or 5 %% 3 is 2. With this knowledge, follow the instructions below to complete the exercise.

An addition

5 + 5

## [1] 10

A subtraction

5 - 5

## [1] 0

A multiplication

3 * 5

## [1] 15

A division

(5 + 5) / 2

## [1] 5

Exponentiation

2^5

## [1] 32

Modulo

28 %% 6

## [1] 4
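As a small aside, the remainder from %% pairs naturally with integer division, written %/% in R (an operator this chapter does not otherwise use). A minimal sketch of how the two fit together, using the same numbers as the modulo exercise:

```r
# Integer division and modulo split 28 into whole sixes plus a remainder
quotient <- 28 %/% 6     # how many times 6 fits into 28
remainder <- 28 %% 6     # what is left over

# Rebuilding the original number from the two parts
rebuilt <- 6 * quotient + remainder
rebuilt == 28
```

If the last line returns TRUE, the quotient and remainder account for the whole number.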

Variable assignment

A basic concept in (statistical) programming is called a variable.

A variable allows you to store a value (e.g. 4) or an object (e.g. a function description) in R. You can then later use this variable’s name to easily access the value or the object that is stored within this variable.

You can assign a value 4 to a variable my_var with the command

my_var <- 4

Assign the value 42 to x

x <- 42

Print out the value of the variable x

x

## [1] 42

Have you noticed that R does not print the value of a variable to the console when you did the assignment? x <- 42 did not generate any output, because R assumes that you will be needing this variable in the future. Otherwise you wouldn’t have stored the value in a variable in the first place, right? Proceed to the next exercise!
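If you do want to see a value right away, there are a few ways to ask for it. The sketch below is only an illustration; the variable y is a made-up name that is not part of the exercise.

```r
# Assignment alone prints nothing
x <- 42

# Typing the name, or calling print(), shows the stored value
x
print(x)

# Wrapping an assignment in parentheses assigns and prints in one step
(y <- x + 1)
```

The parentheses trick is handy in the console, but plain assignment followed by the variable name is usually easier to read in a script.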

Variable assignment (2)

Suppose you have a fruit basket with five apples. As a data analyst in training, you want to store the number of apples in a variable with the name my_apples.

Assign the value 5 to the variable my_apples

my_apples <- 5

Print out the value of the variable my_apples

my_apples

## [1] 5

Variable assignment (3)

Every tasty fruit basket needs oranges, so you decide to add six oranges. As a data analyst, your reflex is to immediately create the variable my_oranges and assign the value 6 to it. Next, you want to calculate how many pieces of fruit you have in total. Since you have given meaningful names to these values, you can now code this in a clear way:

my_apples + my_oranges

Assign a value to the variable my_oranges

my_oranges <- 6

Add these two variables together

my_apples + my_oranges

## [1] 11

Create the variable my_fruit

my_fruit <- my_apples + my_oranges

Nice one! The great advantage of doing calculations with variables is reusability. If you just change my_apples to equal 12 instead of 5 and rerun the script, my_fruit will automatically update as well. Continue to the next exercise.
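Note the words "rerun the script" above: my_fruit only changes when the line that calculates it runs again. A short sketch of that behaviour, where stale_fruit is a hypothetical name used purely to capture the old value:

```r
my_apples <- 5
my_oranges <- 6
my_fruit <- my_apples + my_oranges   # 11

# Changing my_apples does not touch my_fruit by itself
my_apples <- 12
stale_fruit <- my_fruit              # still the old total

# Rerunning the calculation picks up the new value
my_fruit <- my_apples + my_oranges
my_fruit
```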

Apples and oranges

Common knowledge tells you not to add apples and oranges. But hey, that is what you just did, no :-)? The my_apples and my_oranges variables both contained a number in the previous exercise. The + operator works with numeric variables in R. If you really tried to add “apples” and “oranges”, and assigned a text value to the variable my_oranges (see the editor), you would be trying to assign the addition of a numeric and a character variable to the variable my_fruit. This is not possible.
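To see the failure without stopping your script, you can wrap the addition in tryCatch(), a base R function that this chapter does not cover. This is a sketch for experimenting in the console, not part of the exercise:

```r
my_apples <- 5
my_oranges <- "six"

# tryCatch() intercepts the error so the script keeps running;
# the handler returns the error message as text
result <- tryCatch(
  my_apples + my_oranges,
  error = function(e) conditionMessage(e)
)
result
```

Here result holds R's error message rather than a number, which confirms that the addition never happened.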

Basic data types in R

R works with numerous data types. Some of the most basic types to get started are:

Decimal values like 4.5 are examples of the data type numeric.
Whole numbers like 3, 4, 0 and -4 are called integers. Integers are also examples of the numeric data type.
Boolean values (TRUE or FALSE) are called logical.
Text (or string) values are examples of the data type character.

Note how the quotation marks indicate that “the text inside quotation marks” is a character.
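One nuance you can check for yourself: a whole number typed without a suffix is stored by R as a decimal (double) value, and the L suffix asks for a true integer. The class() function used here is introduced properly a little further on.

```r
class(4.5)       # "numeric"
class(3)         # "numeric": a plain 3 is stored as a decimal value
class(3L)        # "integer": the L suffix asks for a true integer
is.numeric(3L)   # TRUE: integers count as numeric
class(TRUE)      # "logical"
class("text")    # "character"
```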

Change my_numeric to be 42

my_numeric <- 42

Change my_character to be “universe”

my_character <- "universe"

Change my_logical to be FALSE

my_logical <- FALSE

What’s the data type?

If you set the variable my_apples to have the value 5 and the variable my_oranges to have the value “six”, attempting to add them will give an error. This is due to a mismatch in data types. You can add two (or more) numerics together, but you can’t add a numeric and a character together. Avoid such embarrassing situations by checking the data type of a variable beforehand. You can do this with the class() function, as the code below shows.

Check class of my_numeric

class(my_numeric)

## [1] "numeric"

Check class of my_character

class(my_character)

## [1] "character"

Check class of my_logical

class(my_logical)

## [1] "logical"
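Checking a type is most useful when the check decides what happens next. The sketch below assumes you would rather get NA than an error; it uses is.numeric() and an if/else statement, neither of which this chapter has introduced.

```r
my_apples <- 5
my_oranges <- "six"

# Only add when both values are numeric; otherwise fall back to NA
if (is.numeric(my_apples) && is.numeric(my_oranges)) {
  my_fruit <- my_apples + my_oranges
} else {
  my_fruit <- NA
}
my_fruit
```

Falling back to NA is just one possible choice; stopping with a clear message would be equally reasonable.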

This was the last exercise for this chapter. Head over to the next chapter to get immersed in the world of vectors!

2: Vectors

Create a vector

Feeling lucky? You better, because this chapter takes you on a trip to the City of Sins, also known as Statisticians’ Paradise!

Thanks to R and your new data-analytical skills, you will learn how to uplift your performance at the tables and fire off your career as a professional gambler. This chapter will show how you can easily keep track of your betting progress and how you can do some simple analyses on past actions. Next stop, Vegas Baby… VEGAS!!

Define the variable vegas

vegas <- "Go!"

Create a vector (2)

Let us focus first!

On your way from rags to riches, you will make extensive use of vectors. Vectors are one-dimensional arrays that can hold numeric data, character data, or logical data. In other words, a vector is a simple tool to store data. For example, you can store your daily gains and losses in the casinos.

In R, you create a vector with the combine function c(). You place the vector elements separated by a comma between the parentheses. For example:

numeric_vector <- c(1, 2, 3)
character_vector <- c("a", "b", "c")

Once you have created these vectors in R, you can use them to do calculations.

numeric_vector <- c(1, 10, 49)
character_vector <- c("a", "b", "c")

Complete the code for boolean_vector

boolean_vector <- c(TRUE,  FALSE, TRUE)

Notice that adding a space after the commas in the c() function improves the readability of your code, but changes nothing else. Let’s practice some more with vector creation in the next exercise.
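A vector holds one type of data at a time. If you mix types inside c(), R quietly converts everything to a common type. A small sketch, with mixed_vector as a made-up name:

```r
# Mixing numbers, text and logicals: everything becomes text
mixed_vector <- c(1, "a", TRUE)
mixed_vector
class(mixed_vector)   # "character"

# Mixing numbers and logicals: everything becomes numeric
# (TRUE is treated as 1 and FALSE as 0)
class(c(1, TRUE))     # "numeric"
```

This is worth remembering when a calculation unexpectedly fails: one stray text value turns the whole vector into characters.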

Create a vector (3)

Poker winnings from Monday to Friday

poker_vector <- c(140, -50, 20, -120, 240)

Roulette winnings from Monday to Friday

roulette_vector <- c(-24, -50, 100, -350, 10)

To check out the contents of your vectors, remember that you can always simply type the variable in the console and hit Enter.

Naming a vector

As a data analyst, it is important to have a clear view on the data that you are using. Understanding what each element refers to is therefore essential.

In the previous exercise, we created a vector with your winnings over the week. Each vector element refers to a day of the week but it is hard to tell which element belongs to which day. It would be nice if you could show that in the vector itself.

You can give a name to the elements of a vector with the names() function. Have a look at this example:

some_vector <- c("John Doe", "poker player")
names(some_vector) <- c("Name", "Profession")

This code first creates a vector some_vector and then gives the two elements a name. The first element is assigned the name Name, while the second element is labeled Profession. Printing the contents to the console yields the following output:

      Name     Profession 
"John Doe" "poker player"

Assign days as names of poker_vector

names(poker_vector) <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")

Assign days as names of roulette_vector

names(roulette_vector) <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")

Naming a vector (2)

If you want to become a good statistician, you have to become lazy. (If you are already lazy, chances are high you are one of those exceptional, natural-born statistical talents.)

In the previous exercises you probably experienced that it is boring and frustrating to type and retype information such as the days of the week. However, when you look at it from a higher perspective, there is a more efficient way to do this, namely, to assign the days of the week vector to a variable!

Just like you did with your poker and roulette returns, you can also create a variable that contains the days of the week. This way you can use and re-use it.

The variable days_vector

days_vector <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")

Assign the names of the day to roulette_vector and poker_vector

names(roulette_vector) <- days_vector
names(poker_vector) <- days_vector

A word of advice: try to avoid code duplication at all times. Continue to the next exercise and learn how to do arithmetic with vectors!
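The names() function works in both directions: on the left of an assignment it sets names, and on its own it reads them back. A brief sketch that repeats the vectors from above so it can run by itself:

```r
days_vector <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
poker_vector <- c(140, -50, 20, -120, 240)

# Setting the names
names(poker_vector) <- days_vector

# Reading the names back
names(poker_vector)

# A vector that was never named returns NULL
names(c(1, 2, 3))
```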

Calculating total winnings

Now that you have the poker and roulette winnings nicely as named vectors, you can start doing some data analytical magic.

You want to find out the following type of information:

How much has been your overall profit or loss per day of the week?
Have you lost money over the week in total?
Are you winning/losing money on poker or on roulette?

To get the answers, you have to do arithmetic calculations on vectors.

It is important to know that if you sum two vectors in R, it takes the element-wise sum. For example, the following three statements are completely equivalent:

c(1, 2, 3) + c(4, 5, 6)
c(1 + 4, 2 + 5, 3 + 6)
c(5, 7, 9)

You can also do the calculations with variables that represent vectors:

a <- c(1, 2, 3)
b <- c(4, 5, 6)
c <- a + b

A_vector <- c(1, 2, 3)
B_vector <- c(4, 5, 6)

Take the sum of A_vector and B_vector

total_vector <- A_vector + B_vector

Print out total_vector

total_vector

## [1] 5 7 9
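The vectors above have the same length. When they do not, R repeats ("recycles") the shorter one to match the longer one. A minimal sketch, with plus_ten and recycled as made-up names:

```r
# A single number is recycled across every element
plus_ten <- c(1, 2, 3) + 10
plus_ten    # 11 12 13

# A shorter vector is repeated to match the longer one
recycled <- c(1, 2, 3, 4) + c(10, 20)
recycled    # 11 22 13 24
```

Recycling is convenient, but it can hide mistakes, so it is worth checking lengths when a result looks odd.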

Calculating total winnings (2)

Now you understand how R does arithmetic with vectors, it is time to get those Ferraris in your garage! First, you need to understand what the overall profit or loss per day of the week was. The total daily profit is the sum of the profit/loss you made on poker per day, and the profit/loss you made on roulette per day.

In R, this is just the sum of roulette_vector and poker_vector.

Assign to total_daily how much you won/lost on each day

total_daily <- poker_vector + roulette_vector

Display the daily total

total_daily

##    Monday   Tuesday Wednesday  Thursday    Friday 
##       116      -100       120      -470       250

Calculating total winnings (3)

Based on the previous analysis, it looks like you had a mix of good and bad days. This is not what your ego expected, and you wonder if there may be a very tiny chance you have lost money over the week in total?

A function that helps you to answer this question is sum(). It calculates the sum of all elements of a vector. For example, to calculate the total amount of money you have lost/won with poker you do:

total_poker <- sum(poker_vector)

Total winnings with poker

total_poker <- sum(poker_vector)
total_poker

## [1] 230

Total winnings with roulette

total_roulette <- sum(roulette_vector)
total_roulette

## [1] -314

Total winnings overall

total_week <- total_poker + total_roulette

Print out total_week

total_week

## [1] -84
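A quick way to gain confidence in a total is to compute it by a second route. The sketch below repeats the two vectors so it runs by itself; by_game and by_day are hypothetical names.

```r
poker_vector <- c(140, -50, 20, -120, 240)
roulette_vector <- c(-24, -50, 100, -350, 10)

# Route 1: total per game, then add the two totals
by_game <- sum(poker_vector) + sum(roulette_vector)

# Route 2: total per day, then add up the days
by_day <- sum(poker_vector + roulette_vector)

by_game == by_day
```

Both routes should give the same weekly total of -84.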

Comparing total winnings

Oops, it seems like you are losing money. Time to rethink and adapt your strategy! This will require some deeper analysis…

After a short brainstorm in your hotel’s jacuzzi, you realize that a possible explanation might be that your skills in roulette are not as well developed as your skills in poker. So maybe your total gains in poker are higher (or > ) than in roulette.

Check if you realized higher total gains in poker than in roulette

total_poker > total_roulette

## [1] TRUE

Vector selection: the good times

Your hunch seemed to be right. It appears that the poker game is more your cup of tea than roulette.

Another possible route for investigation is your performance at the beginning of the working week compared to the end of it. You did have a couple of Margarita cocktails at the end of the week…

To answer that question, you only want to focus on a selection of the total_vector. In other words, our goal is to select specific elements of the vector. To select elements of a vector (and later matrices, data frames, …), you can use square brackets. Between the square brackets, you indicate what elements to select. For example, to select the first element of the poker vector, you type poker_vector[1]. To select the second element of the vector, you type poker_vector[2], etc. Notice that the first element in a vector in R has index 1, not 0 as in many other programming languages.

Define a new variable based on a selection

poker_wednesday <- poker_vector[3]

R also makes it possible to select multiple elements from a vector at once. Learn how in the next exercise!

Vector selection: the good times (2)

How about analyzing your midweek results?

To select multiple elements from a vector, you can add square brackets at the end of it. You can indicate between the brackets what elements should be selected. Suppose you want to select the first and the fifth day of the week: use the vector c(1, 5) between the square brackets. For example, the code below selects the first and fifth element of poker_vector:

poker_vector[c(1, 5)]

Define a new variable based on a selection

poker_midweek <- poker_vector[c(2,3,4)]

Continue to the next exercise to specialize in vector selection some more!

Vector selection: the good times (3)

Selecting multiple elements of poker_vector with c(2, 3, 4) is not very convenient. Many statisticians are lazy people by nature, so they created an easier way to do this: c(2, 3, 4) can be abbreviated to 2:4, which generates a vector with all natural numbers from 2 up to and including 4.

So, another way to find the mid-week results is poker_vector[2:4].

Define a new variable based on a selection

roulette_selection_vector <- roulette_vector[2:5]

The colon operator is extremely useful and very often used in R programming, so remember it well.
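A few related tricks, sketched here as optional extras: the colon operator can count down, seq() handles steps other than 1, and a negative index drops an element instead of selecting it.

```r
2:4                   # 2 3 4
5:1                   # counts down: 5 4 3 2 1
seq(2, 10, by = 2)    # 2 4 6 8 10

roulette_vector <- c(-24, -50, 100, -350, 10)

# A negative index drops that position, so this matches roulette_vector[2:5]
roulette_vector[-1]
```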

Vector selection: the good times (4)

Another way to tackle the previous exercise is by using the names of the vector elements (Monday, Tuesday, …) instead of their numeric positions. For example,

poker_vector["Monday"]

will select the first element of poker_vector because “Monday” is the name of that first element.

Just like you did in the previous exercise with numerics, you can also use the element names to select multiple elements, for example:

poker_vector[c("Monday","Tuesday")]

Select poker results for Monday, Tuesday and Wednesday

poker_start <- poker_vector[c("Monday", "Tuesday","Wednesday")]

Calculate the average of the elements in poker_start

mean(poker_start)

## [1] 36.66667

Good job! Apart from subsetting vectors by index or by name, you can also subset vectors by comparison. The next exercises will show you how!

Selection by comparison - Step 1

By making use of comparison operators, we can approach the previous question in a more proactive way.

The (logical) comparison operators known to R are:

< for less than
> for greater than
<= for less than or equal to
>= for greater than or equal to
== for equal to each other
!= for not equal to each other

As seen in the previous chapter, stating 6 > 5 returns TRUE. The nice thing about R is that you can use these comparison operators also on vectors. For example:

c(4, 5, 6) > 5
[1] FALSE FALSE TRUE

This command tests for every element of the vector if the condition stated by the comparison operator is TRUE or FALSE.

Which days did you make money on poker?

poker_vector > 0

##    Monday   Tuesday Wednesday  Thursday    Friday 
##      TRUE     FALSE      TRUE     FALSE      TRUE

selection_vector <- poker_vector > 0

Print out selection_vector

selection_vector

##    Monday   Tuesday Wednesday  Thursday    Friday 
##      TRUE     FALSE      TRUE     FALSE      TRUE
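Because TRUE counts as 1 and FALSE as 0 in arithmetic, a logical vector is easy to summarise. A short sketch, where winning is a made-up name for the comparison result:

```r
poker_vector <- c(140, -50, 20, -120, 240)
names(poker_vector) <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")

winning <- poker_vector > 0

sum(winning)     # number of winning days: 3
mean(winning)    # share of winning days: 0.6
which(winning)   # positions of the winning days: 1, 3 and 5
```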

Selection by comparison - Step 2

Working with comparisons will make your data analytical life easier. Instead of selecting a subset of days to investigate yourself (like before), you can simply ask R to return only those days where you realized a positive return for poker.

In the previous exercises you used selection_vector <- poker_vector > 0 to find the days on which you had a positive poker return. Now, you would like to know not only the days on which you won, but also how much you won on those days.

You can select the desired elements, by putting selection_vector between the square brackets that follow poker_vector:

poker_vector[selection_vector]

R knows what to do when you pass a logical vector in square brackets: it will only select the elements that correspond to TRUE in selection_vector.

Select from poker_vector these days

poker_winning_days <- poker_vector[selection_vector]
poker_winning_days

##    Monday Wednesday    Friday 
##       140        20       240

I printed out selection_vector and poker_winning_days along the way, just to be sure I was doing it right.

Advanced selection

Just like you did for poker, you also want to know those days where you realized a positive return for roulette.

Which days did you make money on roulette?

selection_vector <- roulette_vector > 0

Notice how we’re reusing selection_vector for the second time.

Select from roulette_vector these days

roulette_winning_days <- roulette_vector[selection_vector]
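Two small extensions of this idea, offered as a sketch rather than as part of the exercise: you can put the comparison directly inside the brackets, and you can combine two comparisons with the & operator (element-wise "and").

```r
days_vector <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
poker_vector <- c(140, -50, 20, -120, 240)
roulette_vector <- c(-24, -50, 100, -350, 10)
names(poker_vector) <- days_vector
names(roulette_vector) <- days_vector

# Comparison written directly inside the brackets
roulette_vector[roulette_vector > 0]

# Days on which you won at both games
both_days <- poker_vector > 0 & roulette_vector > 0
names(both_days)[both_days]
```

With the winnings above, only Wednesday and Friday were winning days at both tables.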

This exercise concludes the chapter on vectors. The next chapter will introduce you to the two-dimensional version of vectors: matrices.

3: Matrices

What’s a matrix?

In R, a matrix is a collection of elements arranged into a fixed number of rows and columns. Since you are only working with rows and columns, a matrix is called two-dimensional.

You can construct a matrix in R with the matrix() function. Consider the following example:

matrix(1:9, byrow = TRUE, nrow = 3)

In the matrix() function:

The first argument is the collection of elements that R will arrange into the rows and columns of the matrix. Here, we use 1:9 which, as we’ve seen before, is a shortcut for c(1, 2, 3, 4, 5, 6, 7, 8, 9).
The next argument, byrow, indicates that the matrix is filled by the rows. If we want the matrix to be filled by the columns, we just place byrow = FALSE.
The third argument, nrow, indicates that the matrix should have three rows.

Construct a matrix with 3 rows that contains the numbers 1 up to 9

matrix(1:9, byrow=TRUE, nrow = 3)

##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
## [3,]    7    8    9

What happens, do you think, if you change TRUE in the above code to FALSE? Try it and see.
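To compare the two fill orders side by side without giving away the 1:9 answer, here is a sketch with a smaller matrix; by_row and by_col are made-up names.

```r
# Filled row by row: the first row is 1 2 3
by_row <- matrix(1:6, byrow = TRUE, nrow = 2)
by_row

# Filled column by column: the first row is 1 3 5
by_col <- matrix(1:6, byrow = FALSE, nrow = 2)
by_col
```

The elements are the same in both cases; only the order in which R walks through the cells changes.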

Analyze matrices, you shall

It is now time to get your hands dirty. In the following exercises you will analyze the box office numbers of the Star Wars franchise. May the force be with you!

In the editor, three vectors are defined. Each one holds the box office numbers of one of the first three Star Wars movies. The first element of each vector indicates the US box office revenue, the second element refers to the Non-US box office (source: Wikipedia).

In this exercise, you’ll combine all these figures into a single vector. Next, you’ll build a matrix from this vector.

Box office Star Wars (in millions!)

new_hope <- c(460.998, 314.4)
empire_strikes <- c(290.475, 247.900)
return_jedi <- c(309.306, 165.8)

In other words, three vectors, all of length 2, containing numerics.

Create box_office

box_office <- c(new_hope, empire_strikes, return_jedi)

In other words, one vector, called box_office, of length 6.

Print box_office

box_office

## [1] 460.998 314.400 290.475 247.900 309.306 165.800

Construct star_wars_matrix

star_wars_matrix <- matrix(box_office, byrow = TRUE, nrow = 3)

In other words, a matrix, called star_wars_matrix, with 3 rows, which has been filled, by row, with the 6 elements of the vector box_office. This forces there to be two elements per row.

What happens if you try to fill a matrix with 3 rows using a vector containing 5 elements, or 7 elements, or 8 elements? Try it and see.

You’ll see that to fill a matrix containing n rows, you need a multiple of n elements. Draw a picture of a matrix to help you understand why.
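If you would like to capture what R says in that situation rather than just read it in the console, the sketch below uses tryCatch() (not covered in this chapter) to store the warning text in a hypothetical variable warning_text.

```r
# Five elements do not divide evenly over three rows, so R warns
warning_text <- tryCatch(
  matrix(1:5, nrow = 3),
  warning = function(w) conditionMessage(w)
)
warning_text

# Six elements fit three rows exactly
dim(matrix(1:6, nrow = 3))   # 3 2
```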

The force is actually with you!

Naming a matrix

To help you remember what is stored in star_wars_matrix, you would like to add the names of the movies for the rows. Not only does this help you to read the data, but it is also useful to select certain elements from the matrix.

Previously we used the function names() to name the elements of a vector. As we’re dealing with two-dimensional arrays now, we need to be a little more specific. Similar to vectors, you can add names for the rows and the columns of a matrix:

rownames(my_matrix) <- row_names_vector
colnames(my_matrix) <- col_names_vector

Below we’ll create two new vectors for use in this chapter: region, and titles. You will use these vectors to name the columns and rows of star_wars_matrix, respectively.

Create vectors region and titles, used for naming

region <- c("US", "non-US")
titles <- c("A New Hope", "The Empire Strikes Back", "Return of the Jedi")

Name the columns with region

colnames(star_wars_matrix) <- region

Name the rows with titles

rownames(star_wars_matrix) <- titles

Print out star_wars_matrix

star_wars_matrix

##                              US non-US
## A New Hope              460.998  314.4
## The Empire Strikes Back 290.475  247.9
## Return of the Jedi      309.306  165.8

How large is this matrix? Is it still 3 rows and 2 columns? Yes, despite it looking larger now we’ve added names to it, this matrix is still a 3 x 2 matrix. Check this using the dimension function dim().

Check dimension of star_wars_matrix

dim(star_wars_matrix)

## [1] 3 2

Notice it gives you the number of rows first, then the number of columns.
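If you only need one of the two numbers, base R also offers nrow() and ncol(). A small sketch on a made-up matrix m:

```r
m <- matrix(1:6, nrow = 3)

dim(m)      # 3 2: rows first, then columns
nrow(m)     # 3
ncol(m)     # 2
length(m)   # 6: the total number of elements
```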

How, using the vector box_office, would you create a matrix with 2 rows and 3 columns? We might look at this in one of our challenges.

Calculating the worldwide box office

The single most important thing for a movie in order to become an instant legend in Tinseltown is its worldwide box office figures.

To calculate the total box office revenue for the three Star Wars movies, you have to take the sum of the US revenue column and the non-US revenue column.

In R, when dealing with matrices, the function rowSums() conveniently calculates the totals for each row of a matrix. This function creates a new vector:

rowSums(my_matrix)

which we can assign to a variable.

Calculate worldwide box office figures

worldwide_vector <- rowSums(star_wars_matrix)
worldwide_vector

##              A New Hope The Empire Strikes Back      Return of the Jedi 
##                 775.398                 538.375                 475.106

What does this vector tell you exactly?

What about the vector colSums(star_wars_matrix)? What will this vector tell us? Have a thought then give this vector a name.

us_and_abroad_vector <- colSums(star_wars_matrix)
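One way to convince yourself of what rowSums() and colSums() do is to compare them with apply(), a more general base R function that this chapter does not cover. The sketch rebuilds the matrix without names so it can run by itself; m is a made-up name.

```r
m <- matrix(c(460.998, 314.4, 290.475, 247.900, 309.306, 165.8),
            nrow = 3, byrow = TRUE)

# rowSums() matches applying sum() over dimension 1 (the rows)
all.equal(rowSums(m), apply(m, 1, sum))

# colSums() matches applying sum() over dimension 2 (the columns)
all.equal(colSums(m), apply(m, 2, sum))

# Either way, the grand total is the same
all.equal(sum(rowSums(m)), sum(colSums(m)))
```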

Using dimnames()

Hopefully you’re starting to understand (if you didn’t already) that when we refer to matrices, we always refer to their rows first, then columns. For example, a 3 x 2 matrix is one with 3 rows and 2 columns.

This is true also when we build a matrix from scratch. Below is code which builds the same matrix we’ve been building over the past few exercises:

Construct star_wars_matrix

box_office <- c(460.998, 314.4, 290.475, 247.900, 309.306, 165.8)
star_wars_matrix <- matrix(box_office, nrow = 3, byrow = TRUE,
                           dimnames = list(c("A New Hope", "The Empire Strikes Back", "Return of the Jedi"), 
                                           c("US", "non-US")))

Notice how, rather than using the functions rownames() and colnames(), we used the dimnames argument of matrix() to name the rows and columns. We wrote dimnames = list(c("row", "names"), c("column", "names")).

We also made use of the list function, which we’ll learn to use fully in a later chapter in this module.
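Once a matrix has names, you can read them back and use them to pick out values. A minimal sketch on a made-up matrix small_matrix with placeholder names:

```r
small_matrix <- matrix(1:4, nrow = 2, byrow = TRUE,
                       dimnames = list(c("r1", "r2"), c("c1", "c2")))

# dimnames() returns a list: row names first, then column names
dimnames(small_matrix)
rownames(small_matrix)
colnames(small_matrix)

# Names can stand in for positions when selecting
small_matrix["r2", "c1"]   # 3
```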

Adding a column for the Worldwide box office

In a previous exercise you calculated the vector that contained the worldwide box office receipt for each of the three Star Wars movies. However, this vector is not yet part of star_wars_matrix.

You can add a column or multiple columns to a matrix with the cbind() function, which merges matrices and/or vectors together by column. For example:

big_matrix <- cbind(matrix1, matrix2, vector1 ...)

Let’s add the worldwide totals to the matrix.

Bind the new variable worldwide_vector as a column to star_wars_matrix

all_wars_matrix <- cbind(star_wars_matrix, worldwide_vector)

Obviously this step only works cleanly if the vector has as many elements as the matrix has rows (and, when binding two matrices, if they have the same number of rows).

After adding this column to our matrix, the logical next step is to add rows. Learn how in the next exercise…

Adding a row

For this exercise, we’ll use our existing matrix star_wars_matrix, which covered the original trilogy of movies, and a second matrix star_wars_matrix2, also 3 x 2, which will contain the same data for the prequels trilogy.

By the way, if you ever want to check out the contents of the workspace you’re working in, you can type ls() in the console.

Creating the prequels matrix star_wars_matrix2

box_office <- c(474.5, 552.5, 310.7, 338.7, 380.3, 468.5)
star_wars_matrix2 <- matrix(box_office, nrow = 3, byrow = TRUE,
                            dimnames = list(c("The Phantom Menace", "Attack of the Clones", "Revenge of the Sith"),
                                            c("US", "non-US")))

Combine both Star Wars trilogies in one matrix

all_wars_matrix <- rbind(star_wars_matrix, star_wars_matrix2)

Again, we had to be sure, which we were, that the two matrices contained the same number of columns in order to be able to successfully merge them.
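The sketch below shows how the dimensions change under rbind() and cbind(), and what happens when the shapes do not match. The matrices top and bottom are made up for illustration, and tryCatch() is used only to keep the error from stopping the script.

```r
top <- matrix(1:6, nrow = 3)       # 3 x 2
bottom <- matrix(7:12, nrow = 3)   # 3 x 2

dim(rbind(top, bottom))   # 6 2: rows are stacked
dim(cbind(top, bottom))   # 3 4: columns are placed side by side

# Stacking a 3 x 2 matrix on a 3 x 3 matrix fails: the column counts differ
mismatch <- tryCatch(
  rbind(top, matrix(1:9, nrow = 3)),
  error = function(e) conditionMessage(e)
)
mismatch
```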

The total box office revenue for the entire saga

Would you use colSums() or rowSums() on the matrix all_wars_matrix to calculate the total box office revenue for the entire saga in the US versus abroad?

That’s right! It’s colSums()!

Total revenue for US and non-US

total_revenue_vector <- colSums(all_wars_matrix)

Print out total_revenue_vector

total_revenue_vector

Head over to the next exercise to learn about matrix subsetting.

Selection of matrix elements

Similar to vectors, you can use the square brackets [ ] to select one or multiple elements from a matrix. Whereas vectors have one dimension, matrices have two dimensions. You should therefore use a comma to separate the rows you want to select from the columns. Notice again that rows go first, then columns. For example:

my_matrix[1,2] selects the element at the first row and second column.
my_matrix[1:3,2:4] results in a matrix with the data on the rows 1, 2, 3 and columns 2, 3, 4.

If you want to select all elements of a row or a column, no number is needed before or after the comma, respectively:

my_matrix[,1] selects all elements of the first column.
my_matrix[1,] selects all elements of the first row.

What will my_matrix[1] select? Does it even select anything?
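If you want to test your guess, here is a sketch on a made-up matrix. With a single index, R treats the matrix as one long vector and counts down the columns, one column after another.

```r
my_matrix <- matrix(1:6, nrow = 2, byrow = TRUE)
my_matrix
# The rows are 1 2 3 and 4 5 6

my_matrix[1]   # 1: the top-left element
my_matrix[2]   # 4: counting continues down the first column
my_matrix[3]   # 2: then on to the top of the second column

# Selecting one row normally gives back a plain vector;
# drop = FALSE keeps it as a 1 x 3 matrix
dim(my_matrix[1, , drop = FALSE])
```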

Back to Star Wars with this newly acquired knowledge!

Select the non-US revenue for all movies

non_us_all <- all_wars_matrix[,2]

Average non-US revenue

mean(non_us_all)

## [1] 347.9667

Select the non-US revenue for first two movies

non_us_some <- all_wars_matrix[1:2,2]

Average non-US revenue for first two movies

mean(non_us_some)

## [1] 281.15

A little arithmetic with matrices

Similar to what you have learned with vectors, the standard operators like +, -, /, *, etc. work in an element-wise way on matrices in R.

For example, 2 * my_matrix multiplies each element of my_matrix by two.

As a newly-hired data analyst for Lucasfilm, it is your job to find out how many visitors went to each movie for each geographical area. You already have the total revenue figures (in the matrix all_wars_matrix). Assume that the price of a ticket was 5 dollars. Simply dividing the box office numbers by this ticket price gives you the number of visitors.

Estimate the visitors

visitors <- all_wars_matrix / 5

Print the estimate to the console

visitors

##                              US non-US
## A New Hope              92.1996  62.88
## The Empire Strikes Back 58.0950  49.58
## Return of the Jedi      61.8612  33.16
## The Phantom Menace      94.9000 110.50
## Attack of the Clones    62.1400  67.74
## Revenge of the Sith     76.0600  93.70

What do these results tell you? A staggering 92 million people went to see A New Hope in US theaters!


A little arithmetic with matrices (2)

After looking at the result of the previous exercise, big boss Lucas points out that the ticket prices went up over time. He asks to redo the analysis based on the prices you can find in ticket_prices_matrix (source: imagination).

Those who are familiar with matrices should note that the element-wise operators used here are not standard matrix multiplication, for which you should use %*% in R.
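To make the difference between * and %*% concrete, here is a sketch on a small made-up matrix m; element_wise and matrix_product are hypothetical names.

```r
m <- matrix(1:4, nrow = 2)
m
# The rows are 1 3 and 2 4

# Element-wise: each element is multiplied by its counterpart
element_wise <- m * m
element_wise     # rows: 1 9 and 4 16

# Matrix multiplication: rows of the first times columns of the second
matrix_product <- m %*% m
matrix_product   # rows: 7 15 and 10 22
```

The division in this exercise follows the element-wise pattern: each revenue figure is divided by the ticket price in the same position.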

Creating ticket_prices_matrix

ticket_prices <- c(5.0, 5.0, 6.0, 6.0, 7.0, 7.0, 4.0, 4.0, 4.5, 4.5, 4.9, 4.9)
ticket_prices_matrix <- matrix(ticket_prices, nrow = 6, byrow = TRUE,
                               dimnames = list(c("A New Hope", "The Empire Strikes Back", "Return of the Jedi","The Phantom Menace", "Attack of the Clones", "Revenge of the Sith"),
                                               c("US", "non-US")))

Estimated number of visitors

visitors <- all_wars_matrix / ticket_prices_matrix

US visitors

us_visitors <- visitors[,1]

Average number of US visitors

mean(us_visitors)

## [1] 75.01339

This exercise concludes the chapter on matrices. Next stop on your journey through the R language: factors.

4: Factors

What’s a factor and why would you use it?

In this chapter you dive into the wonderful world of factors.

The term factor refers to a statistical data type used to store categorical variables. The difference between a categorical variable and a continuous variable is that a categorical variable can belong to a limited number of categories. A continuous variable, on the other hand, can correspond to an infinite number of values.

It is important that R knows whether it is dealing with a continuous or a categorical variable, as the statistical models you will develop in the future treat both types differently. (You will see later why this is the case.)

A good example of a categorical variable is sex. In many circumstances you can limit the sex categories to “Male” or “Female”. (Sometimes you may need different categories. For example, you may need to consider chromosomal variation, hermaphroditic animals, or different cultural norms, but you will always have a finite number of categories.)

Assign to the variable “theory” what this chapter is about!

theory <- "factors used for categorical variables"

What’s a factor and why would you use it? (2)

To create factors in R, you make use of the function factor(). First thing that you have to do is create a vector that contains all the observations that belong to a limited number of categories. For example, sex_vector contains the sex of 5 different individuals:

sex_vector <- c("Male","Female","Female","Male","Male")

It is clear that there are two categories, or in R-terms factor levels, at work here: “Male” and “Female”.

The function factor() will encode the vector as a factor:

factor_sex_vector <- factor(sex_vector)

Sex vector

sex_vector <- c("Male", "Female", "Female", "Male", "Male")
class(sex_vector[2])

## [1] "character"

class(sex_vector)

## [1] "character"

So currently R does not know that the elements of sex_vector are (or should be treated as) categories.

Convert sex_vector to a factor

factor_sex_vector <- factor(sex_vector)

Print out factor_sex_vector

factor_sex_vector

## [1] Male   Female Female Male   Male  
## Levels: Female Male
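
To see what the conversion changed, it can help to look inside the factor. The following is a minimal sketch using the same sex_vector as above; it relies only on the base R functions levels(), as.integer() and class().

```r
# Same vector as in the exercise above
sex_vector <- c("Male", "Female", "Female", "Male", "Male")
factor_sex_vector <- factor(sex_vector)

# The distinct categories, sorted alphabetically by default
levels(factor_sex_vector)

# The integer code stored for each observation (1 = "Female", 2 = "Male")
as.integer(factor_sex_vector)

# The class is now "factor" rather than "character"
class(factor_sex_vector)
```

Internally, a factor stores one integer code per observation plus the vector of level names, which is why R can now treat the values as categories.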

If you want to find out more about the factor() function, do not hesitate to type ?factor in the console. This will open up a help page. Continue to the next exercise.

What’s a factor and why would you use it? (3)

There are two types of categorical variables: a nominal categorical variable and an ordinal categorical variable.

A nominal variable is a categorical variable without an implied order. This means that it is impossible to say that ‘one is worth more than the other’. For example, think of the categorical variable animals_vector with the categories “Elephant”, “Giraffe”, “Donkey” and “Horse”. Here, it is impossible to say that one stands above or below the other. (Note that some of you might disagree ;-) ).

In contrast, ordinal variables do have a natural ordering. Consider for example the categorical variable temperature_vector with the categories: “Low”, “Medium” and “High”. Here it is obvious that “Medium” stands above “Low”, and “High” stands above “Medium”.

Animals

animals_vector <- c("Elephant", "Giraffe", "Donkey", "Horse")
factor_animals_vector <- factor(animals_vector)
factor_animals_vector

## [1] Elephant Giraffe  Donkey   Horse   
## Levels: Donkey Elephant Giraffe Horse

Temperature

temperature_vector <- c("High", "Low", "High","Low", "Medium")
factor_temperature_vector <- factor(temperature_vector, ordered = TRUE, levels = c("Low", "Medium", "High"))
factor_temperature_vector

## [1] High   Low    High   Low    Medium
## Levels: Low < Medium < High
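
If you are unsure whether R regards a factor as nominal or ordinal, you can ask it directly. The sketch below rebuilds the two factors under shorter, illustrative names (animals and temperature are not part of the exercise) and checks them with is.ordered().

```r
# Nominal factor: no order implied between the levels
animals <- factor(c("Elephant", "Giraffe", "Donkey", "Horse"))

# Ordinal factor: the levels argument fixes the order Low < Medium < High
temperature <- factor(c("High", "Low", "High", "Low", "Medium"),
                      ordered = TRUE,
                      levels = c("Low", "Medium", "High"))

is.ordered(animals)      # FALSE
is.ordered(temperature)  # TRUE
class(temperature)       # "ordered" "factor"
```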

Can you already tell what’s happening in this exercise? Awesome! Continue to the next exercise and get into the details of factor levels.

Factor levels

When you first get a data set, you will often notice that it contains factors with specific factor levels. However, sometimes you will want to change the names of these levels for clarity or other reasons. R allows you to do this with the function levels():

levels(factor_vector) <- c("name1", "name2",...)

A good illustration is the raw data that is provided to you by a survey. A common question for every questionnaire is the sex of the respondent. Here, for simplicity, just two categories were recorded, “M” and “F”. (You usually need more categories for survey data; either way, you use a factor to store the categorical data.)

survey_vector <- c("M", "F", "F", "M", "M")

Recording the sex with the abbreviations “M” and “F” can be convenient if you are collecting data with pen and paper, but it can introduce confusion when analyzing the data. At that point, you will often want to change the factor levels to “Male” and “Female” instead of “M” and “F” for clarity.

Watch out: the order with which you assign the levels is important. If you type levels(factor_survey_vector), you’ll see that it outputs [1] “F” “M”. If you don’t specify the levels of the factor when creating the vector, R will automatically assign them alphabetically. To correctly map “F” to “Female” and “M” to “Male”, the levels should be set to c("Female", "Male"), in this order.

Code to build factor_survey_vector

survey_vector <- c("M", "F", "F", "M", "M")
factor_survey_vector <- factor(survey_vector)
levels(factor_survey_vector)

## [1] "F" "M"

Specify the levels of factor_survey_vector

levels(factor_survey_vector) <- c("Female", "Male")
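
One way to reduce the risk of mixing up the order is to state the mapping explicitly when the factor is created. This is only a sketch of an alternative, not the approach the exercise asks for: the levels and labels arguments of factor() are paired position by position, so "F" becomes "Female" and "M" becomes "Male".

```r
survey_vector <- c("M", "F", "F", "M", "M")

# levels lists the raw values, labels gives the new name for each one
factor_survey_vector <- factor(survey_vector,
                               levels = c("F", "M"),
                               labels = c("Female", "Male"))

factor_survey_vector
levels(factor_survey_vector)  # "Female" "Male"
```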

Summarizing a factor

After finishing this course, one of your favorite functions in R will be summary(). This will give you a quick overview of the contents of a variable:

summary(my_var)

Going back to our survey, you would like to know how many “Male” responses you have in your study, and how many “Female” responses. The summary() function gives you the answer to this question.

Generate summary for survey_vector

summary(survey_vector)

##    Length     Class      Mode 
##         5 character character

Generate summary for factor_survey_vector

summary(factor_survey_vector)

## Female   Male 
##      2      3

Have a look at the output. The fact that you identified “Male” and “Female” as factor levels in factor_survey_vector enables R to show the number of elements for each category.
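
The same counts can also be obtained with table(), and prop.table() turns them into proportions. This is a small sketch that rebuilds factor_survey_vector so it runs on its own; the name counts is just an illustrative choice.

```r
survey_vector <- c("M", "F", "F", "M", "M")
factor_survey_vector <- factor(survey_vector)
levels(factor_survey_vector) <- c("Female", "Male")

# Number of observations per level
counts <- table(factor_survey_vector)
counts

# Share of each level (2 of 5 and 3 of 5)
prop.table(counts)
```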

Battle of the sexes

You might wonder what happens when you try to compare elements of a factor. In factor_survey_vector you have a factor with two levels: “Male” and “Female”. But how does R value these relative to each other?

Male

male <- factor_survey_vector[1]

Female

female <- factor_survey_vector[2]

Battle of the sexes: Male ‘larger’ than female?

male > female

## Warning in Ops.factor(male, female): '>' not meaningful for factors

## [1] NA

By default, R returns NA when you try to compare values in a factor, since the idea doesn’t make sense. Next you’ll learn about ordered factors, where more meaningful comparisons are possible.
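
Note that only the ordering operators are undefined for unordered factors. Testing for equality is still meaningful, as this short sketch (rebuilding the survey factor so that it runs on its own) suggests:

```r
survey_vector <- c("M", "F", "F", "M", "M")
factor_survey_vector <- factor(survey_vector)
levels(factor_survey_vector) <- c("Female", "Male")

male <- factor_survey_vector[1]
female <- factor_survey_vector[2]

# Equality works without a warning
male == female                   # FALSE
male == factor_survey_vector[4]  # TRUE

# Comparing the whole factor with a level name gives a logical vector
factor_survey_vector == "Male"
```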

Ordered factors

Since “Male” and “Female” are unordered (or nominal) factor levels, R returns a warning message, telling you that the greater than operator is not meaningful. As seen before, R attaches an equal value to the levels for such factors.

But this is not always the case! Sometimes you will also deal with factors that do have a natural ordering between their categories. If this is the case, you have to make sure that you pass this information to R…

Let us say that you are leading a research team of five data analysts and that you want to evaluate their performance. To do this, you track their speed, evaluate each analyst as “slow”, “medium” or “fast”, and save the results in speed_vector.

Create speed_vector

speed_vector <- c("medium", "slow", "slow","medium", "fast")

Ordered factors (2)

speed_vector should be converted to an ordinal factor since its categories have a natural ordering. By default, the function factor() transforms speed_vector into an unordered factor. To create an ordered factor, you have to add two additional arguments: ordered and levels.

factor(some_vector,
       ordered = TRUE,
       levels = c("lev1", "lev2", ...))

By setting the argument ordered to TRUE in the function factor(), you indicate that the factor is ordered. With the argument levels you give the values of the factor in the correct order.

Convert speed_vector to ordered factor vector

factor_speed_vector <- factor(speed_vector,
                        ordered = TRUE,
                        levels = c("slow", "medium", "fast"))

Print factor_speed_vector

factor_speed_vector

## [1] medium slow   slow   medium fast  
## Levels: slow < medium < fast

Have a look at the console. It is now indicated that the Levels indeed have an order associated, with the < sign. Continue to the next exercise.

Comparing ordered factors

Having a bad day at work, ‘data analyst number two’ enters your office and starts complaining that ‘data analyst number five’ is slowing down the entire project. Since you know that ‘data analyst number two’ has the reputation of being a smarty-pants, you first decide to check if his statement is true.

The fact that factor_speed_vector is now ordered enables us to compare different elements (the data analysts in this case). You can simply do this by using the well-known operators.

Factor value for second data analyst

da2 <- factor_speed_vector[2]

Factor value for fifth data analyst

da5 <- factor_speed_vector[5]

Is data analyst 2 faster than data analyst 5?

da2 > da5

## [1] FALSE

What do the results tell you? Data analyst two is complaining about data analyst five while in fact they are the one slowing everything down! This concludes the chapter on factors. With a solid basis in vectors, matrices and factors, you’re ready to dive into the wonderful world of data frames, a very important data structure in R!
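
As a brief aside before leaving factors: the comparison operators also work on a whole ordered factor at once, and an ordered factor can be compared against one of its level names. The sketch below rebuilds factor_speed_vector so that it is self-contained.

```r
speed_vector <- c("medium", "slow", "slow", "medium", "fast")
factor_speed_vector <- factor(speed_vector,
                              ordered = TRUE,
                              levels = c("slow", "medium", "fast"))

# Which analysts are slower than "medium"?
factor_speed_vector < "medium"
which(factor_speed_vector < "medium")  # 2 3

# The highest level that occurs in the data
max(factor_speed_vector)
```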

5: Data frames

What’s a data frame?

You may remember from the chapter about matrices that all the elements that you put in a matrix should be of the same type. Back then, your data set on Star Wars only contained numeric elements.

When doing a market research survey, however, you often have questions such as:

‘Are you married?’ or ‘yes/no’ questions (logical)
‘How old are you?’ (numeric)
‘What is your opinion on this product?’ or other ‘open-ended’ questions (character)
…

The output, namely the respondents’ answers to the questions formulated above, is a data set of different data types. You will often find yourself working with data sets that contain different data types instead of only one.

A data frame has the variables of a data set as columns and the observations as rows. This will be a familiar concept for those coming from different statistical software packages such as SAS or SPSS.

Print out built-in R data frame

mtcars

##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

Quick, have a look at your data set

Wow, that is a lot of cars!

Working with large data sets is not uncommon in data analysis. When you work with (extremely) large data sets and data frames, your first task as a data analyst is to develop a clear understanding of its structure and main elements. Therefore, it is often useful to show only a small part of the entire data set.

So how to do this in R? Well, the function head() enables you to show the first observations of a data frame. Similarly, the function tail() prints out the last observations in your data set.

Both head() and tail() print a top line called the ‘header’, which contains the names of the different variables in your data set.

Call head() on mtcars

head(mtcars)

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

tail(mtcars)

##                 mpg cyl  disp  hp drat    wt qsec vs am gear carb
## Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.7  0  1    5    2
## Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.9  1  1    5    2
## Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.5  0  1    5    4
## Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.5  0  1    5    6
## Maserati Bora  15.0   8 301.0 335 3.54 3.570 14.6  0  1    5    8
## Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.6  1  1    4    2
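
Both functions show six rows by default. If you want a different number, you can pass the n argument, as in this short sketch (first_three and last_two are illustrative names):

```r
# First three and last two observations of the built-in mtcars data frame
first_three <- head(mtcars, n = 3)
last_two <- tail(mtcars, n = 2)

first_three
last_two
```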

So, what do we have in this data set? For example, hp represents the car’s horsepower; the Datsun has the lowest horsepower of the 6 cars displayed by head(). For a full overview of the variables’ meaning, type ?mtcars in the console and read the help page.

Having used ?mtcars to view an explanation of the variables represented in this data frame, what does the variable am refer to? And hp?

Have a look at the structure

Another method that is often used to get a rapid overview of your data is the function str(). The function str() shows you the structure of your data set. For a data frame it tells you:

The total number of observations (e.g. 32 types of car were tested)
The total number of variables (e.g. 11 car features)
A full list of the variables names (e.g. mpg, cyl … )
The data type of each variable (e.g. num)
The first observations

Applying the str() function will often be the first thing that you do when receiving a new data set or data frame. It is a great way to get more insight into your data set before diving into the real analysis.

Investigate the structure of mtcars

str(mtcars)

## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
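
If you only need one of these pieces of information, base R also offers more targeted functions. A brief sketch on the same built-in data frame:

```r
# Number of observations and variables
dim(mtcars)   # 32 11
nrow(mtcars)  # 32
ncol(mtcars)  # 11

# Names of the variables
names(mtcars)

# Class of each variable (all columns of mtcars are numeric)
sapply(mtcars, class)
```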

Creating a data frame

Since using built-in data sets is not even half the fun of creating your own data sets, the rest of this chapter is based on your personally developed data set. Put your jet pack on because it is time for some space exploration!

As a first goal, you want to construct a data frame that describes the main characteristics of eight planets in our solar system. According to your good friend Buzz, the main features of a planet are:

The type of planet (Terrestrial or Gas Giant).
The planet’s diameter relative to the diameter of the Earth.
The planet’s rotation period (about its own axis) relative to that of the Earth.
If the planet has rings or not (TRUE or FALSE).

After doing some high-quality research on Wikipedia, you feel confident enough to create the necessary vectors: name, type, diameter, rotation and rings.

Definition of vectors

name <- c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune")
type <- c("Terrestrial planet", "Terrestrial planet", "Terrestrial planet", "Terrestrial planet", "Gas giant", "Gas giant", "Gas giant", "Gas giant")
diameter <- c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883)
rotation <- c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67)
rings <- c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)

The first element in each of these vectors corresponds to the first observation.

Create a data frame from the vectors

Next we construct a data frame with the data.frame() function. As arguments, you pass the vectors from before: they will become the different columns of your data frame. Because every column has the same length, the vectors you pass should also have the same length. But don’t forget that it is possible (and likely) that they contain different types of data.

planets_df <- data.frame(name, type, diameter, rotation, rings)

The logical next step, as you know by now, is inspecting the data frame you just created.
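
To illustrate why the vectors need to have matching lengths, here is a sketch with two hypothetical vectors, short_name and short_diameter, which are not part of the exercise. data.frame() refuses to combine them, and tryCatch() is used only to capture the error message instead of stopping the script.

```r
# Hypothetical vectors of different lengths: 3 names but only 2 diameters
short_name <- c("Mercury", "Venus", "Earth")
short_diameter <- c(0.382, 0.949)

# data.frame() raises an error; keep its message as a character string
result <- tryCatch(
  data.frame(short_name, short_diameter),
  error = function(e) conditionMessage(e)
)

result
```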

Creating a data frame (2)

The planets_df data frame we just created should have 8 observations and 5 variables. Let’s double check!

Check the structure of planets_df

str(planets_df)

## 'data.frame':    8 obs. of  5 variables:
##  $ name    : chr  "Mercury" "Venus" "Earth" "Mars" ...
##  $ type    : chr  "Terrestrial planet" "Terrestrial planet" "Terrestrial planet" "Terrestrial planet" ...
##  $ diameter: num  0.382 0.949 1 0.532 11.209 ...
##  $ rotation: num  58.64 -243.02 1 1.03 0.41 ...
##  $ rings   : logi  FALSE FALSE FALSE FALSE TRUE TRUE ...

Now that you have a clear understanding of the planets_df data set, it’s time to see how you can select elements from it.

Selection of data frame elements

Similar to vectors and matrices, you select elements from a data frame with the help of square brackets [ ]. By using a comma, you can indicate what to select from the rows and the columns respectively. For example:

my_df[1,2] selects the value at the first row and second column in my_df.
my_df[1:3,2:4] selects the values that appear in columns 2, 3, 4 of rows 1, 2, 3 in my_df.
my_df[1, ] selects all elements of the first row.
my_df[, 4] selects all elements of the fourth column.

Print out diameter of Mercury (row 1, column 3)

planets_df[1,3]

## [1] 0.382

Print out data for Mars (entire fourth row)

planets_df[4,]

##   name               type diameter rotation rings
## 4 Mars Terrestrial planet    0.532     1.03 FALSE

Apart from selecting elements from your data frame by index, you can also use the column names.

Selection of data frame elements (2)

Instead of using numerics to select elements of a data frame, you can also use the variable names to select columns of a data frame.

Suppose you want to select the first three elements of the type column. One way to do this is

planets_df[1:3,2]

A possible disadvantage of this approach is that you have to know (or look up) the column number of type, which gets hard if you have a lot of variables. It is often easier to just make use of the variable name:

planets_df[1:3,"type"]

Select the first 5 values of diameter column

planets_df[1:5, "diameter"]

## [1]  0.382  0.949  1.000  0.532 11.209

planets_df[, "diameter"]

## [1]  0.382  0.949  1.000  0.532 11.209  9.449  4.007  3.883
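
Column names are also convenient when you want several columns at once. The sketch below uses a reduced, four-planet version of planets_df so that it runs on its own. It also shows the drop = FALSE argument, which keeps a single selected column as a data frame instead of simplifying it to a vector.

```r
# Reduced version of planets_df, for illustration only
planets_df <- data.frame(
  name = c("Mercury", "Venus", "Earth", "Mars"),
  type = rep("Terrestrial planet", 4),
  diameter = c(0.382, 0.949, 1, 0.532)
)

# Two columns by name, first three rows
two_columns <- planets_df[1:3, c("name", "diameter")]
two_columns

# A single column, kept as a one-column data frame
one_column <- planets_df[1:3, "diameter", drop = FALSE]
one_column
```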

Only planets with rings

You will often want to select an entire column, namely one specific variable from a data frame. If you want to select all elements of the variable diameter, for example, both of these will do the trick:

planets_df[,3]
planets_df[,"diameter"]

However, there is a short-cut. If your columns have names, you can use the $ sign:

planets_df$diameter

View the diameter variable from planets_df

planets_df$diameter

## [1]  0.382  0.949  1.000  0.532 11.209  9.449  4.007  3.883

Select the rings variable from planets_df

rings_vector <- planets_df$rings

Print out rings_vector

rings_vector

## [1] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE

Of course this vector is identical to the rings vector we built in the first place, when constructing the planets_df data frame.
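
You can let R confirm this with identical(). The sketch below builds a two-column version of planets_df for illustration and also shows that double square brackets with a column name return the same vector as the $ sign.

```r
diameter <- c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883)
rings <- c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)

# Two-column version of planets_df, for illustration only
planets_df <- data.frame(diameter, rings)

# The extracted column equals the vector it was built from
identical(planets_df$rings, rings)                        # TRUE

# Three ways to extract the same column
identical(planets_df[["diameter"]], planets_df$diameter)  # TRUE
identical(planets_df[, "diameter"], planets_df$diameter)  # TRUE
```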

Continue to the next exercise and discover yet another way of subsetting!

Only planets with rings (2)

You probably remember from high school that some planets in our solar system have rings and others do not. Unfortunately you can not recall their names. Could R help you out?

If you type rings_vector in the console, you get:

[1] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE

This means that the first four observations (or planets) do not have a ring (FALSE), but the other four do (TRUE). However, you do not get a nice overview of the names of these planets, their diameter, etc. Let’s try to use rings_vector to select the data for the four planets with rings.

Adapt the code to select all columns for planets with rings

planets_df[rings_vector,]

##      name      type diameter rotation rings
## 5 Jupiter Gas giant   11.209     0.41  TRUE
## 6  Saturn Gas giant    9.449     0.43  TRUE
## 7  Uranus Gas giant    4.007    -0.72  TRUE
## 8 Neptune Gas giant    3.883     0.67  TRUE

Notice that we’ve put this logical vector before the comma. This tells R to display the observations associated with a TRUE value, and not those with a FALSE value.

What would happen if you put rings_vector after the comma? And why? Try it and see!

Using planets_df[rings_vector,] is a rather tedious solution. The next exercise will teach you how to do it in a more concise way.

Only planets with rings but shorter

So what exactly did you learn in the previous exercises? You selected a subset from a data frame (planets_df) based on whether or not a certain condition was true (rings or no rings), and you managed to pull out all relevant data. Pretty awesome! By now, NASA is probably already flirting with your CV ;-).

Now, let us move up one level and use the function subset(). You should see the subset() function as a short-cut to do exactly the same as what you did in the previous exercises.

subset(my_df, subset = some_condition)

The first argument of subset() specifies the data set for which you want a subset. By adding the second argument, you give R the necessary information and conditions to select the correct subset.

The code below will give the exact same result as you got in the previous exercise, but this time, you didn’t need to create the rings_vector!

subset(planets_df, subset = rings)

Select planets with diameter < 1

subset(planets_df, diameter < 1)

##      name               type diameter rotation rings
## 1 Mercury Terrestrial planet    0.382    58.64 FALSE
## 2   Venus Terrestrial planet    0.949  -243.02 FALSE
## 4    Mars Terrestrial planet    0.532     1.03 FALSE

subset(planets_df, rings == TRUE)

##      name      type diameter rotation rings
## 5 Jupiter Gas giant   11.209     0.41  TRUE
## 6  Saturn Gas giant    9.449     0.43  TRUE
## 7  Uranus Gas giant    4.007    -0.72  TRUE
## 8 Neptune Gas giant    3.883     0.67  TRUE

Not only is the subset() function more concise, it is probably also more understandable for people who read your code.
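
For comparison, here is a sketch that places subset() next to the equivalent square-bracket selection, using a three-column version of planets_df so that it is self-contained. It also uses the select argument of subset(), which is not covered in the exercise, to keep only some of the columns.

```r
# Three-column version of planets_df, for illustration only
planets_df <- data.frame(
  name = c("Mercury", "Venus", "Earth", "Mars",
           "Jupiter", "Saturn", "Uranus", "Neptune"),
  diameter = c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883),
  rings = c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)
)

# The same rows, selected in two ways
by_subset <- subset(planets_df, subset = diameter < 1)
by_bracket <- planets_df[planets_df$diameter < 1, ]
identical(by_subset$name, by_bracket$name)  # TRUE

# Combine conditions with & and keep only two columns
big_ringed <- subset(planets_df,
                     subset = rings & diameter > 5,
                     select = c(name, diameter))
big_ringed
```

Inside subset(), column names can be written without the planets_df$ prefix, which is a large part of what makes the call shorter.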

Sorting

Making and creating rankings is one of mankind’s favorite affairs. These rankings can be useful (best universities in the world), entertaining (most influential movie stars) or pointless (best 007 look-a-like).

In data analysis you can sort your data according to a certain variable in the data set. In R, this is done with the help of the function order().

order() is a function that, when applied to a variable such as a vector, returns the positions of its elements in the order that would sort them:

Sort the vector a

a <- c(1000, 10, 100)
order(a)

## [1] 2 3 1

a

## [1] 1000   10  100

10, which is the second element in a, is the smallest element, so 2 comes first in the output of order(a). 100, which is the third element in a, is the second smallest element, so 3 comes second in the output of order(a).

Note that order(a) has not altered a itself.

We can use the output of order(a) to reshuffle a:

Reshuffle a in ascending order

a[order(a)]

## [1]   10  100 1000

Once more, be aware that we haven’t altered a at all. We could, by assigning a[order(a)] to a (or to b if we wanted to keep a intact).
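
A few related calls are worth trying in the console. This sketch uses the same vector a and contrasts order() with sort() and rank(), which are easy to confuse:

```r
a <- c(1000, 10, 100)

# Positions that would sort a from largest to smallest
order(a, decreasing = TRUE)  # 1 3 2

# sort() returns the sorted values themselves
sort(a)                      # 10 100 1000

# rank() returns the rank of each element in its original position
rank(a)                      # 3 1 2
```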

Play around with the order function in the console

Now let’s use the order() function to sort your data frame!

Sorting your data frame

Alright, now that you understand the order() function, let us do something useful with it. You would like to rearrange your data frame such that it starts with the smallest planet and ends with the largest one. This will be a sort on the diameter column.

Use order() to create positions

positions <- order(planets_df$diameter)

Use positions to sort planets_df

planets_df[positions, ]

##      name               type diameter rotation rings
## 1 Mercury Terrestrial planet    0.382    58.64 FALSE
## 4    Mars Terrestrial planet    0.532     1.03 FALSE
## 2   Venus Terrestrial planet    0.949  -243.02 FALSE
## 3   Earth Terrestrial planet    1.000     1.00 FALSE
## 8 Neptune          Gas giant    3.883     0.67  TRUE
## 7  Uranus          Gas giant    4.007    -0.72  TRUE
## 6  Saturn          Gas giant    9.449     0.43  TRUE
## 5 Jupiter          Gas giant   11.209     0.41  TRUE
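
To start with the largest planet instead, you can pass decreasing = TRUE to order(). The sketch below uses a two-column version of planets_df so that it runs on its own; largest_first is an illustrative name.

```r
# Two-column version of planets_df, for illustration only
planets_df <- data.frame(
  name = c("Mercury", "Venus", "Earth", "Mars",
           "Jupiter", "Saturn", "Uranus", "Neptune"),
  diameter = c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883)
)

# Row positions from largest to smallest diameter
positions <- order(planets_df$diameter, decreasing = TRUE)

largest_first <- planets_df[positions, ]
largest_first
```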

This exercise concludes the chapter on data frames. Remember that data frames are extremely important in R; you will need them all the time. Another very often used data structure is the list. This will be the subject of the next chapter!

6: Lists

Lists, why would you need them?

Congratulations! At this point in the course you are already familiar with:

Vectors (one dimensional array): can hold numeric, character or logical values. The elements in a vector all have the same data type.
Matrices (two dimensional array): can hold numeric, character or logical values. The elements in a matrix all have the same data type.
Data frames (two-dimensional objects): can hold numeric, character or logical values. Within a column all elements have the same data type, but different columns can be of different data type.

Pretty sweet for an R newbie, right? ;-)

Lists, why would you need them? (2)

A list in R is similar to your to-do list at work or school: the different items on that list most likely differ in length, characteristic, and type of activity that has to be done.

A list in R allows you to gather a variety of objects under one name (that is, the name of the list) in an ordered way. These objects can be matrices, vectors, data frames, even other lists, etc. It is not even required that these objects are related to each other in any way.

You could say that a list is some kind of super data type: you can store practically any piece of information in it!

Cool. Let’s get our hands dirty!

Creating a list

Let us create our first list! To construct a list you use the function list():

my_list <- list(comp1, comp2, ...)

The arguments to the list function are the list components. Remember, these components can be matrices, vectors, other lists, …

Vector with numerics from 1 up to 10

my_vector <- 1:10

Matrix with numerics from 1 up to 9

my_matrix <- matrix(1:9, ncol = 3)

First 10 rows of the built-in data frame mtcars

my_df <- mtcars[1:10,]

Construct list with these different elements:

my_list <- list(my_vector, my_matrix, my_df)

Print my_list

my_list

## [[1]]
##  [1]  1  2  3  4  5  6  7  8  9 10
## 
## [[2]]
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
## 
## [[3]]
##                    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360        14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D         24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230          22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280          19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
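
Printing a list shows everything it contains, which quickly becomes long. As a sketch of a more compact inspection, length() reports the number of components and str() with max.level = 1 gives a one-line summary of each.

```r
my_vector <- 1:10
my_matrix <- matrix(1:9, ncol = 3)
my_df <- mtcars[1:10, ]
my_list <- list(my_vector, my_matrix, my_df)

# Number of components in the list
length(my_list)  # 3

# One summary line per component, without expanding nested structure
str(my_list, max.level = 1)
```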

Creating a named list

Well done, you’re on a roll!

Just like on your to-do list, you want to avoid not knowing or remembering what the components of your list stand for. That is why you should give names to them:

my_list <- list(name1 = your_comp1, name2 = your_comp2)

This creates a list with components that are named name1, name2, and so on. If you want to name your lists after you’ve created them, you can use the names() function as you did with vectors. The following commands are fully equivalent to the assignment above:

my_list <- list(your_comp1, your_comp2)
names(my_list) <- c("name1", "name2")
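
A quick way to convince yourself that the two approaches give the same result is to build the list both ways and compare. In this sketch, your_comp1 and your_comp2 are filled with arbitrary placeholder values, and list_a and list_b are illustrative names.

```r
# Placeholder components, chosen only for illustration
your_comp1 <- 1:3
your_comp2 <- c("a", "b")

# Names given inside list()
list_a <- list(name1 = your_comp1, name2 = your_comp2)

# Names assigned afterwards with names()
list_b <- list(your_comp1, your_comp2)
names(list_b) <- c("name1", "name2")

identical(list_a, list_b)  # TRUE
```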

Adapt list() call to give the components names

my_list <- list(vec = my_vector, mat = my_matrix, df = my_df)

Print out my_list

my_list

## $vec
##  [1]  1  2  3  4  5  6  7  8  9 10
## 
## $mat
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
## 
## $df
##                    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360        14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D         24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230          22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280          19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4

Great! Not only do you know how to construct lists now, you can also name them; a skill that will prove most useful in practice. Continue to the next exercise.

Creating a named list (2)

Being a huge movie fan (remember your job at LucasFilms), you decide to start storing information on good movies with the help of lists.

Start by creating a list for the movie “The Shining”.

Creating mov and act

mov <- "The Shining"
act <- c("Jack Nicholson", "Shelley Duvall", "Danny Lloyd","Scatman Crothers", "Barry Nelson")

Creating rev

scores <- c(4.5, 4, 5)
sources <- c("IMDb1", "IMDb2", "IMDb3")
comments <- c("Best Horror Film I Have Ever Seen", "A truly brilliant and scary film from Stanley Kubrick", "A masterpiece of psychological horror")
rev <- data.frame(scores, sources, comments)

Finish the code to build shining_list

shining_list <- list(moviename = mov, actors = act, reviews = rev)

Wonderful! You now know how to construct and name lists. As in the previous chapters, let’s look at how to select elements from lists. Head over to the next exercise.

Selecting elements from a list

Your list will often be built out of numerous elements and components. Therefore, getting a single element, multiple elements, or a component out of it is not always straightforward.

One way to select a component is using the numbered position of that component. For example, to “grab” the first component of shining_list you type

shining_list[[1]]

A quick way to check this out is typing it in the console. Important to remember: to select elements from vectors, you use single square brackets: [ ]. Don’t mix them up!

You can also refer to the names of the components, with [[ ]] or with the $ sign. Two ways of selecting the data frame representing the reviews:

shining_list[["reviews"]]
shining_list$reviews

Besides selecting components, you often need to select specific elements out of these components. For example, with shining_list[[2]][1] you select from the second component, actors (shining_list[[2]]), the first element ([1]). When you type this in the console, you will see the answer is Jack Nicholson.
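
The difference between the bracket styles can be sketched on a small stand-in list. The list movie below is a hypothetical, shortened version of shining_list, used only so that the snippet runs on its own.

```r
# A small stand-in list (hypothetical; not the full shining_list)
movie <- list(moviename = "The Shining",
              actors = c("Jack Nicholson", "Shelley Duvall"))

# Double brackets return the component itself: here, a character vector
class(movie[["actors"]])

# Single brackets return a shorter list that still wraps the component
class(movie["actors"])

# Position, name, and $ all reach the same component
identical(movie[[2]], movie$actors)

# Chain [[ ]] and [ ] to reach an element inside a component
movie[[2]][1]
```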

Print out the vector representing the actors

shining_list$actors

## [1] "Jack Nicholson"   "Shelley Duvall"   "Danny Lloyd"      "Scatman Crothers"
## [5] "Barry Nelson"

Print the second element of the vector representing the actors

shining_list$actors[2]

## [1] "Shelley Duvall"

Great! Selecting elements from lists is rather easy, isn’t it? Continue to the next exercise.

Creating a new list for another movie

You found reviews of another, more recent, Jack Nicholson movie: The Departed!

Scores  Comments
4.6     I would watch it again
5       Amazing!
4.8     I liked it
5       One of the best movies
4.2     Fascinating plot

It would be useful to collect together all the pieces of information about the movie, like the title, actors, and reviews into a single variable. Since these pieces of data are different shapes, it is natural to combine them in a list variable.

Create movie_title and movie_actors

movie_title <- "The Departed"
movie_actors <- c("Leonardo DiCaprio", "Matt Damon", "Jack Nicholson", "Mark Wahlberg", "Vera Farmiga", "Martin Sheen")

Use the table from the exercise to define the comments and scores vectors

scores <- c(4.6, 5, 4.8, 5, 4.2)
comments <- c("I would watch it again", "Amazing!", "I liked it", "One of the best movies", "Fascinating plot")

Save the average of the scores vector as avg_review

avg_review <- mean(scores)

Combine scores and comments into the reviews_df data frame

reviews_df <- data.frame(scores, comments)

Create and print out a list, called departed_list

departed_list <- list(movie_title, movie_actors, reviews_df, avg_review)
departed_list

## [[1]]
## [1] "The Departed"
## 
## [[2]]
## [1] "Leonardo DiCaprio" "Matt Damon"        "Jack Nicholson"   
## [4] "Mark Wahlberg"     "Vera Farmiga"      "Martin Sheen"     
## 
## [[3]]
##   scores               comments
## 1    4.6 I would watch it again
## 2    5.0               Amazing!
## 3    4.8             I liked it
## 4    5.0 One of the best movies
## 5    4.2       Fascinating plot
## 
## [[4]]
## [1] 4.72

Good work! You successfully created another list of movie information, and combined different components into a single list. Congratulations on finishing the course!

Module 2: Intermediate R

Video: Relational Operators

Equality

The most basic form of comparison is equality. Let’s briefly recap its syntax. The following statements all evaluate to TRUE (feel free to try them out in the console).

Playing around with equalities and inequalities

3 == (2 + 1)

## [1] TRUE

"intermediate" != "r"

## [1] TRUE

TRUE != FALSE

## [1] TRUE

"Rchitect" != "rchitect"

## [1] TRUE

Notice from the last expression that R is case sensitive: “R” is not equal to “r”. Keep this in mind when solving the exercises in this chapter!

Comparison of logicals

TRUE == FALSE

## [1] FALSE

Comparison of numerics

-6 * 14 != 17 - 101

## [1] FALSE

Comparison of character strings

"useR" == "user"

## [1] FALSE

Compare a logical with a numeric

TRUE == 1

## [1] TRUE

Awesome! Since TRUE coerces to 1 under the hood, TRUE == 1 evaluates to TRUE. Make sure not to mix up == (comparison) and = (used to set arguments in function calls). == is what you need to check the equality of R objects.
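
A short sketch of both points, coercion of logicals and the two roles of = and ==. The variable x is only there for illustration.

```r
# Logicals are coerced to numbers when compared with numerics
TRUE == 1
FALSE == 0

# The same coercion applies in arithmetic
TRUE + TRUE

# = sets arguments inside a function call ...
x <- seq(from = 1, to = 3)

# ... while == asks whether two values are equal
length(x) == 3
```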

Greater and less than

Apart from equality operators, Filip also introduced the less than and greater than operators: < and >. You can also add an equal sign to express less than or equal to or greater than or equal to, respectively. Have a look at the following R expressions, which all evaluate to FALSE:

(1 + 2) > 4
"dog" < "Cats"
TRUE <= FALSE

Remember that for string comparison, R determines the greater than relationship based on alphabetical order. Also, keep in mind that TRUE is treated as 1 for arithmetic, and FALSE is treated as 0. Therefore, FALSE < TRUE is TRUE.
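
These rules can be tried out with a few one-liners. The strings are arbitrary examples, kept all lowercase because the ordering of mixed-case strings can depend on the locale R runs in.

```r
# Strings compare in alphabetical (dictionary) order
"apple" < "banana"

# A string sorts before any longer string that starts with it
"rain" < "raining"

# In comparisons and arithmetic, FALSE acts as 0 and TRUE as 1
FALSE < TRUE

# Summing a logical vector therefore counts its TRUE values
sum(c(TRUE, FALSE, TRUE))
```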

Comparison of numerics

-6 * 5 + 2 >= -10 + 1

## [1] FALSE

Comparison of character strings

"raining" <= "raining dogs"

## [1] TRUE

Comparison of logicals

TRUE > FALSE

## [1] TRUE

Make sure to have a look at the console output to see if R returns the results you expected.

Compare vectors

You are already aware that R is very good with vectors. Without having to change anything about the syntax, R’s relational operators also work on vectors. But be careful: the comparison is element-by-element, so the two vectors should have the same number of elements (i.e. the same length). If they do not, R recycles the shorter vector to match the longer one.
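
Here is a small sketch of element-by-element comparison. The vectors a, b, and short are hypothetical. It also shows what R does when the lengths differ: it recycles the shorter vector rather than stopping, which is why matching lengths is the safe habit.

```r
a <- c(1, 5, 3, 8)
b <- c(2, 5, 1, 9)

# Same length: elements are compared position by position
a < b

# A single number is compared against every element
a > 4

# A shorter vector is recycled: c(2, 6) is used as 2, 6, 2, 6
short <- c(2, 6)
a > short
```

When the longer length is not a multiple of the shorter one, R still returns a result but issues a warning.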

Let’s go back to the example that was started in the video. You want to figure out whether your activity on social media platforms has paid off and decide to look at your results for LinkedIn and Facebook. The sample code in the editor initializes the vectors linkedin and facebook. Each of the vectors contains the number of profile views your LinkedIn and Facebook profiles had over the last seven days.

Create the vectors linkedin and facebook (used in the video)

linkedin <- c(16, 9, 13, 5, 2, 17, 14)
facebook <- c(17, 7, 5, 16, 8, 13, 14)

Popular days

linkedin > 15

## [1]  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE

Quiet days

linkedin <= 5

## [1] FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE

LinkedIn more popular than Facebook

linkedin > facebook

## [1] FALSE  TRUE  TRUE FALSE FALSE  TRUE FALSE

Have a look at the console output. Your LinkedIn profile was pretty popular on the sixth day, but less so on the fourth and fifth day.

Compare matrices

R’s ability to deal with different data structures for comparisons does not stop at vectors. Matrices and relational operators also work together seamlessly!

First we’ll store the LinkedIn and Facebook data in a matrix (rather than in vectors). We’ll call this matrix views. The first row contains the LinkedIn information; the second row the Facebook information. The original vectors facebook and linkedin are still available as well.

Create social media data matrix

views <- matrix(c(linkedin, facebook), nrow = 2, byrow = TRUE)

When does views equal 13?

views == 13

##       [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]
## [1,] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
## [2,] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE

When is views less than or equal to 14?

views <= 14

##       [,1] [,2] [,3]  [,4] [,5]  [,6] [,7]
## [1,] FALSE TRUE TRUE  TRUE TRUE FALSE TRUE
## [2,] FALSE TRUE TRUE FALSE TRUE  TRUE TRUE

This exercise concludes the part on comparators. Now that you know how to query the relation between R objects, the next step will be to use the results to alter the behavior of your programs. Find out all about that in the next video!

Video: Logical Operators

& and |

Before you work your way through the next exercises, have a look at the following R expressions. All of them will evaluate to TRUE:

TRUE & TRUE
FALSE | TRUE
5 <= 5 & 2 < 3
3 < 4 | 7 < 6

Watch out: 3 < x < 7 to check if x is between 3 and 7 will not work; you’ll need 3 < x & x < 7 for that.
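
A minimal sketch of the range check, with x set to an arbitrary value. The chained form is passed to parse() as text so that the snippet still runs: R rejects it as a syntax error instead of evaluating it.

```r
x <- 5

# Combine two comparisons with & to test a range
3 < x & x < 7

# The chained form is not valid R syntax; parse() reports an error
chained <- tryCatch(
  parse(text = "3 < x < 7"),
  error = function(e) "syntax error"
)
chained
```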

In this exercise, you’ll be working with the last variable. We’ll make this variable equal the last value of the linkedin vector that you’ve worked with previously. The linkedin vector represents the number of LinkedIn views your profile had in the last seven days, remember?

Defining the last variable

last <- tail(linkedin, 1)

Is last under 5 or above 10?

last < 5 | last > 10

## [1] TRUE

Is last between 15 (exclusive) and 20 (inclusive)?

last > 15 & last <= 20

## [1] FALSE

Have one last look at the console before proceeding; do the results of the different expressions make sense?

& and | (2)

Like relational operators, logical operators work perfectly fine with vectors and matrices.

Ready for some advanced queries to gain more insights into your social outreach?

linkedin exceeds 10 but facebook below 10

linkedin > 10 & facebook < 10

## [1] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE

When were one or both visited at least 12 times?

linkedin >= 12 | facebook >= 12

## [1]  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE

When is views between 11 (exclusive) and 14 (inclusive)?

views > 11 & views <= 14

##       [,1]  [,2]  [,3]  [,4]  [,5]  [,6] [,7]
## [1,] FALSE FALSE  TRUE FALSE FALSE FALSE TRUE
## [2,] FALSE FALSE FALSE FALSE FALSE  TRUE TRUE

You’ll have noticed how easy it is to apply logical operators to vectors and matrices. What do these results tell us? The third day of the recordings was the only day where your LinkedIn profile was visited more than 10 times, while your Facebook profile wasn’t. Can you draw similar conclusions for the other results?

Question: Reverse the result: !

Blend it all together

With the things you’ve learned by now, you’re able to solve pretty cool problems.

Instead of recording the number of views for your own LinkedIn profile, suppose you conducted a survey inside the company you’re working for. You’ve asked every employee with a LinkedIn profile how many visits their profile has had over the past seven days. The data will be stored in the matrix employees_views, then a data frame li_df will be created from this matrix, with appropriate names for the rows. Finally, names will be added to the columns.

Creating data frame li_df

employees_views <- matrix(c(2, 3, 3, 6, 4, 2, 0, 19, 23, 18, 22, 23, 29, 25, 24, 18, 15, 19, 18, 22, 17, 22, 18, 27, 26, 19, 21, 25, 25, 25, 26, 31, 24, 36, 37, 22, 20, 29, 26, 23, 22, 29, 0, 4, 2, 2, 3, 4, 2, 12, 3, 15, 7, 1, 15, 11, 19, 22, 22, 19, 25, 24, 23, 23, 12, 19, 25, 18, 22, 22, 29, 27, 23, 25, 29, 30, 17, 13,   13, 20, 17, 12, 22, 20, 7, 17, 9, 5, 11, 9, 9, 26, 27, 28, 36, 29, 31, 30, 7, 6, 4, 11, 5, 5, 15, 32, 35, 31, 35, 24, 25, 36, 7, 17, 9, 12, 13, 6, 12, 9, 6, 3, 12, 3, 8, 6, 0, 1, 11, 6, 0, 4, 11, 9, 12, 6, 13, 12, 13, 11, 6, 15, 15, 10, 9, 7, 18, 17,   17, 12, 4, 14, 17, 7, 1, 12, 8, 2, 4, 4, 11, 5, 8, 0, 1, 6, 3, 1, 2, 7, 5, 3, 1, 5, 5, 29, 25, 32, 28,  28, 27, 27, 17, 15, 17, 23, 23, 17, 22, 26, 32, 33, 30, 33,   28, 26, 27, 29, 24, 29, 26, 31, 28, 4, 1, 1, 2, 1, 7, 4, 22, 22, 17, 20, 14, 19, 13, 9, 11, 7, 10, 8, 15, 5,  6, 5, 12, 5,   17, 17, 4, 18, 17, 12, 22, 22, 13, 12, 2, 12, 13, 7, 10, 6, 2, 32, 26, 20, 23, 24, 25, 21,  5, 13, 12, 11, 6, 5, 10,  6, 10, 11, 6, 6, 2, 5, 30, 37, 32, 35, 37, 41, 42, 34, 33, 32, 35, 33,   27,  35,  15, 19, 21, 18, 22, 26, 22, 28, 29, 30, 19, 21, 19, 26,  6, 8, 6, 7, 17, 11, 14, 17, 22, 27, 24, 18, 28, 24,  6, 10, 17, 18,  13, 10, 7, 18, 19, 22, 17, 21, 15, 23, 21, 27, 28, 28, 26, 17, 25, 10, 18, 20, 18, 12, 19, 17,  6, 15, 15, 15, 10, 14, 2, 30, 28, 29, 31, 24, 20, 25), nrow = 50, byrow = TRUE)
li_df <- data.frame(employees_views, row.names = c("employee_1", "employee_2", "employee_3", "employee_4", "employee_5", "employee_6", "employee_7", "employee_8", "employee_9", "employee_10", "employee_11", "employee_12", "employee_13", "employee_14", "employee_15", "employee_16", "employee_17", "employee_18", "employee_19", "employee_20", "employee_21", "employee_22", "employee_23", "employee_24", "employee_25", "employee_26", "employee_27", "employee_28", "employee_29", "employee_30", "employee_31", "employee_32", "employee_33", "employee_34", "employee_35", "employee_36", "employee_37", "employee_38", "employee_39", "employee_40", "employee_41", "employee_42", "employee_43", "employee_44", "employee_45", "employee_46", "employee_47", "employee_48", "employee_49", "employee_50") )
names(li_df)[1] <- "day1"
names(li_df)[2] <- "day2"
names(li_df)[3] <- "day3"
names(li_df)[4] <- "day4"
names(li_df)[5] <- "day5"
names(li_df)[6] <- "day6"
names(li_df)[7] <- "day7"

Select the second column, named day2, from li_df: second

second <- li_df[, 2]

Build a logical vector, TRUE if value in second is extreme: extremes

extremes <- second < 5 | second > 25

Count the number of TRUEs in extremes

sum(extremes)

## [1] 16
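
Counting with sum() works because each TRUE counts as 1. The same idea is sketched below on a much smaller, made-up vector (views_day2 and is_extreme are hypothetical names), together with two related uses of a logical vector.

```r
# Hypothetical day-2 view counts for six employees
views_day2 <- c(3, 23, 18, 27, 4, 30)

# TRUE where the count is below 5 or above 25
is_extreme <- views_day2 < 5 | views_day2 > 25

# Number of extreme values
sum(is_extreme)

# Share of extreme values (mean of 0s and 1s)
mean(is_extreme)

# The extreme values themselves, selected with the logical vector
views_day2[is_extreme]
```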

Head over to the next video and learn how relational and logical operators can be used to alter the flow of your R scripts.

Video: Conditional Statements

The if statement

Before diving into some exercises on the if statement, have another look at its syntax:

if (condition) {
  expr
}

Remember your vectors with social profile views? Let’s look at it from another angle. We create a variable called medium which gives information about the social website, and another called num_views which denotes the actual number of views that particular medium had on the last day of your recordings.

Defining these variables related to your last day of recordings

medium <- "LinkedIn"
num_views <- 14

Examine the if statement for medium

if (medium == "LinkedIn") {
  print("Showing LinkedIn information")
}

## [1] "Showing LinkedIn information"

Write the if statement for num_views

if (num_views > 15) {
  print("You are popular!")
}

Try to see what happens if you change the medium and num_views variables and run your code again. Let’s further customize these if statements in the next exercise.

Add an else

You can only use an else statement in combination with an if statement. The else statement does not require a condition; its corresponding code is simply run if all of the preceding conditions in the control structure are FALSE. Here’s a recipe for its usage:

if (condition) {
  expr1
} else {
  expr2
}

It’s important that the else keyword comes on the same line as the closing bracket of the if part!
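
The reason is that, at the top level of a script, R treats the if statement as complete once it reaches the closing bracket, so an else at the start of the next line has nothing to attach to. The sketch below checks this by handing both layouts to parse() as text; the helper parses() is a hypothetical name introduced only for this illustration.

```r
# else on the same line as the closing bracket
good <- "if (TRUE) {\n  1\n} else {\n  2\n}"

# else at the start of a new line
bad <- "if (TRUE) {\n  1\n}\nelse {\n  2\n}"

# Hypothetical helper: TRUE if the code parses, FALSE on a syntax error
parses <- function(code) {
  tryCatch({
    parse(text = code)
    TRUE
  }, error = function(e) FALSE)
}

parses(good)
parses(bad)
```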

We will now extend the if statements that we coded in the previous exercises with the appropriate else statements!

Control structure for medium

if (medium == "LinkedIn") {
  print("Showing LinkedIn information")
} else {
  print("Unknown medium")
}

Control structure for num_views

if (num_views > 15) {
  print("You're popular!")
} else {
  print("Try to be more visible!")
}

You also had Facebook information available, remember? Time to add some more statements to our control structures using else if!

Customize further: else if

The else if statement allows you to further customize your control structure. You can add as many else if statements as you like. Keep in mind that R ignores the remainder of the control structure once a condition has been found that is TRUE and the corresponding expressions have been executed. Here’s an overview of the syntax to freshen your memory:

if (condition1) {
  expr1
} else if (condition2) {
  expr2
} else if (condition3) {
  expr3
} else {
  expr4
}

Again, it’s important that the else if keywords come on the same line as the closing bracket of the previous part of the control construct.

Control structure for medium

if (medium == "LinkedIn") {
  print("Showing LinkedIn information")
} else if (medium == "Facebook") {
  print("Showing Facebook information")
} else {
  print("Unknown medium")
}

## [1] "Showing LinkedIn information"

Control structure for num_views

if (num_views > 15) {
  print("You're popular!")
} else if (num_views <= 15 & num_views > 10) {
  print("Your number of views is average")
} else {
  print("Try to be more visible!")
}

## [1] "Your number of views is average"

Have another look at the second control structure. Because R abandons the control flow as soon as it finds a condition that is met, you can simplify the condition for the else if part in the second construct to num_views > 10.
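
The simplified version might look like the sketch below. It is wrapped in a hypothetical function, describe_views(), so that several values can be tried; the function returns the message instead of printing it.

```r
# Hypothetical helper wrapping the simplified control structure
describe_views <- function(num_views) {
  if (num_views > 15) {
    "You're popular!"
  } else if (num_views > 10) {
    # Reached only when num_views > 15 was FALSE,
    # so num_views <= 15 is already implied here
    "Your number of views is average"
  } else {
    "Try to be more visible!"
  }
}

describe_views(14)
describe_views(20)
describe_views(3)
```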

Question: Else if 2.0

Take control!

In this exercise, you will combine everything that you’ve learned so far: relational operators, logical operators and control constructs. You’ll need it all!

Define li and fb

li <- 15
fb <- 9

These two variables, li and fb, denote the number of profile views your LinkedIn and Facebook profile had on the last day of recordings. Go through the instructions to create R code that generates a ‘social media score’, sms, based on the values of li and fb.

Code the control-flow construct

if (li >= 15 & fb >= 15) {
  sms <- 2 * (li + fb)
} else if (li < 10 & fb < 10) {
  sms <- 0.5 * (li + fb)
} else {
  sms <- li + fb
}

Print the resulting sms to the console

sms

## [1] 24

Feel free to play around some more with your solution by changing the values of li and fb.
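
One convenient way to experiment is to wrap the same construct in a function. The name social_media_score() is a hypothetical choice for this sketch; the logic is copied from the solution above.

```r
# Hypothetical wrapper around the control-flow construct above
social_media_score <- function(li, fb) {
  if (li >= 15 & fb >= 15) {
    2 * (li + fb)       # both profiles popular: double the total
  } else if (li < 10 & fb < 10) {
    0.5 * (li + fb)     # both profiles quiet: halve the total
  } else {
    li + fb             # mixed case: plain total
  }
}

social_media_score(15, 9)    # the values used in the exercise
social_media_score(20, 16)   # both at least 15
social_media_score(4, 8)     # both below 10
```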

2: Loops

Video: While loop

Write a while loop

Let’s get you started with building a while loop from the ground up. Have another look at its recipe:

while (condition) {
  expr
}

Remember that the condition part of this recipe should become FALSE at some point during the execution. Otherwise, the while loop will go on indefinitely.

If your session expires when you run your code, check the body of your while loop carefully.
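
A minimal sketch of a loop that is guaranteed to stop. The variables count, iterations, and max_iterations are hypothetical; the iteration cap is an extra safeguard added for this illustration and is not part of the course recipe.

```r
count <- 3
iterations <- 0
max_iterations <- 100   # hypothetical safety cap

while (count > 0 & iterations < max_iterations) {
  print(paste("count is", count))
  count <- count - 1              # moves the condition toward FALSE
  iterations <- iterations + 1    # tracks how often the body has run
}

count
iterations
```

Forgetting the line that updates count would leave count > 0 TRUE on every pass; with the cap in place the loop would still end after 100 iterations.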

Have a look at the code below; it initializes the speed variable and already provides a while loop template to get you started.

Initialize the speed variable

speed <- 64

Code the while loop

while (speed > 30) {
  print("Slow down!")
  speed <- speed - 7
}

## [1] "Slow down!"
## [1] "Slow down!"
## [1] "Slow down!"
## [1] "Slow down!"
## [1] "Slow down!"

Print out the speed variable

speed

## [1] 29

Throw in more conditionals

In the previous exercise, you simulated the interaction between a driver and a driver’s assistant: When the speed was too high, “Slow down!” got printed out to the console, resulting in a decrease of your speed by 7 units.

There are several ways in which you could make your driver’s assistant more advanced. For example, the assistant could give you different messages based on your speed or provide you with a current speed at a given moment.

A while loop similar to the one you’ve coded in the previous exercise is already available in the editor. It prints out your current speed, but there’s no code that decreases the speed variable yet, which is pretty dangerous. Can you make the appropriate changes?

Note that we’ll need to assign the value of 64 to the variable speed, as it currently has the value 29.

Initialize the speed variable

speed <- 64

Extend/adapt the while loop

while (speed > 30) {
  print(paste("Your speed is",speed))
  if (speed > 48) {
    print("Slow down big time!")
    speed <- speed - 11
  } else {
    print("Slow down!")
    speed <- speed - 6
  }
}

## [1] "Your speed is 64"
## [1] "Slow down big time!"
## [1] "Your speed is 53"
## [1] "Slow down big time!"
## [1] "Your speed is 42"
## [1] "Slow down!"
## [1] "Your speed is 36"
## [1] "Slow down!"

To further improve our driver assistant model, head over to the next exercise!

Stop the while loop: break

There are some very rare situations in which severe speeding is necessary: what if a hurricane is approaching and you have to get away as quickly as possible? You don’t want the driver’s assistant sending you speeding notifications in that scenario, right?

This seems like a great opportunity to include the break statement in the while loop you’ve been working on. Remember that the break statement is a control statement. When R encounters it, the while loop is abandoned completely.

Once again, we begin by initialising the speed variable.

Initialize the speed variable

speed <- 88

Adding a break to our while loop

while (speed > 30) {
  print(paste("Your speed is", speed))
  if (speed > 80) {
    break
  }
  if (speed > 48) {
    print("Slow down big time!")
    speed <- speed - 11
  } else {
    print("Slow down!")
    speed <- speed - 6
  }
}

## [1] "Your speed is 88"
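
The effect of break can also be seen on a simpler counter, sketched here with a hypothetical variable n. The loop is abandoned at the break, and the variable keeps the value it had at that moment, just as speed stays at 88 in the exercise above.

```r
n <- 1

while (n <= 10) {
  if (n == 4) {
    break          # leave the loop immediately
  }
  print(n)
  n <- n + 1
}

# Execution continues here; n was not changed after the break
n
```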

R Programming for Business

Iain Webb

Fall 2020

Module 1: Introduction to R

1: Intro to basics

How it works

Calculate 3 + 4

Calculate 6 + 12

Arithmetic with R

An addition

A subtraction

A multiplication

A division

Exponentiation

Modulo

Variable assignment

Assign the value 42 to x

Print out the value of the variable x

Variable assignment (2)

Assign the value 5 to the variable my_apples

Print out the value of the variable my_apples

Variable assignment (3)

Assign a value to the variable my_oranges

Add these two variables together

Create the variable my_fruit

Apples and oranges

Basic data types in R

Change my_numeric to be 42

Change my_character to be “universe”

Change my_logical to be FALSE

What’s the data type?

Check class of my_numeric

Check class of my_character

Check class of my_logical

2: Vectors

Create a vector

Define the variable vegas

Create a vector (2)

Complete the code for boolean_vector

Create a vector (3)

Poker winnings from Monday to Friday

Roulette winnings from Monday to Friday

Naming a vector

Assign days as names of poker_vector

Assign days as names of roulette_vector

Naming a vector (2)

The variable days_vector

Assign the names of the day to roulette_vector and poker_vector

Calculating total winnings

Take the sum of A_vector and B_vector

Print out total_vector

Calculating total winnings (2)

Assign to total_daily how much you won/lost on each day

Display the daily total

Calculating total winnings (3)

Total winnings with poker

Total winnings with roulette

Total winnings overall

Print out total_week

Comparing total winnings

Check if you realized higher total gains in poker than in roulette

Vector selection: the good times

Define a new variable based on a selection

Vector selection: the good times (2)

Define a new variable based on a selection

Vector selection: the good times (3)

Define a new variable based on a selection

Vector selection: the good times (4)

Select poker results for Monday, Tuesday and Wednesday

Calculate the average of the elements in poker_start

Selection by comparison - Step 1

Which days did you make money on poker?

Print out selection_vector

Selection by comparison - Step 2

Select from poker_vector these days

Advanced selection

Which days did you make money on roulette?

Select from roulette_vector these days

3: Matrices

What’s a matrix?