0. Introduction

This course aims to teach the fundamentals of programming with R, and how to use Shiny to build an online interactive dashboard.

Covidtracker.fr is a notorious illustration of an elaborated interactive dashboard covering the COVID-19 pandemic in France. This could be done with R & Shiny. Here is another example of what can be produced with R.

Dashboards are a great way to display metrics, explore a dataset, and visualize data in an interactive fashion (interactive means that the information presented to the user reacts to the user’s inputs).

Shiny provides a web framework for building interactive web applications with R. It works well on computers, tablets, and mobile devices. It allows to develop and publish a data analysis app.

In this course, you will learn how to create and publish and online interactive dashboard with R (Shiny). In 15 hours, we will learn how to create a basic online interactive dashboard. This course guides you through all the steps required to be able to do so. Your assessment will consist in developing your own app / dashboard (as a team of 4).

Please also consider this course as an opportunity to discover R programming language and to open to all the things you can create with it. For instance, the website supporting this course has been developed with R (using the Markdown package).

Finally, notice that this course is not a course where you can go idle until couple of days before the final exam. It requires continuous preparation, dedication, and practice. Trial and error is key. You must try coding by yourself, most likely your code will not work at first, but you will also learn how to find help and solve coding problems when you are stuck.

There are four sessions. The first session covers sections 0 to 3. The second and third sessions cover sections 4 to 6. The fourth session covers sections 7, 8, and 9.

1. R: A programming language

Computers have their own language called machine code, which tells them what to do. A programming language provides an interface between a programmer and the machine language. When you write a code in the R programming language, you write a code that tells the machine what to do. The code is then compiled, which turns it into machine code the computer understands.

R is a programming language and a free software. It possesses an extensive catalog of statistical and graphical methods. It includes machine learning algorithms, linear regression, time series, statistical inference to name a few. R is one of the dominant languages for data analysis in the business and finance industries.

To code with R, we will use RStudio. RStudio is an integrated development environment for R, with a console, syntax-highlighting editor that supports direct code execution, and tools for plotting, history, debugging and workspace management. RStudio should be already installed on all the machines, if not please contact the IT service. For more information about RStudio please visit this link. Notice that you can easily install R and RStudio on you own device, to do so, please check this link.

A great thing with R it that many users have already created so-called packages, which are collections of functions you can use to perform specific tasks. For instance, if you want to create a plot, you do not have to write a function from scratch, you can use a package, in that case ggplot2, which already has functions designed to create nice plots. The only thing you have to learn is what are the arguments these functions require, i.e, how to adequately call these functions.

You will find below a simple illustration of what can be achieved with few R code lines. We want to generate random numbers and then create a histogram of them. As you will see, some code lines start with a #, it means that those are comments, there for you to understand what is expected from the code (the machine is not going to read these lines).

library(ggplot2)
#create a data frame with 50 rows and 1 column
df <- data.frame(matrix(ncol = 1, nrow = 50))
#provide the column name
colnames(df) <- c('random_number')
#populate the column with random numbers
df$random_number <- runif(n = 50, min = 1, max = 10)
#plot a histogram of the random numbers with ggplot
ggplot(data = df, aes(x=random_number)) + geom_histogram(binwidth=0.5, color="black", fill="white") + labs(title="Random numbers histogram plot",x="Random numbers", y = "Count") + theme_dark()

Note that, on this website, you can see the output of the R code, it appears below the code. However, it is strongly recommended to also run it on your own machine, using RStudio, because it is on your machine that you can modify the code.

2. Basics of programming with R

First, let us consider the interface of RStudio. As indicated here, there are four main windows. The script editor is where you type the code. By clicking the run button you can run the code. You can also execute R commands straight in the console window (bottom-left). On the top-right, there is the environment window where all the variables and datasets we create are listed. On the bottom-right, we can navigate files, consult the help, show plots, and explore the list of R packages installed on the machine.

Markdown Logo

Markdown Logo

The basics we cover below are also reviewed in the R file available on Blackboard: Section 2.R. As a good practice, please create a working directory to store R files and the data we will use. Create a sub-directory named “R course” in your “Documents” folder. From RStudio, use the menu to change your working directory under Session > Set Working Directory > Choose Directory. Choose the directory you have just created. If it works as intended, you should see in the Files tab in the bottom-right window of RStudio the files present in the folder you have selected.

2.2. Define a variable

A basic concept in programming is called a variable. A variable allows us to store a value (e.g., 5 or “soccer”) or an object (such as a function). We can then later use this variable’s name to easily access the value or the object that is stored within this variable. To assign a value to a variable, we use “<-” or the sign “=”. Consider the below examples:

my_variable <-  5
print(my_variable)
## [1] 5
my_variable =  5
print(my_variable)
## [1] 5
my_variable <-  "soccer"
print(my_variable)
## [1] "soccer"

At any time, in RStudio you can go to the Menu and click the file tab to save your script. Go save as and then select the folder and type the name of the file. It will create and R file: xxx.R.

2.4. Data types

There are numerous data type:

#Decimal values like 4.5 are floats, they belong to numerics.
dec <- 4.5
#Whole numbers like 4 are called integers. Integers are also numerics.
whole <- 9
#Boolean values (TRUE or FALSE) are called logical.
my_value <- TRUE
#Text (or string) values are called characters.
my_string <- "This is a string"

We can check the type of a variable by using the function class(), for instance:

my_value <- TRUE
class (my_value)
## [1] "logical"

2.5. Arithmetic with R

We can easily perform some arithmetic calculations with R.

my_variable1 <-  5+6
print(paste("The result of the addition is", my_variable1,sep=" "))
## [1] "The result of the addition is 11"
my_variable2 <-  5-6
print(paste("The result of the substraction is", my_variable2,sep=" "))
## [1] "The result of the substraction is -1"
my_variable3 <-  5*6
print(paste("The result of the multiplication is", my_variable3,sep=" "))
## [1] "The result of the multiplication is 30"
my_variable4 <-  5/6
print(paste("The result of the division is", my_variable4,sep=" "))
## [1] "The result of the division is 0.833333333333333"
#Return on equity (ROE) calculation
firm_earnings <- 150
firm_equity <- 1000
firm_ROE <- firm_earnings/firm_equity
print(firm_ROE)
## [1] 0.15

2.6. Relational operators

Relational operator (comparators sur as ==, !=, >, <, >=, <=) help us see how one R object relates (compares) to another (are they equal or unequal for instance). The comparison returns TRUE or FALSE:

#equality ==
print (1==1)
## [1] TRUE
#equality ==
var1=1
var2=2
print (var1==var2)
## [1] FALSE
#inequality !=
print (1!=1)
## [1] FALSE
#inequality !=
print (1!=2)
## [1] TRUE
#less than or greater than
print (1<3)
## [1] TRUE
#less than or greater than
print (1>0)
## [1] TRUE

2.7. Logical operators

A logical operator is a symbol or word used to connect two or more objects (AND, OR, etc…):

#AND(&)
print (15 == 15 & 15> 13)
## [1] TRUE
#AND(&)
print (15 == 15 & 15> 16)
## [1] FALSE
#OR(|)
print (15 == 15 | "A"=="A")
## [1] TRUE
#OR(|)
print (15 == 17 | "A"=="A")
## [1] TRUE

2.8. Conditional statements

A conditional statement is a statement that gives an algorithm the ability to make decision (for instance if it rains then show me the TV program). Below is a if statement stating that if the number of views (we assume those are Instagram views for instance) exceeds 15, then the user can be considered as popular…

num_views<-16
if (num_views > 15) {
  print("You are popular!")
}
## [1] "You are popular!"

We can add an else statement to explicitly consider the case where the number of views does not exceed 15:

num_views<-10
if (num_views > 15) {
  print("You are popular!")
} else {
  print("You are not popular!")
}
## [1] "You are not popular!"

If we want to consider several alternatives, we can use else if statements:

num_views<-16
if (num_views <= 10) {
  print("You are really not popular!")
} else if(num_views > 10 & num_views<= 15) {
  print("You are not popular!")
} else if(num_views > 15 & num_views<= 20) {
  print("You are popular!")
} else if(num_views > 20) {
  print("You are really popular!")
}
## [1] "You are popular!"

2.9. Loops

With loops we can keep having the algorithm doing something while a condition holds. In the below example, we use a while loop that reduces the value of the variable speed as long as the condition (speed value > 30) holds:

speed <- 64
while (speed > 30) {
  print(paste("Slow down! Your speed is:", speed, sep=" "))
  speed <- speed - 7
}
## [1] "Slow down! Your speed is: 64"
## [1] "Slow down! Your speed is: 57"
## [1] "Slow down! Your speed is: 50"
## [1] "Slow down! Your speed is: 43"
## [1] "Slow down! Your speed is: 36"

We can combine a while loop with a break. For instance, in the below example we use a while loop to increase speed if speed is greater than 30. We add a break command, so that when the speed exceeds 80, we break the loop and stop the algorithm.

speed <- 35
while (speed > 30) {
  speed<-speed+3
  print(paste("Your speed is", speed))
  # Break the while loop when speed exceeds 80
  if (speed > 80) {
    print("speed it too high, we stop and break the loop")
    break
  }
}
## [1] "Your speed is 38"
## [1] "Your speed is 41"
## [1] "Your speed is 44"
## [1] "Your speed is 47"
## [1] "Your speed is 50"
## [1] "Your speed is 53"
## [1] "Your speed is 56"
## [1] "Your speed is 59"
## [1] "Your speed is 62"
## [1] "Your speed is 65"
## [1] "Your speed is 68"
## [1] "Your speed is 71"
## [1] "Your speed is 74"
## [1] "Your speed is 77"
## [1] "Your speed is 80"
## [1] "Your speed is 83"
## [1] "speed it too high, we stop and break the loop"

There is also a for loop. The latter does an action for each value of a list of values. For instance, below, we create a list of prime numbers, and we then have our code telling the machine, with a for loop, to print each of the prime number included in that list. Notice that there are several ways to make that happen. We are going to see how to create a list very shortly.

primes <- c(2, 3, 5, 7, 11, 13)
for (p in primes) {
  print(p)
}
## [1] 2
## [1] 3
## [1] 5
## [1] 7
## [1] 11
## [1] 13
for (i in 1:length(primes)) {
  print(primes[i])
}
## [1] 2
## [1] 3
## [1] 5
## [1] 7
## [1] 11
## [1] 13

2.10. Vectors

Vectors are one-dimension arrays that can hold numeric data, character data, or logical data. A vector is a simple tool to store data. For example, you can store your daily gains and losses in casinos. To create a vector containing A, B, and C, we use c(A, B, C). We also refer to such vectors as lists.

numeric_vector <- c(1, 2, 3)
print(numeric_vector)
## [1] 1 2 3
character_vector <- c("a", "b", "c", "d")
print(character_vector)
## [1] "a" "b" "c" "d"

It is important to have a clear view on the data we use. Understanding what each element refers to is therefore essential. To help, we can name the elements of a vector.

character_vector <- c("a", "b", "c", "d")
names(character_vector)<- c("First letter", "Second letter", "Third Letter", "Fourth Letter")
print(character_vector)
##  First letter Second letter  Third Letter Fourth Letter 
##           "a"           "b"           "c"           "d"

We can also perform calculations with vectors:

numeric_vectorA <- c(1, 2, 3)
numeric_vectorB <- c(1, 2, 3)
vector_A_plus_B<-numeric_vectorA+numeric_vectorB
print(vector_A_plus_B)
## [1] 2 4 6

We call use the built-in R function sum() to sum of all the values of the elements present in the vector. This function returns the value of the sum.

numeric_vectorA <- c(1, 2, 3)
sum_element_vector_A <- sum(numeric_vectorA)
print(sum_element_vector_A)
## [1] 6

By the same token, we can compute the mean value of the elements of a vector, by using the built-in R function mean().

numeric_vectorA <- c(1, 2, 3)
mean_element_vector_A <- mean(numeric_vectorA)
print(mean_element_vector_A)
## [1] 2

We can select some elements of a vector only. To do so we use brackets, and refer to the position(s) of the elements in the vector.

numeric_vectorX <- c("X", "XX", "ZZZ")
numeric_vectorX23 <- numeric_vectorX[2:3]
print (numeric_vectorX23)
## [1] "XX"  "ZZZ"

We can go through the elements of a vector using a loop, we now know how to do it. In the below example we create a vector linkedin that stores the number of views for different users. We then name the elements of the vector after the user names. We want to print the names of the users for which the number of views is greater than 5:

#create the vector of number of views
linkedin<-c(10,20,5,0,10,20,30,54)
#create the vector of names (of users) corresponding to the number of views
names(linkedin)<- c("Jean Paul", "Mark", "Helmut", "Brigitte", "Larz", "Claire", "Toto", "Cici")
#show the elements of the vector meeting the condition number of views > 5
print(linkedin > 5)
## Jean Paul      Mark    Helmut  Brigitte      Larz    Claire      Toto      Cici 
##      TRUE      TRUE     FALSE     FALSE      TRUE      TRUE      TRUE      TRUE

Second, for each element (number of views) of the vector, we want to check whether the user can be considered as popular, that is whether the number of views is greater than 10 in that case, using a loop this time:

linkedin<-c(10,20,5,0,10,20,30,54)
names(linkedin)<- c("Jean Paul", "Mark", "Helmut", "Brigitte", "Larz", "Claire", "Toto", "Cici")
#loop through all the elements of the vector (list) linkedin
for (i in 1:length(linkedin)) {
  print(linkedin[i])
  if (linkedin[i] > 10) {
    print("You're popular!")
  } else {
    print("Be more visible!")
  }
}
## Jean Paul 
##        10 
## [1] "Be more visible!"
## Mark 
##   20 
## [1] "You're popular!"
## Helmut 
##      5 
## [1] "Be more visible!"
## Brigitte 
##        0 
## [1] "Be more visible!"
## Larz 
##   10 
## [1] "Be more visible!"
## Claire 
##     20 
## [1] "You're popular!"
## Toto 
##   30 
## [1] "You're popular!"
## Cici 
##   54 
## [1] "You're popular!"

2.11. Matrices

A matrix is a collection of elements of the same data type (numeric, character, or logical) arranged into a fixed number of rows and columns. Assume we want to analyze the box office numbers of the Star Wars franchise. For each movie, the first element of each vector indicates the US box office revenue and the second element refers to the Non-US box office.

#We create three vectors of revenues. 
#In each vector, the first number corresponds to the revenue for the US box office, the second number the revenue for the non-US box office.
new_hope <- c(461,314)
empire_strikes <- c(290, 248)
return_jedi <- c(309, 165)

To create a matrix, we call the built-in function matrix(), we must indicate the number of rows we desire. We also set the option byrow to TRUE to fill the matrix by row. That is the vector new_hope will become the first row of the matrix, the vector empire-strikes the second row, and so on. If we want to fill a matrix by column, we must set byrow to FALSE.

# we create a matrix out of the three vectors we already created
star_wars_matrix <- matrix(c(new_hope, empire_strikes, return_jedi), nrow = 3, byrow = TRUE)
print(star_wars_matrix)
##      [,1] [,2]
## [1,]  461  314
## [2,]  290  248
## [3,]  309  165

We then create vectors of regions and titles to name the matrix’s columns and rows. It helps to better see what the numbers are referring to.

regions <- c("US", "non-US")
titles <- c("A New Hope", "The Empire Strikes Back", "Return of the Jedi")
colnames(star_wars_matrix)<-regions
rownames(star_wars_matrix)<-titles
print(star_wars_matrix)
##                          US non-US
## A New Hope              461    314
## The Empire Strikes Back 290    248
## Return of the Jedi      309    165

Alternatively, we can construct the matrix with the right names for columns and rows from the beginning:

star_wars_matrix <- matrix(c(new_hope, empire_strikes, return_jedi), 
                           nrow = 3, byrow = TRUE,
                           dimnames = list(titles, regions))
print(star_wars_matrix)
##                          US non-US
## A New Hope              461    314
## The Empire Strikes Back 290    248
## Return of the Jedi      309    165

We can perform standard operations on the elements of the matrix, like summing up the values stored in rows or/and columns. Below, we compute the total revenue for US and non-US. To do so, we call the built-in function colSums(), which computes the sum of each column’s values.

total_rev_region=colSums(star_wars_matrix)
print(total_rev_region)
##     US non-US 
##   1060    727

We then compute the total revenue per movie. To do so, we call the built-in function rowSums(), which computes the sum of each row’s values:

total_rev_mov=rowSums(star_wars_matrix)
print(total_rev_mov)
##              A New Hope The Empire Strikes Back      Return of the Jedi 
##                     775                     538                     474

We finally compute the sum of the revenues for all movies and regions, using the function sum():

total_rev=sum(star_wars_matrix)
print(total_rev)
## [1] 1787

To add an extra row or column to the matrix, we use the function rbind() or cbind(), respectively. To add the extra column summing US and non US revenues for each movie, we bind this column with the rest of the matrix, by calling the function cbind().

final_matrix_starwars<-cbind(star_wars_matrix,total_rev_mov)
print(final_matrix_starwars)
##                          US non-US total_rev_mov
## A New Hope              461    314           775
## The Empire Strikes Back 290    248           538
## Return of the Jedi      309    165           474

To add the extra row summing revenues for each region, we bind this row with the rest of the matrix, by calling the built-in function rbind().

final_matrix_starwars<-rbind(star_wars_matrix,total_rev_region)
print(final_matrix_starwars)
##                           US non-US
## A New Hope               461    314
## The Empire Strikes Back  290    248
## Return of the Jedi       309    165
## total_rev_region        1060    727

Similar to vectors, we can use the square brackets to select one or multiple elements from a matrix. We use a comma to separate the rows we want to select from the columns. We go:

my_matrix[row_we_want , column_we_want]

To select non-US movies only, we use [,2], that is all the rows of the original matrix but only the column number two:

print(star_wars_matrix[,2])
##              A New Hope The Empire Strikes Back      Return of the Jedi 
##                     314                     248                     165

To select revenues in both regions for the second movie only (The Empire Strikes Back), we use [2,]:

print(star_wars_matrix[2,])
##     US non-US 
##    290    248

To select US revenues for the two first movies only, we use [1:2,1]:

print(star_wars_matrix[1:2,1])
##              A New Hope The Empire Strikes Back 
##                     461                     290

The standard operators like +, -, /, * work in an element-wise way on matrices. We can divide by 10 all the values stored in the star wars matrix for instance:

print(star_wars_matrix/10)
##                           US non-US
## A New Hope              46.1   31.4
## The Empire Strikes Back 29.0   24.8
## Return of the Jedi      30.9   16.5

We can also multiply two matrices. Let us assume that we have two matrices, one with the ticket prices for each movie and region and another one with the number of visitors for each movie and region. We can compute the revenue per region of each movie:

# We first create ticket price matrices
new_hope <- c(5.0, 5.0)
empire_strikes <- c(6.0, 6.0)
return_jedi <- c(7.0, 7.0)
star_wars_matrix_ticket <- matrix(c(new_hope, empire_strikes, return_jedi), 
                           nrow = 3, byrow = TRUE,
                           dimnames = list(titles, regions))
print(star_wars_matrix_ticket)
##                         US non-US
## A New Hope               5      5
## The Empire Strikes Back  6      6
## Return of the Jedi       7      7
# We then create the number of visitors  matrix
new_hope <- c(1000000, 2000000)
empire_strikes <- c(15000000, 5260000)
return_jedi <- c(42100000, 7000000)
star_wars_matrix_vis <- matrix(c(new_hope, empire_strikes, return_jedi), 
                                  nrow = 3, byrow = TRUE,
                                  dimnames = list(titles, regions))
print(star_wars_matrix_vis)
##                               US  non-US
## A New Hope               1000000 2000000
## The Empire Strikes Back 15000000 5260000
## Return of the Jedi      42100000 7000000
# Then, we multiply the matrices
star_wars_matrix_rev<-star_wars_matrix_ticket*star_wars_matrix_vis
print(star_wars_matrix_rev)
##                                US   non-US
## A New Hope                5000000 10000000
## The Empire Strikes Back  90000000 31560000
## Return of the Jedi      294700000 49000000

2.12. Data frames

Key for any data analysis. A data frame has the variables of a dataset as columns and the observations as rows.The advantage of a data frame over a matrix is that whereas all the elements of a matrix must have the same type, only the elements within a column are required to have the same type for a data frame. Different columns can be of different data type. We can mix strings, booleans, and decimal numbers for instance. Most of the functions we have seen for matrices apply to data frames too.

For illustration purposes, we will work with a data frame that is built-in R. This data frame is mtcars. Data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models). For more information about the mtcars dataset please refer to this link. Let us first print the data frame mtcars to see how it looks like:

print(mtcars)
##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

It is often useful to show only a small part of the entire dataset to see its structure, this is what achieves the head () function:

head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Another method that is often used to get a rapid overview of the data in a data frame is the function str(). It shows you the structure of the data set:

str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

We now cover how to create a data frame from scratch. We want to create a data frame of planets, giving some information for each planet of the dataset. We will create a data frame out of vectors. We define a series of vectors and then combine them into a data frame:

# We define the vectors of planet names and planet types
name <- c("Mercury", "Venus", "Earth", 
          "Mars", "Jupiter", "Saturn", 
          "Uranus", "Neptune")
type <- c("Terrestrial planet", 
          "Terrestrial planet", 
          "Terrestrial planet", 
          "Terrestrial planet", "Gas giant", 
          "Gas giant", "Gas giant", "Gas giant")
# Importantly, we define the vector type as a factor, that is as a vector taking a small number of distinct values. A factor variable is a variable used to categorize and store the data. 
type<-factor(type)
diameter <- c(0.382, 0.949, 1, 0.532, 
              11.209, 9.449, 4.007, 3.883)
rotation <- c(58.64, -243.02, 1, 1.03, 
              0.41, 0.43, -0.72, 0.67)
# We create a data frame from the vectors
planets_df <- data.frame(name, type, diameter, rotation)

We explore our data frame:

head(planets_df)
##      name               type diameter rotation
## 1 Mercury Terrestrial planet    0.382    58.64
## 2   Venus Terrestrial planet    0.949  -243.02
## 3   Earth Terrestrial planet    1.000     1.00
## 4    Mars Terrestrial planet    0.532     1.03
## 5 Jupiter          Gas giant   11.209     0.41
## 6  Saturn          Gas giant    9.449     0.43
str(planets_df)
## 'data.frame':    8 obs. of  4 variables:
##  $ name    : chr  "Mercury" "Venus" "Earth" "Mars" ...
##  $ type    : Factor w/ 2 levels "Gas giant","Terrestrial planet": 2 2 2 2 1 1 1 1
##  $ diameter: num  0.382 0.949 1 0.532 11.209 ...
##  $ rotation: num  58.64 -243.02 1 1.03 0.41 ...

We can add a column or a row using the function cbind() and rbind(), respectively. For instance, to add a column about planet rings:

rings_vector<-c(FALSE, FALSE, FALSE, FALSE,  TRUE,  TRUE,  TRUE,  TRUE)
planets_df<-cbind(planets_df,rings_vector)
print(planets_df)
##      name               type diameter rotation rings_vector
## 1 Mercury Terrestrial planet    0.382    58.64        FALSE
## 2   Venus Terrestrial planet    0.949  -243.02        FALSE
## 3   Earth Terrestrial planet    1.000     1.00        FALSE
## 4    Mars Terrestrial planet    0.532     1.03        FALSE
## 5 Jupiter          Gas giant   11.209     0.41         TRUE
## 6  Saturn          Gas giant    9.449     0.43         TRUE
## 7  Uranus          Gas giant    4.007    -0.72         TRUE
## 8 Neptune          Gas giant    3.883     0.67         TRUE

As for vectors and matrices, we can select some elements from a data frame with the help of square brackets. For instance, print out the diameter of Mercury (row 1, column 3):

print(planets_df[1,3])
## [1] 0.382

Print out data for Mars (entire fourth row):

print(planets_df[4,])
##   name               type diameter rotation rings_vector
## 4 Mars Terrestrial planet    0.532     1.03        FALSE

To select a specific column, we can also use its name:

print(planets_df[1:3,"type"])
## [1] Terrestrial planet Terrestrial planet Terrestrial planet
## Levels: Gas giant Terrestrial planet

We can also put a restriction on the row of the data frame we select. For instance, below we show the data frame for planets that have a diameter greater than 1 (and select all the columns):

print(planets_df[planets_df$diameter>1,])
##      name      type diameter rotation rings_vector
## 5 Jupiter Gas giant   11.209     0.41         TRUE
## 6  Saturn Gas giant    9.449     0.43         TRUE
## 7  Uranus Gas giant    4.007    -0.72         TRUE
## 8 Neptune Gas giant    3.883     0.67         TRUE

Show planet names with rings:

planets_df[planets_df$rings_vector==TRUE, "name"]
## [1] "Jupiter" "Saturn"  "Uranus"  "Neptune"

Notice that R has a built-in function to select a subset of a data frame called subset(). See R help for the required arguments and options. For instance, we show below the data only for the planets that have rings:

subset_df<-subset(planets_df, subset = planets_df$rings_vector==TRUE)
print(subset_df)
##      name      type diameter rotation rings_vector
## 5 Jupiter Gas giant   11.209     0.41         TRUE
## 6  Saturn Gas giant    9.449     0.43         TRUE
## 7  Uranus Gas giant    4.007    -0.72         TRUE
## 8 Neptune Gas giant    3.883     0.67         TRUE

we can order the rows of a data frame by the value of a variable. To do so, we use the built-in function order(). For instance, in the below example, we sort planets by diameter.

position<-order(planets_df$diameter)
planets_df[position,]
##      name               type diameter rotation rings_vector
## 1 Mercury Terrestrial planet    0.382    58.64        FALSE
## 4    Mars Terrestrial planet    0.532     1.03        FALSE
## 2   Venus Terrestrial planet    0.949  -243.02        FALSE
## 3   Earth Terrestrial planet    1.000     1.00        FALSE
## 8 Neptune          Gas giant    3.883     0.67         TRUE
## 7  Uranus          Gas giant    4.007    -0.72         TRUE
## 6  Saturn          Gas giant    9.449     0.43         TRUE
## 5 Jupiter          Gas giant   11.209     0.41         TRUE

The order is increasing by default, we can also specify a decreasing order:

position<-order(planets_df$diameter, decreasing=TRUE)
planets_df[position,]
##      name               type diameter rotation rings_vector
## 5 Jupiter          Gas giant   11.209     0.41         TRUE
## 6  Saturn          Gas giant    9.449     0.43         TRUE
## 7  Uranus          Gas giant    4.007    -0.72         TRUE
## 8 Neptune          Gas giant    3.883     0.67         TRUE
## 3   Earth Terrestrial planet    1.000     1.00        FALSE
## 2   Venus Terrestrial planet    0.949  -243.02        FALSE
## 4    Mars Terrestrial planet    0.532     1.03        FALSE
## 1 Mercury Terrestrial planet    0.382    58.64        FALSE

Notice that, at any time, to see the whole content of a dataframe we can go to the right-top window in RStudio and select the Environment tab, there if we double click on the name of our dataframe, RStudio will open a spreadsheet-like window in which we can see the whole content of the dataframe (all the rows, columns, and observations).

2.13. Create a function

A function is a set of statements organized together to perform a specific task. So far, we have relied on pre-existing functions (built in R already), but we can also create our own ones. Let us create a function that checks whether a letter is contained in a string (a word).

To create a function we need to follow a specific syntax. We must name the function and indicate the arguments (variables) it receives between the parentheses, in our case:

  1. letter: the letter we are looking for
  2. word: the word to search in

We also must indicate what the function does within curly brackets. The function checks whether the word provided by the user has the desired letter in it and let the user now by printing the information. We rely on the built-in function grepl() to know whether the letter is present in the word. For more information, refer to the help on grepl() in R or here.

my_function <- function(word,letter) {
test<-grepl(letter, word) 
  if (test == TRUE) {
  print(paste("The word",word, "has the letter",letter,"in it.",sep=" "))
  }
  else
  {
  print(paste("The word",word, "does not have the letter",letter,"in it.",sep=" "))
  }
} 

We now call our function to verify it works as expected:

my_function("alphabet","a")
## [1] "The word alphabet has the letter a in it."
my_function("barbecue","z")
## [1] "The word barbecue does not have the letter z in it."

2.14. If you are stuck…

You are about to tackle a series of exercises. When coding on your own, you will make mistakes and get stuck, that is how one learns programming. When this happens, I recommend going through these steps: + Most of the time functions or variables are misspelled, first thing to double check. + When you are stuck, read the error messages, they give you clues, google them if needed. + Use R help to see which functions are available and what they do, how to call them properly, what type of arguments they require and so on… + Search for alternative ways to achieve the same goal. + Make use of the print() function to see what your variables contain, whether this is what you expect. + If a solution file is provided, carefully compare your code to it to spot your mistakes or clarify a potential misunderstanding. + R is open-source. Sharing coding issues and providing solutions is central to the community. Feel free to refer to forums and other R websites to see whether someone already encountered the issue you face and check the answers provided by the R community.

General way to go: turn a problem into small logical steps, and for each step, do some research on the best way to achieve it with R. For instance, if we want a function that compares two strings to know whether they match, we google it and find the right package/function to do so and then use R help to get to know how to use these functions.

A sustainable and efficient way to learn coding is to develop the ability to find help on the internet and correct our mistakes by oureselves. When we work on our own application or develop one for a company, we cannot expect someone to help us every time we are stuck.

2.15. Exercises

To practice, please consider the following exercises. You should find your own way to solve the problems but a solution is provided for guidance - to see it you have to click the code button below each exercise.

Exercise 1

You are going to work with the data frame mtcars. Recall that it presents, in a data frame, data extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models). You are asked to rank the automobiles by the variable mpg from the highest to the lowest. Then, find the names of the automobiles that have a cylinder (variable cyl) equal or above six only. Finally, compute and print the mean value for each numeric variables of the data frame.

head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Exercise 1 - Possible solution

#order the automobiles by mpg, highest values first
print(mtcars[order(-mtcars$mpg),])
#show automobiles with cylinder >= 6
print(mtcars[mtcars$cyl>=6,])
#compute mean value for each variable and print it
means_of_col <- colMeans(mtcars) 
print(means_of_col)

Exercise 2

Create a function that shows the result of a multiplication by 10 of the number you give as an argument.

Exercise 2 - Possible solution

my_function <- function(number_to_multiply) {
  result <- number_to_multiply*10
  print(result)
  }
my_function(50)

Exercise 3

Find the R function that converts all of the characters in a string to upper case and create a function that takes as argument a word and print it in uppercase letters.

Exercise 3 - Possible solution

my_function <- function(word) {
  uppercase_word <- toupper(word)
  print(uppercase_word)
  }
my_function("test")

Exercise 4

Create a function that prints 5 letters (randomly choosen) out of the word you give as argument.

Exercise 4 - Possible solution

my_function <- function(word) {
  
  #we have done some research, we will use the built-in function runif(). This function generates five random numbers between 1 and the length of the word we give to the function. We round these numbers to get integers.
  vector_random_number <- round(runif(5, 1, nchar(word)))
  print("The five random letters out the word provided are:")
  
  for (i in 1:5) {
    # we use the function substr() to extract the letter for the random position corresponding to the element i of the vector of random numbers.
    letter_to_show <- substr(word,vector_random_number[i], vector_random_number[i])
    print(letter_to_show)
  }
}
my_function("abcdefghijklmnopqrstuvwxyz")

3. Creating a financial model

3.1. Loan simulator

A first usage of R, when it comes to finance, is to design simple financial models. We want to create a mortgage calculator. To build such a model, we need financial functions and a user-friendly interface. We want the model to use some user’s inputs (interest rate, home price, down payment, maturity) to generate some outputs (monthly payment, principal paid, interest paid, total paid).

As a benchmark, consider this mortgage calculator for instance.

We assume we know how to calculate the annuity of a mortgage (monthly payment).

Let us assume that Rose wants to buy a house that costs 100,000 euros. The annual interest rate is 5% a year. She already has some cash for a down payment (50,000 euros), which means that she will have to borrow only 50,000 euros. She wants to borrow money with a maturity of 10 years.

Recall that, by construction, for a loan, the amount borrowed (principal) has to be equal to the present value of all the monthly payments of the borrower:

\[ Principal = Monthly \ Payment \times (1 − (1 + Monthly \ Interest) ^ {-Number \ Months})/ Monthly \ Interest)\]

How do we find the monthly payment? It is equal to (we derive it from the annuity formula):

\[ Monthly \ Interest = 0.05/12 =0.004166 \] \[ Number \ Months = 12 \times 10 = 120 \] \[ Monthly \ Payment = 50000/((1−1.004166^{−120})/0.004166)=530.31\]

Once we have calculated the monthly payment, we can extrapolate how much the borrower will pay to the bank in total:

\[Total\ Payment = 120 \times 530.31 = 63,637\]

Then, we can separate what corresponds to the payment of the principal to what corresponds to the payment of the interests:

\[Principal \ Paid = 50,000\] \[Total \ Interest \ Paid = 63637-50000=13,637\]

We implement a first simple version of the model in the R file available on Blackboard mortgage_simulator.R.

Now, in R, ideally, we want a package to make the annuity calculation, so that it saves us some heavy computation. We also want the user to specify its input values. We are going to see how to make this two things possible now.

3.2. Using a package

Packages in R are a set of R functions, compiled code, and sample data. These are stored under a directory called “library” within the R environment. By default, R installs a group of packages during installation. Once we start the R console, only the default packages are available by default. Other packages that are already installed need to be loaded explicitly to be utilized by the R program. Users have created packages for everything (financial functions, machine learning, web scrapping, creating an app …).

In this subsection, we are going to install and import two packages we need. One that gives us acces to a set of financial functions (FinCal) and another one that allows us to take user inputs (svDialogs). More specifically, in the FinCal package, we are interested in functions that can help us to find the value of an annuity, and in the svDialogs package, we are interested in functions that allows the user to enter her own values for the inputs of our model (house price, interest rate, down payment, maturity).

For more information about both packages, please visit this link and this one.

To install the packages, we use the commands (in the console):

install.packages("svDialogs")
install.packages("FinCal")

Then, to be able to use the functions in this package, we load into our package library the packages at the beginning of the script by calling the function library().

library(svDialogs)
library(FinCal)

We can now use financial functions from FinCal to simplify the computations in our model. We can also receive the inputs from the user using svDialogs. See the R file available on Blackboard mortgage_simulator_V2.R.

3.3. Our first Shiny app

In this course, we are driven by the creation of an online interactive application. We conclude this section by generating an online interactive application using the package required to create online applications: Shiny. More information about the package can be found here.

The R file that creates the app and allows us to publish it online is available on Blackboard mortgage_simulator_app.R.Install the Shiny package before running it. A version is deployed online there.

For the moment, we just want to make sure that you can run this app on your machine, we will have plenty of time later on to describe what the code does and how to use Shiny’s functions to create an app. We will also see how to publish the app.

Command to install the package:

install.packages("Shiny")

Your task for the next time is to read the documentation about the Shiny package and to try to understand what the code does. if you understands it correctly you should be able to publish, at least locally, the pension model presented in section 3.4.

3.4. To practice

Now that you know how to implement a financial model in R, feel free to implement the following pension simulator to practice.

Find below the information needed to build a pension simulator.

Create a model in which a user can select its current age and simulate how much she could get from a retirement account. We assume a retirement age of 62. We assume that, at the beginning of each year until the retirement age, she will make a deposit into the retirement account (we assume this amount to be constant over time). The user has to indicate how much she thinks the account will earn per year (for instance 3%). After retirement at age 62, the user must indicate how many years more she anticipates living to enjoy the money she sets aside on her retirement account. The model will tell the user how much she can withdraw yearly over the duration of her retirement. The account balance will continue to earn the same interest as before.

To help you designing the model, find below a pen & paper solution:

Let us use the following inputs:

  • Current age: 25
  • Annual deposits on the retirement account: 10,000€
  • Annual interest rate earned on the retirement account: 4%
  • Life expectancy: 85
  • Retirement age: 62

In finance terms, we want to find the periodic payment of an annuity. The latter starts somewhere in the future after the retirement account has been loaded with money and have earned some interest.

We first compute the future value of the retirement account by the time the user will reach her retirement age: \[ PV(Retirement \ Account) = 10000((1-(1/(1.04)^{62−25}))/0.04) = 191,426 \] \[ FV(Retirement \ Account) = 191426(1.04)^{62-25} = 817,023 \]

Then, we want to know how much the user can withdraw from that account each year once she will be 62, knowing that she will withdraw money yearly until the age of 85. \[ PV(Withdrawals) = C((1−(1/(1.04)^{85−62}))/0.04) = 817,023 \] \[ C = 817023/(((1−(1/(1.04)^{85−62}))/0.04)) = 54,993 \]

Based on these inputs, the user can expect a yearly pension of 54,993 euros.

You must build a financial model in R based on this set of instructions. A solution file is available on Blackboard under pension_calculator_app.R, deployed online there.

4. Exploring the data

Usually, when it comes to data analysis, we start from a dataset, typically a csv file that we import into R. It is standard to import it as a dataframe, as it is the best way to store variables. In this section, we will see how to find the data we want, import it into R, format it, and then explore it to answer some relevant questions. At first, we will work on COVID datasets. An R file reviewing what we do below is available on Blackboard under Section 4.R.

4.1. Sourcing and importing the data

Let us assume we want to work on COVID data for France. We usually have two options. We can download the data first on our local computer, usually in csv format and then import it to R, or download them directly with R from the Internet source.

In terms of standard open data sources, we can use for instance:

  1. Kaggle offers free data sets to interested data scientists
  2. For French public data, including those on COVID, data.gouv.fr is the platform on which public data is made available.
  3. Harvard College Open Data Project gathered dozens of public Harvard datasets in a free repository.

We go to data.gouv.fr and download data on COVID-related hospitalizations, reanimations, and deaths in France.

We want to download the file: “donnees-hospitalieres-covid19-2021-09-02-19h05.csv”. The link is the following.

Using R, we can download the file directly into a data frame and then explore it. We use the read.csv() function of R.

new_data_frame <- read.csv(file="https://www.data.gouv.fr/fr/datasets/r/63352e38-d353-4b54-bfd1-f1b3ee1cabd7", header=T, sep=";",  encoding="latin1")

Alternatively, we can first download the file on our machine, rename it covid_data.csv, put it in our working directory, and then create a data frame using this csv file. This is a better practice:

covid_dataset <- read.csv("covid_data.csv", header=T, sep=";", encoding="latin1")

You can find the csv data file on Blackboard, under the name covid_data.csv. If you successfuly load the data, you should see it the global enironment window (top-right window in RStudio). If needed, we can explore the entire dataset by double clicking on the name of the variable string the dataset in the environment window.

Information about the variables of this dataset are available in the readme file attached to the data, available here. As you can read, the dataset includes, among others, the following variables:

  1. dep is an integer and code for the department number.
  2. sexe is an integer that code for whether the individuals are males or females.
  3. jour is the date of notice.
  4. hosp is the number of people currently hospitalized.
  5. rea is the number of people currently in resuscitation or critical care.
  6. rad is the total amount of patient that returned home.
  7. dc is the total amount of deaths at the hospital.

We first check what the dataset looks like using the two functions introduced previously: head() and str():

head(covid_dataset)
##   dep sexe       jour hosp rea HospConv SSR_USLD autres rad dc
## 1  01    0 2020-03-18    2   0       NA       NA     NA   1  0
## 2  01    1 2020-03-18    1   0       NA       NA     NA   1  0
## 3  01    2 2020-03-18    1   0       NA       NA     NA   0  0
## 4  02    0 2020-03-18   41  10       NA       NA     NA  18 11
## 5  02    1 2020-03-18   19   4       NA       NA     NA  11  6
## 6  02    2 2020-03-18   22   6       NA       NA     NA   7  5
str(covid_dataset)
## 'data.frame':    220210 obs. of  10 variables:
##  $ dep     : chr  "01" "01" "01" "02" ...
##  $ sexe    : int  0 1 2 0 1 2 0 1 2 0 ...
##  $ jour    : chr  "2020-03-18" "2020-03-18" "2020-03-18" "2020-03-18" ...
##  $ hosp    : int  2 1 1 41 19 22 4 1 3 3 ...
##  $ rea     : int  0 0 0 10 4 6 0 0 0 1 ...
##  $ HospConv: int  NA NA NA NA NA NA NA NA NA NA ...
##  $ SSR_USLD: int  NA NA NA NA NA NA NA NA NA NA ...
##  $ autres  : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ rad     : int  1 1 0 18 11 7 1 0 1 2 ...
##  $ dc      : int  0 0 0 11 6 5 0 0 0 0 ...

4.2. Formatting the data

We create a date out of the time information provided, only a date formatted this way will be recognized by R. To do so, we use the as.Date() function, see documentation here.

covid_dataset$date<-as.Date(covid_dataset$jour, tryFormats = c("%Y-%m-%d"))
head(covid_dataset$date)
## [1] "2020-03-18" "2020-03-18" "2020-03-18" "2020-03-18" "2020-03-18"
## [6] "2020-03-18"

We drop the variables we are not interested in. To do so, we use the built-in function subset().

covid_dataset <- subset(covid_dataset, select = c (dep,sexe,hosp,rea,rad,dc,date))
head(covid_dataset)
##   dep sexe hosp rea rad dc       date
## 1  01    0    2   0   1  0 2020-03-18
## 2  01    1    1   0   1  0 2020-03-18
## 3  01    2    1   0   0  0 2020-03-18
## 4  02    0   41  10  18 11 2020-03-18
## 5  02    1   19   4  11  6 2020-03-18
## 6  02    2   22   6   7  5 2020-03-18

We change some column names so that they are more telling (the names of the variables of the data frame):

colnames(covid_dataset)<-c("department","gender","nb_hospitalizations","nb_reanimations","nb_returned_home","nb_deaths","date")
head(covid_dataset)
##   department gender nb_hospitalizations nb_reanimations nb_returned_home
## 1         01      0                   2               0                1
## 2         01      1                   1               0                1
## 3         01      2                   1               0                0
## 4         02      0                  41              10               18
## 5         02      1                  19               4               11
## 6         02      2                  22               6                7
##   nb_deaths       date
## 1         0 2020-03-18
## 2         0 2020-03-18
## 3         0 2020-03-18
## 4        11 2020-03-18
## 5         6 2020-03-18
## 6         5 2020-03-18

We want to indicate that the gender variable is a factor variable, because we do not want to treat its values as integers (numerical values). Recall that a factor variable is a vector taking a small number of distinct values. A factor variable is a variable used to categorize and store the data. We know from the information file on data.gouv.fr that 0 stands for males & females, 1 stands for males only, and 2 for females only.

covid_dataset$gender<-factor(covid_dataset$gender,labels=c("males & females","males","females"))
head(covid_dataset)
##   department          gender nb_hospitalizations nb_reanimations
## 1         01 males & females                   2               0
## 2         01           males                   1               0
## 3         01         females                   1               0
## 4         02 males & females                  41              10
## 5         02           males                  19               4
## 6         02         females                  22               6
##   nb_returned_home nb_deaths       date
## 1                1         0 2020-03-18
## 2                1         0 2020-03-18
## 3                0         0 2020-03-18
## 4               18        11 2020-03-18
## 5               11         6 2020-03-18
## 6                7         5 2020-03-18

4.3. Summary statistics

We can generate summary statistics of the data by calling the built-in function summary():

summary(covid_dataset)
##   department                    gender      nb_hospitalizations
##  Length:220210      males & females:73644   Min.   :   0.0     
##  Class :character   males          :73644   1st Qu.:  17.0     
##  Mode  :character   females        :72922   Median :  51.0     
##                                             Mean   : 112.2     
##                                             3rd Qu.: 129.0     
##                                             Max.   :3281.0     
##  nb_reanimations  nb_returned_home   nb_deaths           date           
##  Min.   :  0.00   Min.   :    0    Min.   :   0.0   Min.   :2020-03-18  
##  1st Qu.:  2.00   1st Qu.:  271    1st Qu.:  58.0   1st Qu.:2020-09-14  
##  Median :  6.00   Median :  771    Median : 179.0   Median :2021-03-13  
##  Mean   : 16.73   Mean   : 1737    Mean   : 381.7   Mean   :2021-03-13  
##  3rd Qu.: 17.00   3rd Qu.: 1972    3rd Qu.: 455.0   3rd Qu.:2021-09-10  
##  Max.   :855.00   Max.   :34829    Max.   :5662.0   Max.   :2022-03-09

4.4. Analyzing the data with dplyr

Now we have formatted the dataset in a data frame, we can start exploring it in more details. Let us say that we want to know and show the total number of hospitalizations, deaths, or people in critical cares by department or at a specific date.

To do so, we are going to use a package of R that has been created to analyze data frames. The package is dplyr, you can find more information about it here. If the package is not already installed on your machine. You need to install it.

To do so, use the command:

install.packages("dplyr")

Then, to use the functions in this package, we load into our package at the beginning of the script by calling the function library().

library(dplyr)

The main functions of the package are:

  1. select() : Select certain columns of data.
  2. filter() : Filter your data to select specific rows.
  3. arrange() : Arrange the rows of your data into an order.
  4. mutate() : Mutate your data frame to contain new columns, useful to crate new variables.
  5. summarize() : Summarize chunks of you data in some way.
  6. group_by() : group data by a variable, ideally a factor variable.

dplyr works with so called pipes (%>%). Pipes take the output from one function and feed it to the first argument of the next function. This way, we can combine the above functions, as we will see shortly.

To select some variables of a data frame, we can call the dplyr function select(), and using pipes, we go:

#create a subset of the dataframe with some variables only; for instance only the number of hospitalizations and the departments
subset_covid_dataset <- covid_dataset %>% select(c(nb_hospitalizations,department))
head(subset_covid_dataset)
##   nb_hospitalizations department
## 1                   2         01
## 2                   1         01
## 3                   1         01
## 4                  41         02
## 5                  19         02
## 6                  22         02

If we want to show the data for one department only, let us say the department “44”, we can use filter():

#we can show the data onky for one value of a given variable by using filter
filtered_covid_dataset <- covid_dataset %>% filter(department == "44")
head(filtered_covid_dataset)
##   department          gender nb_hospitalizations nb_reanimations
## 1         44 males & females                  18               3
## 2         44           males                  14               3
## 3         44         females                   4               0
## 4         44 males & females                  24               6
## 5         44           males                  17               4
## 6         44         females                   7               2
##   nb_returned_home nb_deaths       date
## 1                7         1 2020-03-18
## 2                3         0 2020-03-18
## 3                4         1 2020-03-18
## 4                9         1 2020-03-19
## 5                5         0 2020-03-19
## 6                4         1 2020-03-19

We can combine these functions to show, for the department 44 and males only, the number of hospitalization as of end of March 2021.

print(covid_dataset %>% filter(department == "44" & date=="2021-03-31" & gender=="males") %>% select(nb_hospitalizations))
##   nb_hospitalizations
## 1                 147

We can also group the data by gender categories and sum across departments, using group_by() and summarize():

new_dataset <-  covid_dataset %>%  filter(date=="2021-03-31")  %>% select(c(nb_hospitalizations,gender)) %>% group_by(gender)  %>% summarize(sum_nb_hosp_by_gender=sum(nb_hospitalizations))
head(new_dataset)
## # A tibble: 3 x 2
##   gender          sum_nb_hosp_by_gender
##   <fct>                           <int>
## 1 males & females                 28463
## 2 males                           14606
## 3 females                         13566

We can create new variables that are computed from the existing ones with mutate():

covid_dataset <- covid_dataset %>%  mutate(ratio_hospitalizations_to_deaths=nb_hospitalizations/nb_deaths)
head(covid_dataset,10)
##    department          gender nb_hospitalizations nb_reanimations
## 1          01 males & females                   2               0
## 2          01           males                   1               0
## 3          01         females                   1               0
## 4          02 males & females                  41              10
## 5          02           males                  19               4
## 6          02         females                  22               6
## 7          03 males & females                   4               0
## 8          03           males                   1               0
## 9          03         females                   3               0
## 10         04 males & females                   3               1
##    nb_returned_home nb_deaths       date ratio_hospitalizations_to_deaths
## 1                 1         0 2020-03-18                              Inf
## 2                 1         0 2020-03-18                              Inf
## 3                 0         0 2020-03-18                              Inf
## 4                18        11 2020-03-18                         3.727273
## 5                11         6 2020-03-18                         3.166667
## 6                 7         5 2020-03-18                         4.400000
## 7                 1         0 2020-03-18                              Inf
## 8                 0         0 2020-03-18                              Inf
## 9                 1         0 2020-03-18                              Inf
## 10                2         0 2020-03-18                              Inf

Of course this cannot be computed for a number of deaths of 0, we can drop instances were our division will fail:

covid_dataset_alt <- covid_dataset %>% filter(nb_deaths>0)  %>% mutate(ratio_hospitalizations_to_deaths=nb_hospitalizations/nb_deaths)
head(covid_dataset_alt,10)
##    department          gender nb_hospitalizations nb_reanimations
## 1          02 males & females                  41              10
## 2          02           males                  19               4
## 3          02         females                  22               6
## 4          06 males & females                  25               1
## 5          06         females                  10               0
## 6          11 males & females                   8               7
## 7          11           males                   6               5
## 8          11         females                   2               2
## 9          13 males & females                  98              11
## 10         13           males                  50               6
##    nb_returned_home nb_deaths       date ratio_hospitalizations_to_deaths
## 1                18        11 2020-03-18                         3.727273
## 2                11         6 2020-03-18                         3.166667
## 3                 7         5 2020-03-18                         4.400000
## 4                47         2 2020-03-18                        12.500000
## 5                28         2 2020-03-18                         5.000000
## 6                 9         3 2020-03-18                         2.666667
## 7                 2         2 2020-03-18                         3.000000
## 8                 7         1 2020-03-18                         2.000000
## 9                59         4 2020-03-18                        24.500000
## 10               35         2 2020-03-18                        25.000000

Finally, we can arrange the data according to the value of one variable using arrange(). For instance, to show where the number of hospitalizations has been the highest for males and females together as of end of March 2021:

head(covid_dataset %>% filter(gender=="males & females"  & date=="2021-03-31")  %>% select(c(department,nb_hospitalizations)) %>% arrange(-nb_hospitalizations))
##   department nb_hospitalizations
## 1         75                1521
## 2         59                1406
## 3         13                1355
## 4         92                1113
## 5         93                1059
## 6         69                 953

4.5. Practice

  1. Show, for females only, the number of reanimations as of end of 2020. A possible solution code is hidden below.
exo_1 <- covid_dataset %>%  filter(gender=="females" & date=="2020-12-31") %>% select(c(nb_reanimations)) 
head(exo_1,10)
  1. Show the time series of total death for males & females for the department 44 for all the dates. A possible solution code is hidden below.
exo_2 <- covid_dataset %>%  filter(gender=="males & females" & department==44) %>% select(c(date,nb_deaths))
head(exo_2,10)
  1. By department, over the entire period, show the ratio of the number of male deaths to the the number of female deaths. A possible solution code is hidden below.
exo_3 <- covid_dataset %>%  filter(gender!="males & females") %>% select(c(department,nb_deaths,gender)) %>% group_by(department,gender) %>% summarize(mysum=sum(nb_deaths))
exo_3 <- exo_3  %>% group_by(department) %>% summarize(ratio = mysum[gender=="males"]/mysum[gender=="females"])
head(exo_3,10)
  1. Show the ten departments with the lowest ratio computed in exo 3. A possible solution code is hidden below.
exo_4 <- exo_3  %>% arrange(ratio)
head(exo_4,10)

4.6. Answering questions

Now that we are better are exploring the data, we can answer the following questions:

  1. What is the month with the highest number of hospitalizations?
  2. Are males more affected than females?
  3. In which department is COVID the dealiest?

A. To know in which month there is the highest number of hospitalizations, we need to do a couple of things. We will use select() to concentrate on the number of hospitalizations, we use filter() to concentrate on end-of-the-month days, and we use summarize() to sum the number of hospitalizations across departments for each month. Then, we sort the information to find the month that records the highest number of hospitalizations.

covid_dataset$mon_yr = format(covid_dataset$date, "%Y-%m") 
first_step <- covid_dataset %>% group_by(mon_yr)  %>%  filter(gender=="males & females" & date == max(date))
head(first_step) 
## # A tibble: 6 x 9
## # Groups:   mon_yr [1]
##   department gender  nb_hospitalizat~ nb_reanimations nb_returned_home nb_deaths
##   <chr>      <fct>              <int>           <int>            <int>     <int>
## 1 01         males ~               72              16               33        10
## 2 02         males ~              144              37              108        51
## 3 03         males ~               36              13               41         5
## 4 04         males ~               20               4               31         2
## 5 05         males ~               42               7               36         1
## 6 06         males ~              132              34              124        23
## # ... with 3 more variables: date <date>,
## #   ratio_hospitalizations_to_deaths <dbl>, mon_yr <chr>
second_step <- first_step %>%  select(nb_hospitalizations,mon_yr)  %>% summarize(total_hosp=sum(nb_hospitalizations))  %>% arrange(-total_hosp)
head(second_step) 
## # A tibble: 6 x 2
##   mon_yr  total_hosp
##   <chr>        <int>
## 1 2022-01      32374
## 2 2021-04      28930
## 3 2021-03      28463
## 4 2020-11      28229
## 5 2021-01      27573
## 6 2020-04      26192

The month with the highest number of hospitalizations is January 2022.

B. We report the total number of deaths as of the last day of our sample period for both males and females and conclude:

answer_B <- covid_dataset %>% filter(gender!="males & females" & date==max(date)) %>% select(c(nb_deaths,gender)) %>% group_by(gender)  %>% summarize(total_deaths=sum(nb_deaths))
print(answer_B)
## # A tibble: 2 x 2
##   gender  total_deaths
##   <fct>          <int>
## 1 males          63758
## 2 females        46917

The total number of deaths is much higher for males.

print(answer_B[1,2]/answer_B[2,2])
##   total_deaths
## 1     1.358953

Males die 1.36 times more from COVID than females.

C. To answer this question, we find out the total number of deaths per department as of the end of our sample period, then we sort by number of deaths.

answer_C <- covid_dataset %>% filter(gender=="males & females" & date==max(date)) %>% select(c(nb_deaths,department))
head(answer_C %>% arrange(-nb_deaths))
##   nb_deaths department
## 1      5662         75
## 2      5243         13
## 3      4796         59
## 4      4009         69
## 5      3596         94
## 6      3303         92

The department 75 has the highest number of deaths.

4.7. Adding data

The above answers are not complete because we ignore what is the population by department and the demographics per department (number of males, females, and class of ages). WouDold our previous conclusions hold if we consider this information? It is much more intersting to know, for 100 inhabitants what the infection rate or the death rate is. Also it would be better to now the department names (more informative than numbers).

We wish to add the population per department to our dataset to be able to scale the number of deaths or hospitalizations. To that end, we download the data from INSEE. The file is available on Blackboard under pop_2021.csv (available on Blackboard). This file also includes the department names.

pop_data <- read.csv("pop2021.csv", sep=";")
#change the column names of the data frame
names(pop_data) <- c("department","name_dep","pop")
head(pop_data)
##   department                name_dep       pop
## 1         01                     Ain   662 244
## 2         02                   Aisne   525 503
## 3         03                  Allier   331 745
## 4         04 Alpes-de-Haute-Provence   165 702
## 5         05            Hautes-Alpes   140 022
## 6         06         Alpes-Maritimes 1 089 270

Because the population data is coded as a string with some spaces between numbers, R cannot interpret it as an integer. We need to remove spaces between numbers - to do so we nee to load the package stringr, and, in particular, use the function str_replace_all(). Once we have removed the spaces, we can convert the strings into numerical values, using the R function as.numeric():

library(stringr)
pop_data$pop<-str_replace_all(pop_data$pop, pattern=" ", repl="")
pop_data$pop<-as.numeric(pop_data$pop)
head(pop_data)
##   department                name_dep     pop
## 1         01                     Ain  662244
## 2         02                   Aisne  525503
## 3         03                  Allier  331745
## 4         04 Alpes-de-Haute-Provence  165702
## 5         05            Hautes-Alpes  140022
## 6         06         Alpes-Maritimes 1089270

Next, we merge the two datasets we have. That is, we add the departments’ populations and names to the COVID dataset. To do so, we use the dplyr function merge(). More information on the function can be found here, pay attention to the syntax:

merged_covid_dataset <- merge(covid_dataset, pop_data, by.x = "department", by.y = "department")
head(merged_covid_dataset)
##   department          gender nb_hospitalizations nb_reanimations
## 1         01 males & females                 194              22
## 2         01           males                 107              20
## 3         01 males & females                  85               4
## 4         01           males                 112              23
## 5         01 males & females                 168              20
## 6         01 males & females                 184              22
##   nb_returned_home nb_deaths       date ratio_hospitalizations_to_deaths
## 1             2493       575 2021-04-16                        0.3373913
## 2             1260       346 2021-04-23                        0.3092486
## 3             2865       621 2021-06-25                        0.1368760
## 4             1239       343 2021-04-20                        0.3265306
## 5             2664       593 2021-05-04                        0.2833052
## 6             2630       591 2021-05-01                        0.3113367
##    mon_yr name_dep    pop
## 1 2021-04      Ain 662244
## 2 2021-04      Ain 662244
## 3 2021-06      Ain 662244
## 4 2021-04      Ain 662244
## 5 2021-05      Ain 662244
## 6 2021-05      Ain 662244

We can now refine our answers to the previous questions, based on the number of hospitalizations or deaths per inhabitants at the department level:

  1. In which department has COVID been the deadliest?

C. We make a change to account for the population:

answer_C <- merged_covid_dataset %>% filter(gender=="males & females" & date==max(date)) %>% select(c(nb_deaths,name_dep, pop))  %>% mutate(ratio=nb_deaths/pop)
head(answer_C %>% arrange(-ratio))
##   nb_deaths              name_dep     pop       ratio
## 1       774 Territoire de Belfort  138589 0.005584859
## 2       994                Vosges  358823 0.002770168
## 3      2817               Moselle 1038440 0.002712723
## 4      1009           Guadeloupe   375693 0.002685703
## 5      5662                 Paris 2142366 0.002642872
## 6       444           Haute-Marne  168231 0.002639228

It is now Belfort and not Paris anymore.

4.8. Practice

Finally, we want to add demographis data from the INSEE. More specifically, the share of a department’s population for the age ranges: 0-24, 25-59, 60-74, and 75 and above. You can find the data under rep_pop_age on Blackboard. This allows us to answer another question: are the departments with an older population on average more exposed to deadly COVID cases?

You are asked 1) to import the data into R and merge with the existing covid datasets and 2) to answer the questions.

  1. Merge the data, a hidden solution is provided below:
age_data <- read.csv("rep_pop_age.csv", sep=";", encoding="latin1")
merged_covid_dataset_v2 <- merge(merged_covid_dataset, age_data, by.x = "department", by.y = "department")
head(merged_covid_dataset_v2)
  1. Are the department with an older population on average more exposed to deadly COVID cases? A hidden solution is provided below:
top_5 = merged_covid_dataset_v2 %>% filter(gender=="males & females"  & date==max(date)) %>%  mutate(death_ratio=nb_deaths/pop)  %>% arrange(-share_75_plus)  %>% select(department,share_75_plus,death_ratio)
head(top_5)
bottom_5 = merged_covid_dataset_v2 %>% filter(gender=="males & females"  & date==max(date)) %>%  mutate(death_ratio=nb_deaths/pop)  %>% arrange(share_75_plus)  %>% select(department,share_75_plus,death_ratio)
head(bottom_5)
#the link is not obvious

Go to Kaggle, register, and download a dataset (.csv), load it into a data frame with R, and explore it with dplyr (i.e., generate summary statistics). For instance, you can find the Forbes list of worlds billionaires in 2018 there. It includes some information for each billionaire in the list. From there, you can produce some statistics by country, gender, or industry. A solution file is provided on Blackboard: kaggle_data.R. The dataset is also directly available on Blackboard: kaggle_billionaires.csv.

5. Visualizing the data

Sometimes, to answer a question, a graph is better than a table. We will now review ways to create plots in R, for instance to show the number of deaths over time in the form of a line plot, or to show, with bar plots, the total number of deaths per department

We restart from the merged COVID sample (that contains the population data). See the R file Section 5.R available on Blackboard.

5.1. Generate plots with ggplot

We use the package ggplot2 to map variables to nice graphics. More information about the package can be found here. We start by installing and loading the package’s functions to R.

install.packages("ggplot2")
library(ggplot2)

Now, we can use the package’s functions. Here you can find a great cheatsheet summarizing the key functions, what they are their arguments, and how to call them.

Let us start with something simple. We want to show the death rate for males and females together over time on a chart. To that end, we use first dplyr to create the measure we are after and then we construct a plot with ggplot. To change parameters or add actions to a plot created with ggplot, we use the sign “+”.

#create the dataframe with what we are after
death_rate_over_time = merged_covid_dataset %>% filter(gender=="males & females") %>%
  group_by(date) %>% summarize (tot_death=sum(nb_deaths), tot_pop=sum(pop)) %>% mutate(death_ratio=tot_death/tot_pop) %>% select(date, death_ratio)
head(death_rate_over_time)
## # A tibble: 6 x 2
##   date       death_ratio
##   <date>           <dbl>
## 1 2020-03-18  0.00000323
## 2 2020-03-19  0.00000485
## 3 2020-03-20  0.00000668
## 4 2020-03-21  0.00000779
## 5 2020-03-22  0.00000938
## 6 2020-03-23  0.0000128
#create a line plot with ggplot
ggplot(data=death_rate_over_time,aes(x=date,y=death_ratio))+ geom_line()

The change in the death rate may be more informative to find specific periods were the COVID was more lethal. lag() refers to the observation in the previous period, we make use of it to compute the change in the death rate:

#the change in the death rate may be more informative to find specific periods were the COVID was lethal
death_rate_over_time = death_rate_over_time %>% mutate(chg_death_ratio = death_ratio-lag(death_ratio))
#create a line plot with ggplot
ggplot(data=death_rate_over_time,aes(x=date,y=chg_death_ratio))+ geom_line()
## Warning: Removed 1 row(s) containing missing values (geom_path).

We now redo the graph but with a monthly frequency, as opposed to a daily one:

# monthly frequency
#create year-month flags
death_rate_over_time$mon_yr = format(death_rate_over_time$date, "%Y-%m") 
#for each year-month consider the observation as of the end of the month
death_rate_over_ym <- death_rate_over_time %>% group_by(mon_yr)  %>%  filter(date == max(date))
#create a line plot with ggplot
ggplot(data=death_rate_over_ym,aes(x=date,y=chg_death_ratio))+ geom_line()

We can now add some features to the chart: color

#create a line plot with ggplot
ggplot(data=death_rate_over_ym,aes(x=date,y=chg_death_ratio))+ geom_line() + geom_line(colour = "blue", size = 1) 

We can now add some feature to the chart: theme

#create a line plot with ggplot
ggplot(data=death_rate_over_ym,aes(x=date,y=chg_death_ratio))+ geom_line() + theme_classic()

We add a title, change the axis labels, add more values on the vertical axis, and show the axis labels vertically.

ggplot(data=death_rate_over_ym,aes(x=date,y=chg_death_ratio))+ geom_line(colour = "blue", size = 1) + theme_classic() + labs(x = "year month", y="Change in death ratio")+ ggtitle("Change in death ratio over months") + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) + theme(plot.title = element_text(hjust = 0.5)) +scale_x_date(date_breaks = "1 month") +scale_y_continuous(n.breaks=30)

Some fancy themes are available in other packages. For instance, we an install and load the package ggthemes.

install.packages("ggthemes")
#We load the package
library(ggthemes)

We can now use the theme The Economist (the magazine):

ggplot(data=death_rate_over_ym,aes(x=date,y=chg_death_ratio))+ geom_line(colour = "blue", size = 1) + theme_economist() + labs(x = "year month", y="Change in death ratio")+ ggtitle("Change in death ratio over months") + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) + theme(plot.title = element_text(hjust = 0.5)) +scale_x_date(date_breaks = "1 month") +scale_y_continuous(n.breaks=30)

The package ggplot offers a variety of other types of charts. Let us show the death rate per department as of the end of the sample period in the form of a bar chart. First, we need to retrieve the information from the COVID dataset using dplyr commands, then we can create and show the graph with ggplot:

#create the dataframe with what we are after
death_ratio_per_department = merged_covid_dataset %>% filter(gender=="males & females" & date==max(date)) %>%
  group_by(department) %>% mutate(death_ratio=nb_deaths/pop) %>% select(name_dep, death_ratio)
## Adding missing grouping variables: `department`
head(death_ratio_per_department)
## # A tibble: 6 x 3
## # Groups:   department [6]
##   department name_dep                death_ratio
##   <chr>      <chr>                         <dbl>
## 1 01         Ain                         0.00112
## 2 02         Aisne                       0.00253
## 3 03         Allier                      0.00219
## 4 04         Alpes-de-Haute-Provence     0.00203
## 5 05         Hautes-Alpes                0.00245
## 6 06         Alpes-Maritimes             0.00205
#use ggplot to create a bar chart
ggplot(death_ratio_per_department, aes(name_dep, death_ratio, fill=death_ratio)) + geom_bar(stat="identity")

As before, we can customize it further:

ggplot(death_ratio_per_department, aes(name_dep, death_ratio, fill=death_ratio)) +
  geom_bar(stat="identity", width=1.5) + theme_classic() +
  scale_fill_continuous(type = "viridis") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  scale_y_continuous(n.breaks=30)+
  labs(x = "Department", y="Death rate") +
  ggtitle("Death rate across departments") + 
  coord_cartesian( ylim=c(0,0.006), expand = FALSE ) +
  theme(axis.text.x=element_text(size=5))
## Warning: position_stack requires non-overlapping x intervals

5.2. Practice

Now that we know how to import data into a data frame, prepare our dataset, and show graphical outputs, let us practice. Based on the same set of data, you are asked to do the following tasks (a possible solution is hidden, you can click the button Code to unhide it).

Exercise 1

Create a bar plot showing the number of deaths by gender (male or female) as of end of March 2021.

new_data <- merged_covid_dataset %>% filter(date=='2021-03-31'& gender!="males & females") %>% select(c(nb_deaths,gender)) %>% group_by(gender) %>% summarize(sum_nb_deaths=sum(nb_deaths))
ggplot(data=new_data,aes(gender,sum_nb_deaths)) + geom_col(aes(fill=gender))+ theme_classic() +  ggtitle("Total number of deaths from COVID as of 2020/03/31")+labs(fill = "Gender categories",x="", y="Number of deaths")

Exercise 2

Show the number of reanimations (as a % of the population) in intensive care for males & females over time in the form of a line plot. Hint: we need to summarize across departments.

new_data2 <- merged_covid_dataset %>% filter(gender=="males & females") %>%  group_by(date) %>% mutate(sum_reanimations=sum(nb_reanimations))  %>% mutate(sum_pop=sum(pop)) %>% mutate(pct_reanimations=(sum_reanimations/sum_pop)*100) %>%  select(c(pct_reanimations,date))

ggplot(data=new_data2,aes(x=date,y=pct_reanimations)) + 
  geom_line(size=1) + theme_economist() + 
  ggtitle("% French population in intensive care over time")+
  theme(plot.title = element_text(hjust = 0.5))+
  scale_x_date(date_breaks = "1 month")+
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))+
  scale_y_continuous(n.breaks=20)+
  theme(axis.title.x = element_blank(),axis.title.y = element_blank())

Exercise 3

Go to Kaggle, register, and download a dataset (.csv), load it into a data frame with R, and explore it with dplyr & ggplot (i.e., generate summary statistics and plots). For instance, you can find the Forbes list of worlds billionaires in 2018 there. It includes some information for each billionaire in the list. From there, you can produce some statistics by country, gender, or industry, and find an appropriate way to plot the information. A solution file is provided on Blackboard: kaggle_data_2.R. The dataset is also directly available on Blackboard: kaggle_billionaires.csv.

6. Maps

Because we have the information by department, representing the information on a map can be a quick way to convey the gist of the data and share some insights. This section is covered by the R file Section 6.R, available on Blackboard.

There is a first package in R that contains a collection of functions that will help us to design a map and custom it based on some values per department that we can find in our dataset. We need to import the package raster to get coordinates of regions and departments.

install.packages("raster")
library(raster)

6.1. Raster package

We can use the package raster to create a map showing the percentage of a department’s population that died from the COVID-19 as of end of our sample period. We restart from our dataframe of death ratios per department as of the end of our sample period:

library(raster)
#we restart from the data per department
head(death_ratio_per_department)
## # A tibble: 6 x 3
## # Groups:   department [6]
##   department name_dep                death_ratio
##   <chr>      <chr>                         <dbl>
## 1 01         Ain                         0.00112
## 2 02         Aisne                       0.00253
## 3 03         Allier                      0.00219
## 4 04         Alpes-de-Haute-Provence     0.00203
## 5 05         Hautes-Alpes                0.00245
## 6 06         Alpes-Maritimes             0.00205
#example with raster
#we load the map data of France - at the department level, see getData(). The object we get is a geospatial data frame.
shapes <- getData(name="GADM", country="FRA", level=2)
#looking at the data, we see that NAME_2 gives the list of the departments in our forme file
head(shapes)
##   GID_0 NAME_0   GID_1               NAME_1 NL_NAME_1     GID_2      NAME_2
## 1   FRA France FRA.1_1 Auvergne-Rhône-Alpes      <NA> FRA.1.1_1         Ain
## 5   FRA France FRA.1_1 Auvergne-Rhône-Alpes      <NA> FRA.1.2_1      Allier
## 6   FRA France FRA.1_1 Auvergne-Rhône-Alpes      <NA> FRA.1.3_1     Ardèche
## 7   FRA France FRA.1_1 Auvergne-Rhône-Alpes      <NA> FRA.1.4_1      Cantal
## 8   FRA France FRA.1_1 Auvergne-Rhône-Alpes      <NA> FRA.1.5_1       Drôme
## 9   FRA France FRA.1_1 Auvergne-Rhône-Alpes      <NA> FRA.1.6_1 Haute-Loire
##      VARNAME_2 NL_NAME_2      TYPE_2  ENGTYPE_2 CC_2 HASC_2
## 1         <NA>      <NA> Département Department   01  FR.AI
## 5 Basses-Alpes      <NA> Département Department   03  FR.AL
## 6         <NA>      <NA> Département Department   07  FR.AH
## 7         <NA>      <NA> Département Department   15  FR.CL
## 8         <NA>      <NA> Département Department   26  FR.DM
## 9         <NA>      <NA> Département Department   43  FR.HL
#in our COVID dataset and in our map data, there are department names we can use to do a matching
#We cannot use the merge function here because shapes is a geospatial data frame, 
#Hence, we use the following procedure to add the COVID data to the map data:
#we create an index linking the department of both datasets. 
#we make us of the match function of R: https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/match
#match() returns a vector of the positions of (first) matches of its first argument in its second.
idx <- match(shapes$NAME_2, death_ratio_per_department$name_dep)
#Do head(idx), idx tells us that the first NAME of shapes$NAME_2 has to be linked to the first name of death_ratio_per_department$name_dep for instance, 
#but the second name of shapes$NAME_2 with the third name of death_ratio_per_department$name_dep
#so IDX gives us the row positions for the department names of shapes in death_ratio_per_department
#we extract the death_to_pop ratio from death_ratio_per_department for the positions given by idx
concordance <- death_ratio_per_department[idx, "death_ratio"]
#and can now add them in the right order to the geospatial data frame
shapes$death_ratio <- concordance
#we select the map colors
couleurs <- colorRampPalette(c('white', 'red'))
#we call spplot to draw a map that shows the death_per_pop by department, using a color code to indicate the severity of the COVID.
spplot(shapes, "death_ratio",col.regions=couleurs(100),  main=list(label="% population of the department that died from COVID-19",cex=.8))

6.2. Leaflet package

To make our map more dynamic; we can create the map information (association between geographical areas and department data) with raster but show it with Leaflet. We use the default map provider of leaflet (OpenStreetMap) that does not require an API key:

library(leaflet)
#we use the same shapes file as we created above (in the raster example)
#we use the default map provider (OpenStreetMap) that does not require an API key
m <- leaflet(shapes)
#create colors
pal <- colorBin("YlOrRd", domain = as.matrix(shapes$death_ratio)*100, bins = 8)
#create labels
shapes$label <- paste(shapes$NAME_2,": ",round(as.matrix(shapes$death_ratio)*100,2),"% deaths from COVID in the pop",sep="")
labels <- shapes$label
#add polygons with colors and labels
m<- m %>% addPolygons(
  fillColor =  ~pal(as.matrix(shapes$death_ratio)*100),
  weight = 2,
  opacity = 1,
  color = "white",
  dashArray = "3",
  fillOpacity = 0.7,
  highlight = highlightOptions(
    weight = 5,
    color = "#666",
    dashArray = "",
    fillOpacity = 0.7,
    bringToFront = TRUE), label = labels, labelOptions = labelOptions(style = list("font-weight" = "normal", padding = "3px 8px"), textsize = "15px",direction = "auto"))
#add legends
m <- m %>% addLegend(pal = pal, values = round(as.matrix(shapes$death_ratio)*100,2), opacity = 0.7,position = "bottomright",title = "% of deaths to pop")
#show leaflet map
m

6.3. World map

Another alternative, especially if we want to show a world map, is the following. Here we want to show information about vaccination progress across countries in the world. We download this dataset - also available on Blackboard under world_vac.csv as of the 2021-10-04. The R file implementing this application is available on Blackboard under world_vac_case.R.

# we load the data
world_vac<-read.csv("world_vac.csv",sep=",")
# isolate what we want to show
world_vac<- world_vac %>% select(-c(Doses.Administered,Doses.per.1000,Vaccine.being.used.in.a.country))
#we use the rename() function of dplyr to change the name of some columns in our data frame (new name = old name)
world_vac<- world_vac %>% rename(pct_pop_vaccinated=Fully.Vaccinated.Population....)
world_vac<- world_vac %>% rename(country=Country)
head(world_vac)
##         country pct_pop_vaccinated
## 1         World               35.3
## 2         China               75.2
## 3         India               18.2
## 4 United States               56.7
## 5        Brazil               44.9
## 6         Japan               61.0
str(world_vac)
## 'data.frame':    207 obs. of  2 variables:
##  $ country           : chr  "World" "China" "India" "United States" ...
##  $ pct_pop_vaccinated: num  35.3 75.2 18.2 56.7 44.9 61 19.7 54.7 64.8 36.3 ...

Now we draw the map. We use the package rworldmap this time because it suits our purpose.

library(rworldmap)
#Then, we join the vaccination data to the world map of the library calling the joinCountryData2Map() function
joinData <- joinCountryData2Map( world_vac,
                                 joinCode = "NAME",
                                 nameJoinColumn = "country")
## 199 codes from your data successfully matched countries in the map
## 8 codes from your data failed to match with a country code in the map
## 44 codes from the map weren't represented in your data
# we drop antarticta zone we dont want to show
joinData <- subset(joinData, continent != "Antarctica")
#we plot the map
theMap <- mapCountryData(joinData, nameColumnToPlot="pct_pop_vaccinated",mapTitle="Percentage of the population vaccinated by country as of 2021-10-04")

6.4. Practice

Now find a dataset on data.gouv or Kaggle that you can use to draw an informative map and try to create and display it using one the packages we have reviewed in this section.

7. Creating an app with Shiny

You can find a good tutorial to prepare this part of the course here. For a more complete introduction to Shiny, please refer to this link, I recommend studying it in advance, before the class.

First, we install and load Shiny if not already done:

install.packages("shiny")
library(shiny)

7.1. A first app with Shiny

To create a Shiny app, a good practice is to create a new directory for your app, and put a single file called app.R in it for instance. To create a new directory use the explorer in the bottom right of RStsudio. We create a new folder called “My App” in our working directory. All the file (images, csv files) we need to use in our app need to be stored in this specific folder.

The code below is also available under app_ex_1.R on Blackboard. We report the code below but the outcome of the code cannot be shown directly in this document, we must run it in RStudio.

#Load the shiny package
library(shiny)
#There is the  user interface (ui) part: we set up what the user will see - we define the HMTL page users will interact with.
#currently, is show Hello, world!
ui <- fluidPage(
  "Hello, world!"
)
#There is the server part: we set up the computations we do and how the app reacts to inputs from an user
#it is currently empty, so the app does not do anything
server <- function(input, output, session) {
}
#we create the APP - it will start the app on your local machine - later on we will see how to publish this app online.
shinyApp(ui, server)

Run the code stored in app_ex_1.R in RStudio to see the app it creates.

The code tells Shiny both how the app should look, and how it should behave. The code defines the user interface (ui) and specifies the behavior of our app by defining a server function (server). Finally, it executes shinyApp(ui, server) to construct and start a Shiny application from ui and server. The app runs locally (http://127.0.0.1) as it is reported in the console.

Now that we know how to create an app that shows some text and are more familiar with the structure of a Shiny page, we are going to create our first online interactive dashboard.

7.2. A first interactive dashboard

We want to add some inputs and outputs to our user interface (ui). We are going to make a very simple app that shows to the user all the variables included in a data frame. The user can then pick the variables she wants information on. We will use the following Shiny functions:

  1. fluidPage() is a layout function that sets up the basic visual structure of the page.
  2. selectInput() is an input control that lets the user interact with the app by providing a value. In this case, the user uses a select box with the label “variables”, we let the user choose one of the variables of the dataset we preload.
  3. verbatimTextOutput() and tableOutput() are output controls that tell Shiny where to put rendered output. verbatimTextOutput() displays text and tableOutput() displays tables.

The steps are the following: we pre-load the dataset, we pre-load the variables names in the select box, the user chooses a variable in the select box, based on the choice of the user the app displays back some information about the variable.

We use the same COVID dataset as before: covid_data.csv. Make sure it is stored in the “My App” folder this time, where the R file that creates the app is also located. We create app_ex_2.R, to be stored in the folder “My App”, and available on Blackboard, as follows:

#Load the shiny and dplyr packages
library(shiny)
library(dplyr)

#we get COVID data from our usual source and format it the way we did before
covid_dataset <- read.csv(file="covid_data.csv",header=T,sep=";")
covid_dataset <- covid_dataset %>% select(c(dep,sexe,jour,hosp,rea,rad,dc))
covid_dataset$date<-as.Date(covid_dataset$jour, tryFormats = c("%d/%m/%Y"))
covid_dataset <- covid_dataset %>% select(c(-jour))
colnames(covid_dataset)<-c("department","gender","Number of hospitalizations","Number of reanimations","Number of returned home","Number of deaths","date")
covid_dataset$gender<-factor(covid_dataset$gender,labels=c("males & females","males","females"))

#create the user interface (ui), this time we specify more things:
ui <- fluidPage(
  #we create a select box containing all the variables of the data frame covid_dataset The name of this select box is "variable".
  varSelectInput("variable", label = "Please choose a variable", covid_dataset),
  #we show the text output called summary that is created by our server function upon an action of the user
  verbatimTextOutput("summary"),
  #we show the table called "table" that is created by our server function upon an action of the user
  tableOutput("table")
)

#we store the server functions here - where we perform calculations based on the user's choices and inputs
server <- function(input, output, session) {

  #we define a reactive, something that reacts to the user's actions. When the user changes something, we select the column of the dataframe based on the value selected by the user in the select box.
  #we define a reactive, something that reacts to the user's actions, that is we create a new dataframe called dataset based on the choices of the user
  dataset <- reactive({
    covid_dataset  %>% select(c(!!input$variable))
  })
  
  #we define a first output that we want to show. We apply the R summary() function to the variable of the dataframe selected by the user, which we have called "variable". We then use renderPrint() function from shiny so that it turns it into a piece of text to show to the user.
  output$summary <- renderPrint({
    summary(dataset())
  })
  
  #we define a second output that we want to show, it is the first five rows of the dataframe, we call head() We use renderTable() function from shiny so that it turn it into table to show to the user.
  output$table <- renderTable({
    head(dataset())
  })
}

#we create the app
shinyApp(ui, server)

Each render{Type} function is designed to produce a particular type of output (e.g., text, tables, or plots), and is often paired with a {type}Output function. For example, in the above app, renderPrint() is paired with verbatimTextOutput() to display a statistical summary with fixed-width (verbatim) text, and renderTable() is paired with tableOutput() to show the data in a table.

Run the code in RStudio to see the app it creates.

7.3. Deploying the app online

What if we want to publish our application online? The easiest way to turn our Shiny app into a web page is to use shinyapps.io, RStudio’s hosting service for Shiny apps. shinyapps.io lets us upload our app straight from our R session to a server hosted by RStudio. We have complete control over our app including server administration tools. To find out more about shinyapps.io, please visit the above link. We need to register there, then log in.

We connect Shiny to the freshly created web space by going to our shinyapps.io account and then retrieving the token information from there. We use the token information as explained here.

Once our publishing account is well configured, we do a test. We will publish on our web space the application app_ex_2.R. Open in RStudio app_ex_2.R, then click the Publish Application button. Select app_ex_2.R and, if needed, the files required for the data analysis (they should be selected by default), then validate.

Pick as a title for the application My_App_2. Once successfully published, you should be able to access the webpage with the following URL: https://YOURUSERNAME.shinyapps.io/My_App_2/. Of course, replace YOURUSERNAME by the user name you have chosen when you created your own account on shinyapps.io.

To see the one hosted on my shinyapps server, click here.

We now know how to create and publish an app (simple interactive dashboard) with R.

A final touch: let us say we want to generate and share a QRcode for our online application, we can easily do that too with R, we use the below code (after installing the package qrcode):

#create qrcode
library("qrcode")
png("qrplot.png")
qrcode_gen("https://agarel86.shinyapps.io/My_App_2/")
dev.off()
#show it
library("imager")
im<-load.image("qrplot.png")
plot(im)

7.4. Adding other functionalities and outputs

So far, the user can screen through the variables of the data frame and obtain summary information about the data. Now we let the user choose the data range of her choice. To do so, we need to add another select box. Based on the dates and the variable selected, we will show the total number by gender in the form of bar chart. We also restrict the variables the user can explore. We create app_ex_3.R, available on Blackboard (with comments). Run the code in RStudio to see the app it creates.

Next, we add another plot that shows a time series over time based on the variable choice. We create app_ex_4.R, available on Blackboard (with comments). Run the code in RStudio to see the app it creates.

7.5. Layout, information, titles, sections, fonts, and colors

We further customize our app. There are many layout options for us to change the way we show options.

We introduce a menu to navigate the outputs and we add some text to indicate what the dashboard is and what it allows the user to do. We use a panel layout to do so. We create app_ex_5.R, available on Blackboard (with comments). Run the code in RStudio to see the app it creates.

Next, we add a sidebar where the user will specify its inputs to concentrate them at the same place. Moreover, we play with the theme of the dashboard using the standard themes available from Shiny. We will also offer the user to exclude department 44 using a checkbox, and restrict our dataset to a total number of deaths lower than a certain threshold that the user will define. We create app_ex_6.R, available on Blackboard (with comments). Run the code in RStudio to see the app it creates.

7.6. Adding a map

Finally, we add map to our app, building on what we have learned. We create app_ex_7.R, available on Blackboard (with comments). To see it online, go there.

8. COVID applications

Now we review how to create and publish and online interactive Dashboard with R (Shiny) from scratch using other sets of data. Then, as your assignment for this course, you will have to create your own application using different data and adequate app functions, Shiny user interface, and Shiny server computations.

Consider the three following applications. For each of them, there is a commented R file that creates the associated Shiny application, the R files and the data are available on Blackboard, and a link to an online version of the app is displayed:

  1. We want to show information about the possible side effects from COVID vaccines. The commented R code is available on Blackboard under case_1.R, and an online version is here. It requires the data file covid side effects US.csv available on Blackboard.

  2. We want to look at the emotion and sentiment conveyed by COVID-related tweets. You can find the commented R code on Blackboard under case_2.R, and an online version is here. It requires the data file tweetid_userid_keyword_sentiments_emotions_France.csv available on Blackboard.

  3. We want to show, for a given city, on a map, all the places where you can get vaccinated. You can find the commented R code on Blackboard under case_3.R, and an online version is here. It requires the data file centres_vaccinations.csv available on Blackboard.

9. Financial applications

We now apply what we have learnt to the creation of financial online applications and dashboards:

9.1. Loan simulator

Check the R file available on Blackboard under mortgage_simulator_app.R. A version is deployed online there.

9.2. Pension simulator

Check the R file available on Blackboard under pension_calculator_app.R.

9.3. CAPM beta calculator

Check the R file available on Blackboard under beta_calculator.R, the dataset used is automatically retrieved from Yahoo Finance based on the ticker provided by the user.

9.4. Portfolio performance

Check the R file available on Blackboard under portfolio_app.R, the dataset used is available on Blackboard under portfolio_data.csv.

9.5. ESG Dashboard for French companies

Check the R file available on Blackboard under ESG_dashboard.R, the dataset used is available on Blackboard under esg_data.csv.

9.6. Show the data collected by Impact

Impact is a platform created by the French government that aims to collect data from listed and large companies on 47 ESG metrics. Companies are asked to registred and fill the data on a voluntary basis. One of the aim of the platform is to provide transparent and public data about the sustainabiltiy of French companies. However, so far, the data comes in the form of an Excel file that you can find here. A lot can be done to make the information more visual and easy to digest. We create an app to share the data on carbon emissions in particular to a broader audience in a digest way.

Check the R file available on Blackboard under impact_results.R, the dataset used is available on Blackboard under impact_raw.csv.

10. Assignment

10.1. Instructions

As a group of four, create your own online interactive dashboard or online application with R. It can be based on any data. A dashboard should allow the user to visualize the data in the most complete and relevant way and respond to the user’s inputs. You can use the data source of your choice, feel free to consider data.gouv.org or Kaggle, they are many other sources.

The groups are already defined and available on Blackboard. The due date is the 21/05/2022.

The marking criteria (out of 20) are:

  1. Completeness : steps from getting the data to publishing the application online - 6 marks.
  2. Data analysis : data used, retreatments, choices given to the user, relevance of the data and of the analyses - 4 marks.
  3. Visuals : quality and appropriateness of the visuals (chart, map) used to render the information - 4 marks.
  4. Originality : Ability to depart from taught material and incorporate new features / analyses / tools - 3 marks.
  5. Readability : readability of the information on the dashboard - 1 mark.
  6. Comments : clarity and relevance of the comments in the code - 1 mark.
  7. Guidance : clarity and relevance of the guidance/instructions given to the user - 1 mark.

The mark is worth 25% of the final grade for the course.

10.2. Example of past students’ projects

Pleased find below some examples of students’ projects:

11. Final exam

Your final exam for the R part counts for 25% of the final grade for this course. It will take the form of a one-hour quizz. You will be presented some code and will have to find what it does and/or whether the statement(s) is true or false.

You can find below some examples. The correct answer is marked in bold.

11.1. Example 1

Which of the below sentences about the following code is False?

my_function = function(an_integer){

alphabet="ABCDEFGHIJKLMNOPQRSTUVWXYZ"

  if (is.numeric(an_integer)==FALSE){
  print("not a numerical value, cannot proceed further")
  }
  else
  {
    if (an_integer!=round(an_integer)){
      print("a numerical value, but not an integer,cannot proceed further")
    }
    else
    {
      if (an_integer>26){
        print("an integer, but greater than 27, I cannot proceed further")
      }
      else
      {
        letter_matching_pos = substr(alphabet,an_integer,an_integer)
        print(paste("The alphabet letter matching the position is:",letter_matching_pos))
      }
    }
    
  }
  
}
  • A. my_function(“1”) prints in the console: “not a numerical value, cannot proceed further”
  • B. my_function(27) prints in the console: The alphabet letter matching the position is: Z
  • C. my_function(23.56) prints in the console: “a numerical value, but not an integer,cannot proceed further”
  • D. my_function(13) prints in the console: “The alphabet letter matching the position is: H”

11.2. Example 2

Which of the below sentences about the following code is true?

a_finder = function(a_word){

cpt=0
word_upper=toupper(a_word)

  for (i in 1:nchar(word_upper)){
  letter = substr(word_upper,i,i)
  
    if(letter=="A"){
    cpt=cpt+1
    }
  
  }

return(cpt)
}
a_finder("ANaNaS")
  • A. a_finder(“ANaNaS”) returns 2.
  • B. a_finder(“ANaNaS”) returns 3.
  • C. a_finder(“ANaNaS”) returns AaA.
  • D. a_finder(“ANaNaS”) returns AAA.

11.3. Example 3

Consider this dataset (z_data):

X name gender age_group
10 Robert male old
20 George male old
10 Tom male young
30 Ana female young
40 Tabatha female old
10 Danielle female old

Using dplyr, we want to retrieve, for males only and by age group, the average value of X. Which one of the following codes is the right one?

  • A.
z_data %>%  filter(gender=="males") %>%  select(c(X, age_group)) %>% group(age_group) %>% summarize(avg_X=mean(X))
z_data %>%  filter(gender=="males") %>%  select(c(X, age_group)) %>% mutate(avg_X=mean(X))
z_data %>%  filter(gender=="males" & age_group=="old") %>%  select(c(X, age_group)) %>% summarize(avg_X=mean(X))
z_data %>%  filter(gender=="males") %>%  select(c(age_group)) %>% group(age_group) %>% summarize(avg_X=sum(X))

12. Contact information

You can contact your lecturer, Alexandre Garel, at . My personal website is there. You can add me on LinkedIn, I often post about ESG ratings and finance research pertaining to ESG metrics and climate-change issues.