0. Introduction

This course aims to teach you the fundamentals of programming with R, and then how to use Shiny to build an online interactive dashboard.

This link shows a well-known example of an elaborate interactive dashboard covering the COVID-19 pandemic in France. This could be done with R & Shiny. Here is another example of what can be produced with R. Another example of a dashboard can be found here.

Dashboards are a great way to display metrics, explore a dataset, and visualize data in an interactive fashion (interactive means that the information presented to the user reacts to the user’s inputs).

Shiny provides a web framework for building interactive web applications with R. It works well on computers, tablets, and mobile devices. It lets you develop and publish a data analysis app.

In this course, you will learn how to create and publish an online interactive dashboard with R (Shiny). In 12 hours, we will learn how to create a basic online interactive dashboard. This course guides you through all the required steps. Your assessment will consist of developing your own dashboard and publishing it (as a team of 3).

Please also consider this course as an opportunity to discover the R programming language and to open up to all the things you can create with it. For instance, the website supporting this course has been developed with R (using the Markdown package).

Finally, notice that this is not a course where you can stay idle until a couple of days before the final exam. It requires continuous preparation, dedication, and practice. Trial and error is key. You must try coding by yourself. Most likely your code will not work at first, but you will also learn how to find help and solve coding problems when you are stuck.

There are three sessions of 4 hours. The first session covers sections 0 to 2 (November 4, 2021). The second session covers section 3 (November 18, 2021). The third session covers sections 4 to 5 (November 25, 2021). Section 6 provides details regarding the group project expected for this course.

1. R: A programming language

Computers have their own language called machine code, which tells them what to do. A programming language provides an interface between a programmer and machine code. When you write code in the R programming language, you write instructions that tell the machine what to do. R is an interpreted language: the R interpreter reads your code and translates it, as it runs, into operations the computer can execute.

R is a programming language and free software. It possesses an extensive catalog of statistical and graphical methods. It includes machine learning algorithms, linear regression, time series, and statistical inference, to name a few. R is one of the dominant languages for data analysis in the business and finance industries.

To code with R, we will use RStudio. RStudio is an integrated development environment for R, with a console, a syntax-highlighting editor that supports direct code execution, and tools for plotting, history, debugging, and workspace management. RStudio should already be installed on all the machines. If not, please contact the IT service. For more information about RStudio please visit this link. Notice that you can easily install R and RStudio on your own device. To do so, please check this link.

A great thing with R is that many users have already created so-called packages, which are collections of functions you can use to perform specific tasks. For instance, if you want to create a plot, you do not have to write a function from scratch. You can use a package, in that case ggplot2, which already has functions designed to create nice plots. The only thing you have to learn is which arguments these functions require, i.e., how to call these functions adequately.
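One way to discover which arguments a function expects is to inspect it from the console. The short sketch below uses the base R functions args() and formals() on runif(), a random-number function that ships with R. The variable name arg_names is only illustrative.

```r
# Show the signature of runif(): its arguments and their default values
args(runif)
# Store the names of the arguments in a vector (arg_names is an illustrative name)
arg_names <- names(formals(runif))
print(arg_names)
## [1] "n"   "min" "max"
# The help page describes each argument in detail; uncomment to open it
# ?runif
```

The same approach should work for functions coming from a package, once the package has been loaded with library().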

You will find below a simple illustration of what can be achieved with a few lines of R code. We want to generate random numbers and then create a histogram of them. As you will see, some lines start with a #. These lines are comments, there to help you understand what the code is expected to do (the machine does not execute them).

library(ggplot2)
#create a data frame with 50 rows and 1 column
df <- data.frame(matrix(ncol = 1, nrow = 50))
#provide the column name
colnames(df) <- c('random_number')
#populate the column with random numbers
df$random_number <- runif(n = 50, min = 1, max = 10)
#plot a histogram of the random numbers with ggplot
ggplot(data = df, aes(x = random_number)) +
  geom_histogram(binwidth = 0.5, color = "black", fill = "white") +
  labs(title = "Random numbers histogram plot", x = "Random numbers", y = "Count") +
  theme_dark()

Note that, on this website, you can see the output of the R code: it appears below the code. However, it is strongly recommended to also run it on your own machine, using RStudio, because it is on your machine that you can modify the code.

2. Basics of programming with R

First, let us consider the interface of RStudio. As indicated here, there are four main windows. The script editor is where you type the code. By clicking the run button you can run the code. You can also execute R commands straight in the console window (bottom-left). On the top-right, there is the environment window where all the variables and datasets we create are listed. On the bottom-right, we can navigate files, consult the help, show plots, and explore the list of R packages installed on the machine.

2.2. Define a variable

A basic concept in programming is called a variable. A variable allows us to store a value (e.g., 5 or “soccer”) or an object (such as a function). We can then later use this variable’s name to easily access the value or the object that is stored within this variable. To assign a value to a variable, we use <-. Consider the below examples:

my_variable <-  5
print(my_variable)
## [1] 5
my_variable <-  "soccer"
print(my_variable)
## [1] "soccer"

At any time in RStudio, you can go to the menu and click the File tab to save your script. Choose Save As, select the folder, and type the name of the file. It will create an R file: xxx.R.

2.4. Data types

There are numerous data types.

#Decimal values like 4.5 are floats, they belong to numerics.
dec <- 4.5
#Whole numbers like 9 are also numerics. R stores them as integers only if we add the suffix L (e.g., 9L).
whole <- 9
#Boolean values (TRUE or FALSE) are called logical.
my_value <- TRUE
#Text (or string) values are called characters.
my_string <- "This is a string"

We can check the type of a variable by using the function class(), for instance:

my_value <- TRUE
class (my_value)
## [1] "logical"
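As a quick check, the sketch below applies class() to one value of each type. Note that a whole number typed without a suffix is stored as a numeric, while the suffix L requests an integer. The variable names are only illustrative.

```r
# class() reports the type of each value
class_decimal <- class(4.5)
print(class_decimal)
## [1] "numeric"
class_whole <- class(9)
print(class_whole)
## [1] "numeric"
# The suffix L asks R to store the value as an integer
class_integer <- class(9L)
print(class_integer)
## [1] "integer"
class_text <- class("This is a string")
print(class_text)
## [1] "character"
# is.numeric() returns TRUE for both decimal and integer values
print(is.numeric(9L))
## [1] TRUE
```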

2.5. Arithmetic with R

We can easily perform some arithmetic calculations with R.

my_variable1 <-  5+6
print(paste("The result of the addition is", my_variable1,sep=" "))
## [1] "The result of the addition is 11"
my_variable2 <-  5-6
print(paste("The result of the subtraction is", my_variable2,sep=" "))
## [1] "The result of the subtraction is -1"
my_variable3 <-  5*6
print(paste("The result of the multiplication is", my_variable3,sep=" "))
## [1] "The result of the multiplication is 30"
my_variable4 <-  5/6
print(paste("The result of the division is", my_variable4,sep=" "))
## [1] "The result of the division is 0.833333333333333"
#Return on equity (ROE) calculation
firm_earnings <- 150
firm_equity <- 1000
firm_ROE <- firm_earnings/firm_equity
print(firm_ROE)
## [1] 0.15
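Beyond the four basic operations, R also provides operators for powers, remainders, and integer division. The sketch below illustrates them with arbitrary numbers. The variable names are only illustrative.

```r
# Exponentiation: 5 to the power of 2
my_power <- 5^2
print(my_power)
## [1] 25
# Remainder of the division of 17 by 5 (modulo)
my_remainder <- 17 %% 5
print(my_remainder)
## [1] 2
# Integer division of 17 by 5
my_quotient <- 17 %/% 5
print(my_quotient)
## [1] 3
```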

2.6. Relational operators

Relational operators (comparators such as ==, !=, >, <, >=, <=) help us see how one R object relates (compares) to another (are they equal or unequal, for instance). The comparison returns TRUE or FALSE:

#equality ==
print (1==1)
## [1] TRUE
#equality ==
var1=1
var2=2
print (var1==var2)
## [1] FALSE
#inequality !=
print (1!=1)
## [1] FALSE
#inequality !=
print (1!=2)
## [1] TRUE
#less than or greater than
print (1<3)
## [1] TRUE
#less than or greater than
print (1>0)
## [1] TRUE
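The comparators >= and <= work the same way, and == can also compare character strings. A short sketch with arbitrary values follows. The variable names are only illustrative.

```r
#less than or equal to
check_le <- 2 <= 1
print(check_le)
## [1] FALSE
#greater than or equal to
check_ge <- 3 >= 3
print(check_ge)
## [1] TRUE
#equality between two character strings
check_text <- "soccer" == "tennis"
print(check_text)
## [1] FALSE
```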

2.7. Logical operators

A logical operator is a symbol or word used to connect two or more objects (AND, OR, etc…):

#AND(&)
print (15 == 15 & 15> 13)
## [1] TRUE
#AND(&)
print (15 == 15 & 15> 16)
## [1] FALSE
#OR(|)
print (15 == 15 | "A"=="A")
## [1] TRUE
#OR(|)
print (15 == 17 | "A"=="A")
## [1] TRUE
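R also has a NOT operator, written !, which turns TRUE into FALSE and FALSE into TRUE. The sketch below shows it alone and combined with AND. The variable names are only illustrative.

```r
#NOT(!)
not_equal <- !(15 == 17)
print(not_equal)
## [1] TRUE
#NOT(!) combined with AND(&)
combined <- !(15 == 17) & 15 > 16
print(combined)
## [1] FALSE
```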

2.8. Conditional statements

A conditional statement is a statement that gives an algorithm the ability to make decisions (for instance, if it rains, then show me the TV program). Below is an if statement stating that if the number of views (we assume those are Instagram views, for instance) exceeds 15, then the user can be considered popular:

num_views<-16
if (num_views > 15) {
  print("You are popular!")
}
## [1] "You are popular!"

We can add an else statement to explicitly consider the case where the number of views does not exceed 15:

num_views<-10
if (num_views > 15) {
  print("You are popular!")
} else {
  print("You are not popular!")
}
## [1] "You are not popular!"

If we want to consider several alternatives, we can use else if statements:

num_views<-16
if (num_views <= 10) {
  print("You are really not popular!")
} else if(num_views > 10 & num_views<= 15) {
  print("You are not popular!")
} else if(num_views > 15 & num_views<= 20) {
  print("You are popular!")
} else if(num_views > 20) {
  print("You are really popular!")
}
## [1] "You are popular!"

2.9. Loops

With loops we can keep having the algorithm doing something while a condition holds. In the below example, we use a while loop that reduces the value of the variable speed as long as the condition (speed value > 30) is met:

speed <- 64
while (speed > 30) {
  print(paste("Slow down! Your speed is:", speed, sep=" "))
  speed <- speed - 7
}
## [1] "Slow down! Your speed is: 64"
## [1] "Slow down! Your speed is: 57"
## [1] "Slow down! Your speed is: 50"
## [1] "Slow down! Your speed is: 43"
## [1] "Slow down! Your speed is: 36"

We can combine a while loop with a break. For instance, in the example below we use a while loop to increase speed as long as speed is greater than 30. We add a break command so that, when the speed exceeds 80, we break the loop and stop the algorithm.

speed <- 35
while (speed > 30) {
  speed<-speed+3
  print(paste("Your speed is", speed))
  # Break the while loop when speed exceeds 80
  if (speed > 80) {
    print("speed is too high, we stop and break the loop")
    break
  }
}
## [1] "Your speed is 38"
## [1] "Your speed is 41"
## [1] "Your speed is 44"
## [1] "Your speed is 47"
## [1] "Your speed is 50"
## [1] "Your speed is 53"
## [1] "Your speed is 56"
## [1] "Your speed is 59"
## [1] "Your speed is 62"
## [1] "Your speed is 65"
## [1] "Your speed is 68"
## [1] "Your speed is 71"
## [1] "Your speed is 74"
## [1] "Your speed is 77"
## [1] "Your speed is 80"
## [1] "Your speed is 83"
## [1] "speed is too high, we stop and break the loop"

There is also a for loop. It performs an action for each value in a collection of values. For instance, below, we create a vector of prime numbers, and we then tell the machine, with a for loop, to print each of the prime numbers included in that vector. Notice that there are several ways to make that happen. We are going to see how to create a vector very shortly.

primes <- c(2, 3, 5, 7, 11, 13)
for (p in primes) {
  print(p)
}
## [1] 2
## [1] 3
## [1] 5
## [1] 7
## [1] 11
## [1] 13
for (i in 1:length(primes)) {
  print(primes[i])
}
## [1] 2
## [1] 3
## [1] 5
## [1] 7
## [1] 11
## [1] 13
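A third way to write the same loop, shown here only as a sketch, relies on the base R function seq_along(), which returns the positions of the elements of a vector. It behaves more safely than 1:length() when the vector is empty. The names empty, bad_positions, and good_positions are only illustrative.

```r
primes <- c(2, 3, 5, 7, 11, 13)
# seq_along() returns the positions 1, 2, ..., length(primes)
for (i in seq_along(primes)) {
  print(primes[i])
}
# With an empty vector, 1:length() counts down from 1 to 0,
# whereas seq_along() yields no position at all
empty <- c()
bad_positions <- 1:length(empty)
good_positions <- seq_along(empty)
print(bad_positions)
## [1] 1 0
print(good_positions)
## integer(0)
```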

2.10. Vectors

Vectors are one-dimensional arrays that can hold numeric data, character data, or logical data. A vector is a simple tool to store data. For example, you can store your daily gains and losses in casinos. To create a vector containing A, B, and C, we use c(A, B, C). Informally, we sometimes refer to such vectors as lists, although in R a list is a distinct data structure.

numeric_vector <- c(1, 2, 3)
print(numeric_vector)
## [1] 1 2 3
character_vector <- c("a", "b", "c", "d")
print(character_vector)
## [1] "a" "b" "c" "d"

It is important to have a clear view on the data that you are using. Understanding what each element refers to is therefore essential. To help, we can name the elements of a vector.

character_vector <- c("a", "b", "c", "d")
names(character_vector)<- c("First letter", "Second letter", "Third Letter", "Fourth Letter")
print(character_vector)
##  First letter Second letter  Third Letter Fourth Letter 
##           "a"           "b"           "c"           "d"

We can also perform calculations with vectors:

numeric_vectorA <- c(1, 2, 3)
numeric_vectorB <- c(1, 2, 3)
vector_A_plus_B<-numeric_vectorA+numeric_vectorB
print(vector_A_plus_B)
## [1] 2 4 6

We can use the built-in R function sum() to sum all the values present in the vector. This function returns the value of the sum.

numeric_vectorA <- c(1, 2, 3)
sum_element_vector_A <- sum(numeric_vectorA)
print(sum_element_vector_A)
## [1] 6

By the same token, we can compute the mean value of the elements of a vector, by using the built-in R function mean().

numeric_vectorA <- c(1, 2, 3)
mean_element_vector_A <- mean(numeric_vectorA)
print(mean_element_vector_A)
## [1] 2

We can select some elements of a vector only. To do so we use brackets, and refer to the position of the elements in the vector.

character_vectorX <- c("X", "XX", "ZZZ")
character_vectorX23 <- character_vectorX[2:3]
print (character_vectorX23)
## [1] "XX"  "ZZZ"

We can go through the elements of a vector using a loop, as we now know how to do. In the example below we create a vector linkedin that stores the number of views of different users. We then name each element of the vector after the user names. First, we check, for each user, whether the number of views is greater than 5:

#create the vector of number of views
linkedin<-c(10,20,5,0,10,20,30,54)
#create the vector of names (of users) corresponding to the number of views
names(linkedin)<- c("Jean Paul", "Mark", "Helmut", "Brigitte", "Larz", "Claire", "Toto", "Cici")
#check, for each element of the vector, whether the condition number of views > 5 is met
print(linkedin > 5)
## Jean Paul      Mark    Helmut  Brigitte      Larz    Claire      Toto      Cici 
##      TRUE      TRUE     FALSE     FALSE      TRUE      TRUE      TRUE      TRUE
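To obtain the names of the users rather than a TRUE/FALSE value per user, one option is to use the logical vector inside square brackets and then call names(). This is a minimal sketch: the variables popular and popular_names are only illustrative.

```r
linkedin <- c(10, 20, 5, 0, 10, 20, 30, 54)
names(linkedin) <- c("Jean Paul", "Mark", "Helmut", "Brigitte", "Larz", "Claire", "Toto", "Cici")
# keep only the elements for which the condition number of views > 5 is TRUE
popular <- linkedin[linkedin > 5]
# extract the names of the remaining elements
popular_names <- names(popular)
print(popular_names)
## [1] "Jean Paul" "Mark"      "Larz"      "Claire"    "Toto"      "Cici"
```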

Second, for each element (number of views) of the vector, we want to check whether the user can be considered popular, that is, whether the number of views is greater than 10, this time using a loop:

linkedin<-c(10,20,5,0,10,20,30,54)
names(linkedin)<- c("Jean Paul", "Mark", "Helmut", "Brigitte", "Larz", "Claire", "Toto", "Cici")
#loop through all the elements of the vector (list) linkedin
for (i in 1:length(linkedin)) {
  print(linkedin[i])
  if (linkedin[i] > 10) {
    print("You're popular!")
  } else {
    print("Be more visible!")
  }
}
## Jean Paul 
##        10 
## [1] "Be more visible!"
## Mark 
##   20 
## [1] "You're popular!"
## Helmut 
##      5 
## [1] "Be more visible!"
## Brigitte 
##        0 
## [1] "Be more visible!"
## Larz 
##   10 
## [1] "Be more visible!"
## Claire 
##     20 
## [1] "You're popular!"
## Toto 
##   30 
## [1] "You're popular!"
## Cici 
##   54 
## [1] "You're popular!"
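The same check can also be written without a loop. The sketch below uses the base R function ifelse(), which applies the condition to every element of the vector at once. The variable name status is only illustrative.

```r
linkedin <- c(10, 20, 5, 0, 10, 20, 30, 54)
names(linkedin) <- c("Jean Paul", "Mark", "Helmut", "Brigitte", "Larz", "Claire", "Toto", "Cici")
# ifelse(condition, value if TRUE, value if FALSE) works element by element
status <- ifelse(linkedin > 10, "You're popular!", "Be more visible!")
print(status)
# count the users considered popular
print(sum(status == "You're popular!"))
## [1] 4
```

The loop version remains useful when each step involves more than choosing between two values.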

2.11. Matrices

A matrix is a collection of elements of the same data type (numeric, character, or logical) arranged into a fixed number of rows and columns. Assume we want to analyze the box office numbers of the Star Wars franchise. We store the numbers of each movie in a vector: the first element indicates the US box office revenue and the second element the non-US box office revenue.

#We create three vectors of revenues. 
#In each vector, the first number corresponds to the revenue for the US box office, the second number the revenue for the non-US box office.
new_hope <- c(461,314)
empire_strikes <- c(290, 248)
return_jedi <- c(309, 165)

To create a matrix, we call the built-in function matrix() and indicate the number of rows we desire. We also set the option byrow to TRUE to fill the matrix by row. That is, the vector new_hope will become the first row of the matrix, the vector empire_strikes the second row, and so on. If we want to fill a matrix by column, we must set byrow to FALSE.

# we create a matrix out of the three vectors we already created
star_wars_matrix <- matrix(c(new_hope, empire_strikes, return_jedi), nrow = 3, byrow = TRUE)
print(star_wars_matrix)
##      [,1] [,2]
## [1,]  461  314
## [2,]  290  248
## [3,]  309  165

We then create vectors of regions and titles, which we use to name the matrix’s columns and rows. It helps to better see what the numbers refer to.

regions <- c("US", "non-US")
titles <- c("A New Hope", "The Empire Strikes Back", "Return of the Jedi")
colnames(star_wars_matrix)<-regions
rownames(star_wars_matrix)<-titles
print(star_wars_matrix)
##                          US non-US
## A New Hope              461    314
## The Empire Strikes Back 290    248
## Return of the Jedi      309    165

Alternatively, we can construct the matrix with the right names for columns and rows from the beginning:

star_wars_matrix <- matrix(c(new_hope, empire_strikes, return_jedi), 
                           nrow = 3, byrow = TRUE,
                           dimnames = list(titles, regions))
print(star_wars_matrix)
##                          US non-US
## A New Hope              461    314
## The Empire Strikes Back 290    248
## Return of the Jedi      309    165

We can perform standard operations on the elements of the matrix, like summing up the values stored in rows or/and columns. Below, we compute the total revenue for US and non-US. To do so, we call the built-in function colSums(), which computes the sum of each column’s values.

total_rev_region=colSums(star_wars_matrix)
print(total_rev_region)
##     US non-US 
##   1060    727

We then compute the total revenue per movie. To do so, we call the built-in function rowSums(), which computes the sum of each row’s values:

total_rev_mov=rowSums(star_wars_matrix)
print(total_rev_mov)
##              A New Hope The Empire Strikes Back      Return of the Jedi 
##                     775                     538                     474

We finally compute the sum of the revenues for all movies and regions, using the function sum():

total_rev=sum(star_wars_matrix)
print(total_rev)
## [1] 1787

To add an extra row or column to the matrix, we use the function rbind() or cbind(), respectively. To add the extra column summing US and non US revenues for each movie, we bind this column with the rest of the matrix, by calling the function cbind().

final_matrix_starwars<-cbind(star_wars_matrix,total_rev_mov)
print(final_matrix_starwars)
##                          US non-US total_rev_mov
## A New Hope              461    314           775
## The Empire Strikes Back 290    248           538
## Return of the Jedi      309    165           474

To add the extra row summing revenues for each region, we bind this row with the rest of the matrix, by calling the built-in function rbind().

final_matrix_starwars<-rbind(star_wars_matrix,total_rev_region)
print(final_matrix_starwars)
##                           US non-US
## A New Hope               461    314
## The Empire Strikes Back  290    248
## Return of the Jedi       309    165
## total_rev_region        1060    727

Similar to vectors, we can use the square brackets to select one or multiple elements from a matrix. We use a comma to separate the rows we want to select from the columns. We go:

my_matrix[row_we_want , column_we_want]

To select non-US revenues only, we use [,2], that is, all the rows of the original matrix but only column number two:

print(star_wars_matrix[,2])
##              A New Hope The Empire Strikes Back      Return of the Jedi 
##                     314                     248                     165

To select revenues in both regions for the second movie only (The Empire Strikes Back), we use [2,]:

print(star_wars_matrix[2,])
##     US non-US 
##    290    248

To select US revenues for the two first movies only, we use [1:2,1]:

print(star_wars_matrix[1:2,1])
##              A New Hope The Empire Strikes Back 
##                     461                     290
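When rows and columns are named, the names can be used instead of the positions. The sketch below rebuilds the same matrix so that it can be run on its own. The variables us_new_hope and non_us are only illustrative.

```r
# Rebuild the box office matrix with named rows and columns
star_wars_matrix <- matrix(c(461, 314, 290, 248, 309, 165),
                           nrow = 3, byrow = TRUE,
                           dimnames = list(c("A New Hope", "The Empire Strikes Back", "Return of the Jedi"),
                                           c("US", "non-US")))
# Select one element by row name and column name
us_new_hope <- star_wars_matrix["A New Hope", "US"]
print(us_new_hope)
## [1] 461
# Select a whole column by name (all rows, column "non-US")
non_us <- star_wars_matrix[, "non-US"]
print(non_us)
```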

The standard operators like +, -, /, * work in an element-wise way on matrices. We can divide by 10 all the values stored in the star wars matrix for instance:

print(star_wars_matrix/10)
##                           US non-US
## A New Hope              46.1   31.4
## The Empire Strikes Back 29.0   24.8
## Return of the Jedi      30.9   16.5

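We can also multiply two matrices element by element. Let us assume that we have two matrices, one with the ticket prices for each movie and region and another one with the number of visitors for each movie and region. We can compute the revenue per region of each movie. Note that * multiplies matching elements; it is not the matrix product of linear algebra, for which R provides the operator %*%.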

# We first create the ticket price matrix
new_hope <- c(5.0, 5.0)
empire_strikes <- c(6.0, 6.0)
return_jedi <- c(7.0, 7.0)
star_wars_matrix_ticket <- matrix(c(new_hope, empire_strikes, return_jedi), 
                           nrow = 3, byrow = TRUE,
                           dimnames = list(titles, regions))
print(star_wars_matrix_ticket)
##                         US non-US
## A New Hope               5      5
## The Empire Strikes Back  6      6
## Return of the Jedi       7      7
# We then create the number of visitors  matrix
new_hope <- c(1000000, 2000000)
empire_strikes <- c(15000000, 5260000)
return_jedi <- c(42100000, 7000000)
star_wars_matrix_vis <- matrix(c(new_hope, empire_strikes, return_jedi), 
                                  nrow = 3, byrow = TRUE,
                                  dimnames = list(titles, regions))
print(star_wars_matrix_vis)
##                               US  non-US
## A New Hope               1000000 2000000
## The Empire Strikes Back 15000000 5260000
## Return of the Jedi      42100000 7000000
# Then, we multiply the matrices element by element
star_wars_matrix_rev<-star_wars_matrix_ticket*star_wars_matrix_vis
print(star_wars_matrix_rev)
##                                US   non-US
## A New Hope                5000000 10000000
## The Empire Strikes Back  90000000 31560000
## Return of the Jedi      294700000 49000000
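To make the difference between the element-wise product (*) and the matrix product of linear algebra (%*%) concrete, here is a small sketch on two arbitrary 2x2 matrices. The matrices A and B are made up for illustration and are not part of the Star Wars data.

```r
A <- matrix(c(1, 2, 3, 4), nrow = 2, byrow = TRUE)
B <- matrix(c(5, 6, 7, 8), nrow = 2, byrow = TRUE)
# Element-wise product: each element of A times the matching element of B
element_wise <- A * B
print(element_wise)
##      [,1] [,2]
## [1,]    5   12
## [2,]   21   32
# Matrix product: rows of A combined with columns of B
matrix_product <- A %*% B
print(matrix_product)
##      [,1] [,2]
## [1,]   19   22
## [2,]   43   50
```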

2.12. Data frames

Data frames are key for any data analysis. A data frame has the variables of a dataset as columns and the observations as rows. The advantage of a data frame over a matrix is that, whereas all the elements of a matrix must have the same type, only the elements within a column are required to have the same type in a data frame. Different columns can be of different data types. We can mix strings, booleans, and decimal numbers, for instance. Most of the functions we have seen for matrices apply to data frames too.

For illustration purposes, we will work with a data frame that is built into R. This data frame is mtcars. The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models). For more information about the mtcars dataset please refer to this link. Let us first print the data frame mtcars to see what it looks like:

print(mtcars)
##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

It is often useful to show only a small part of the entire dataset to see its structure. This is what the head() function does:

head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Another method that is often used to get a rapid overview of the data in a data frame is the function str(). It shows you the structure of the data set:

str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
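The size of a data frame can also be checked directly. The sketch below uses the base R functions dim(), nrow(), ncol(), and names() on mtcars. The variable name mtcars_dim is only illustrative.

```r
# dim() returns the number of rows and the number of columns
mtcars_dim <- dim(mtcars)
print(mtcars_dim)
## [1] 32 11
# nrow() and ncol() return each dimension separately
print(nrow(mtcars))
## [1] 32
print(ncol(mtcars))
## [1] 11
# names() lists the column (variable) names
print(names(mtcars))
```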

We now cover how to create a data frame from scratch. We want to create a data frame of planets, giving some information for each planet of the dataset. We will create a data frame out of vectors. We define a series of vectors and then combine them into a data frame:

# We define the vectors of planet names and planet types
name <- c("Mercury", "Venus", "Earth", 
          "Mars", "Jupiter", "Saturn", 
          "Uranus", "Neptune")
type <- c("Terrestrial planet", 
          "Terrestrial planet", 
          "Terrestrial planet", 
          "Terrestrial planet", "Gas giant", 
          "Gas giant", "Gas giant", "Gas giant")
# Importantly, we define the vector type as a factor, that is as a vector taking a small number of distinct values. A factor variable is a variable used to categorize and store the data. 
type<-factor(type)
diameter <- c(0.382, 0.949, 1, 0.532, 
              11.209, 9.449, 4.007, 3.883)
rotation <- c(58.64, -243.02, 1, 1.03, 
              0.41, 0.43, -0.72, 0.67)
# We create a data frame from the vectors
planets_df <- data.frame(name, type, diameter, rotation)

We explore our data frame:

head(planets_df)
##      name               type diameter rotation
## 1 Mercury Terrestrial planet    0.382    58.64
## 2   Venus Terrestrial planet    0.949  -243.02
## 3   Earth Terrestrial planet    1.000     1.00
## 4    Mars Terrestrial planet    0.532     1.03
## 5 Jupiter          Gas giant   11.209     0.41
## 6  Saturn          Gas giant    9.449     0.43
str(planets_df)
## 'data.frame':    8 obs. of  4 variables:
##  $ name    : chr  "Mercury" "Venus" "Earth" "Mars" ...
##  $ type    : Factor w/ 2 levels "Gas giant","Terrestrial planet": 2 2 2 2 1 1 1 1
##  $ diameter: num  0.382 0.949 1 0.532 11.209 ...
##  $ rotation: num  58.64 -243.02 1 1.03 0.41 ...

We can add a column or a row using the functions cbind() and rbind(), respectively. For instance, to add a column about planet rings:

rings_vector<-c(FALSE, FALSE, FALSE, FALSE,  TRUE,  TRUE,  TRUE,  TRUE)
planets_df<-cbind(planets_df,rings_vector)
print(planets_df)
##      name               type diameter rotation rings_vector
## 1 Mercury Terrestrial planet    0.382    58.64        FALSE
## 2   Venus Terrestrial planet    0.949  -243.02        FALSE
## 3   Earth Terrestrial planet    1.000     1.00        FALSE
## 4    Mars Terrestrial planet    0.532     1.03        FALSE
## 5 Jupiter          Gas giant   11.209     0.41         TRUE
## 6  Saturn          Gas giant    9.449     0.43         TRUE
## 7  Uranus          Gas giant    4.007    -0.72         TRUE
## 8 Neptune          Gas giant    3.883     0.67         TRUE
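A common alternative to cbind() is to assign the new column with the $ operator. The sketch below uses a small, made-up data frame (small_df, with a hypothetical column has_rings) so that it can be run on its own.

```r
# A small illustrative data frame with three planets
small_df <- data.frame(name = c("Earth", "Mars", "Jupiter"),
                       diameter = c(1, 0.532, 11.209))
# Create a new column by assigning a vector to small_df$has_rings
small_df$has_rings <- c(FALSE, FALSE, TRUE)
print(small_df)
##      name diameter has_rings
## 1   Earth    1.000     FALSE
## 2    Mars    0.532     FALSE
## 3 Jupiter   11.209      TRUE
```

With this approach the name of the new column is chosen explicitly, whereas cbind() reuses the name of the vector.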

As for vectors and matrices, we can select some elements from a data frame with the help of square brackets. For instance, print out the diameter of Mercury (row 1, column 3):

print(planets_df[1,3])
## [1] 0.382

Print out data for Mars (entire fourth row):

print(planets_df[4,])
##   name               type diameter rotation rings_vector
## 4 Mars Terrestrial planet    0.532     1.03        FALSE

To select a specific column, we can also use its name:

print(planets_df[1:3,"type"])
## [1] Terrestrial planet Terrestrial planet Terrestrial planet
## Levels: Gas giant Terrestrial planet

We can also put a restriction on the rows of the data frame we select. For instance, below we show the data frame for planets that have a diameter greater than 1 (and select all the columns):

print(planets_df[planets_df$diameter>1,])
##      name      type diameter rotation rings_vector
## 5 Jupiter Gas giant   11.209     0.41         TRUE
## 6  Saturn Gas giant    9.449     0.43         TRUE
## 7  Uranus Gas giant    4.007    -0.72         TRUE
## 8 Neptune Gas giant    3.883     0.67         TRUE

Show planet names with rings:

planets_df[planets_df$rings_vector==TRUE, "name"]
## [1] "Jupiter" "Saturn"  "Uranus"  "Neptune"

Notice that R has a built-in function to select a subset of a data frame called subset(). See R help for the required arguments and options. For instance, we show below the data only for the planets that have rings:

subset_df<-subset(planets_df, subset = planets_df$rings_vector==TRUE)
print(subset_df)
##      name      type diameter rotation rings_vector
## 5 Jupiter Gas giant   11.209     0.41         TRUE
## 6  Saturn Gas giant    9.449     0.43         TRUE
## 7  Uranus Gas giant    4.007    -0.72         TRUE
## 8 Neptune Gas giant    3.883     0.67         TRUE

We can order the rows of a data frame by the value of a variable. To do so, we use the built-in function order(). For instance, in the example below, we sort planets by diameter.

position<-order(planets_df$diameter)
planets_df[position,]
##      name               type diameter rotation rings_vector
## 1 Mercury Terrestrial planet    0.382    58.64        FALSE
## 4    Mars Terrestrial planet    0.532     1.03        FALSE
## 2   Venus Terrestrial planet    0.949  -243.02        FALSE
## 3   Earth Terrestrial planet    1.000     1.00        FALSE
## 8 Neptune          Gas giant    3.883     0.67         TRUE
## 7  Uranus          Gas giant    4.007    -0.72         TRUE
## 6  Saturn          Gas giant    9.449     0.43         TRUE
## 5 Jupiter          Gas giant   11.209     0.41         TRUE

The order is increasing by default. We can also specify a decreasing order:

position<-order(planets_df$diameter, decreasing=TRUE)
planets_df[position,]
##      name               type diameter rotation rings_vector
## 5 Jupiter          Gas giant   11.209     0.41         TRUE
## 6  Saturn          Gas giant    9.449     0.43         TRUE
## 7  Uranus          Gas giant    4.007    -0.72         TRUE
## 8 Neptune          Gas giant    3.883     0.67         TRUE
## 3   Earth Terrestrial planet    1.000     1.00        FALSE
## 2   Venus Terrestrial planet    0.949  -243.02        FALSE
## 4    Mars Terrestrial planet    0.532     1.03        FALSE
## 1 Mercury Terrestrial planet    0.382    58.64        FALSE
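
order() also accepts several sorting keys: the second key is used to break ties in the first one. As a minimal sketch, on a reduced copy of the planets data (the name planets_small is only for illustration), we sort by type first and then by diameter within each type:

```r
# A reduced copy of the planets data (values taken from the output above)
planets_small <- data.frame(
  name = c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune"),
  type = factor(c(rep("Terrestrial planet", 4), rep("Gas giant", 4))),
  diameter = c(0.382, 0.949, 1.000, 0.532, 11.209, 9.449, 4.007, 3.883)
)

# Sort by type first (in the order of the factor levels),
# then by increasing diameter within each type
position <- order(planets_small$type, planets_small$diameter)
sorted_planets <- planets_small[position, ]
print(sorted_planets)
```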

Notice that, at any time, we can see the whole content of a data frame in RStudio. In the top-right pane, select the Environment tab and double-click on the name of the data frame: RStudio opens a spreadsheet-like window showing all its rows and columns.

2.13. Create a function

A function is a set of statements organized together to perform a specific task. So far, we have relied on pre-existing functions (already built into R), but we can also create our own. Let us create a function that checks whether a letter is contained in a string (a word).

To create a function, we need to follow a specific syntax. We must name the function and indicate, between parentheses, the arguments (variables) it receives. In our case:

  1. word: the word to search in
  2. letter: the letter we are looking for

We must also indicate what the function does within curly brackets. The function checks whether the word provided by the user has the desired letter in it and lets the user know by printing the information. We rely on the built-in function grepl() to know whether the letter is present in the word. For more information, refer to the help on grepl() in R or here.

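my_function <- function(word, letter) {
  # grepl() returns TRUE if the letter is found in the word;
  # fixed = TRUE searches for the letter literally, not as a regular expression
  test <- grepl(letter, word, fixed = TRUE)
  if (test) {
    print(paste("The word", word, "has the letter", letter, "in it.", sep=" "))
  } else {
    print(paste("The word", word, "does not have the letter", letter, "in it.", sep=" "))
  }
}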

We now call our function to verify it works as expected:

my_function("alphabet","a")
## [1] "The word alphabet has the letter a in it."
my_function("barbecue","z")
## [1] "The word barbecue does not have the letter z in it."
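
The function above prints its result. A function can also return a value, which lets the caller store and reuse it. Here is a minimal sketch, using the hypothetical name has_letter() (not part of the course files):

```r
# has_letter() is a hypothetical helper: it returns TRUE or FALSE instead of printing
has_letter <- function(word, letter) {
  # fixed = TRUE searches for the letter literally, not as a regular expression
  return(grepl(letter, word, fixed = TRUE))
}

# The returned value can be stored and reused, for instance in a condition
found <- has_letter("alphabet", "a")
if (found) {
  print("Letter found.")
}

# grepl() is vectorized, so several words can be checked in one call
print(has_letter(c("alphabet", "barbecue"), "z"))
```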

2.14. If you are stuck…

You are about to tackle a series of exercises. When coding on your own, you will make mistakes and get stuck: that is how one learns programming. When this happens, I recommend going through these steps:

  • Checklist:
    • Most of the time, a function or variable name is misspelled: this is the first thing to double-check.
    • When you are stuck, read the error messages. They give you clues; google them if needed.
    • Use R help to see which functions are available, what they do, how to call them properly, what type of arguments they require, and so on.
    • Search for alternative ways to achieve the same goal.
    • Make use of the print() function to see what your variables contain and whether this is what you expect.
    • If a solution file is provided, carefully compare your code to it to spot your mistakes or clarify a potential misunderstanding.
    • R is open-source. Sharing coding issues and providing solutions is central to the community. Feel free to refer to forums and other R websites to see whether someone already encountered the issue you face and check the answers provided by the R community.
    • As a last resort, contact your lecturer. If you do so without carefully considering the above options, you cannot expect an answer.

General way to go: turn a problem into small logical steps and, for each step, do some research on the best way to achieve it with R. For instance, if we want a function that compares two strings to know whether they match, we google it to find the right package or function, and then use R help to learn how to use it.

A sustainable and efficient way to learn coding is to develop the ability to find help on the internet and correct your mistakes by yourself. When you work on your own application or develop one for a company, you cannot expect someone to help you every time you are stuck. This ability is key to continue improving your programming skills once you know the basics and are done with this course.

2.15. Exercises

To practice, please consider the following exercises. You should find your own way to solve the problems but a solution is provided for guidance - to see it you have to click the code button below each exercise.

Exercise 1

You are going to work with the data frame mtcars. Recall that it contains data extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models). You are asked to rank the automobiles by the variable mpg from the highest to the lowest. Then, find the names of the automobiles that have six cylinders or more (variable cyl). Finally, compute and print the mean value of each numeric variable of the data frame.

head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Exercise 1 - Possible solution

#order the automobiles by mpg, highest values first
print(mtcars[order(-mtcars$mpg),])
#show the names of the automobiles with cylinder >= 6 (in mtcars, the names are stored as row names)
print(rownames(mtcars)[mtcars$cyl>=6])
#compute mean value for each variable and print it
means_of_col <- colMeans(mtcars) 
print(means_of_col)
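
colMeans() works here because every column of mtcars is numeric. When a data frame mixes numeric and non-numeric columns, the numeric ones have to be selected first. A minimal sketch, using the built-in iris data frame (an assumption for illustration, it is not part of this exercise):

```r
# iris has four numeric columns and one factor column (Species)
numeric_cols <- sapply(iris, is.numeric)

# Keep only the numeric columns before computing the column means
iris_means <- colMeans(iris[, numeric_cols])
print(iris_means)
```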

Exercise 2

Create a function that shows the result of a multiplication by 10 of the number you give as an argument.

Exercise 2 - Possible solution

my_function <- function(number_to_multiply) {
  result <- number_to_multiply*10
  print(result)
  }
my_function(50)

Exercise 3

Find the R function that converts all of the characters in a string to upper case, and create a function that takes a word as argument and prints it in uppercase letters.

Exercise 3 - Possible solution

my_function <- function(word) {
  uppercase_word <- toupper(word)
  print(uppercase_word)
  }
my_function("test")

Exercise 4

Create a function that prints 5 letters (randomly chosen) out of the word you give as argument.

Exercise 4 - Possible solution

my_function <- function(word) {
  
  #we have done some research, we will use the built-in function runif(). This function generates five random numbers between 1 and the length of the word plus 1. We round these numbers down with floor() to get integer positions, so that each position in the word is equally likely.
  vector_random_number <- floor(runif(5, 1, nchar(word) + 1))
  print("The five random letters out of the word provided are:")
  
  for (i in 1:5) {
    # we use the function substr() to extract the letter for the random position corresponding to the element i of the vector of random numbers.
    letter_to_show <- substr(word,vector_random_number[i], vector_random_number[i])
    print(letter_to_show)
  }
}
my_function("abcdefghijklmnopqrstuvwxyz")

3. Data sourcing, treatment, and visualization

From now on, we only name the functions we use. You are then free to use R help or the Internet to find out what these functions do exactly and how to use them; it is a good habit to develop. Also, we are mostly going to work with data frames.

Let us assume we want to work on COVID data for France. We usually have two options: we can first download the data to our local computer, usually in .csv format, and then import it into R, or we can download it directly with R from the Internet source.

In terms of standard data sources, for COVID data for France but also more generally, we have plenty of choices. We can use for instance:

  1. Kaggle, which offers free data sets to interested data scientists.
  2. data.gouv.fr, the platform on which French public data, including those on COVID, is made available.
  3. The Harvard College Open Data Project, which gathered dozens of public Harvard datasets in a free repository.

Whatever our starting point is, once we have the raw data, we have to store it in a data frame to be able to treat, explore, and visualize it.

To do so, we are going to use packages created by R users. They include functions that are not built into R and that can be very useful to analyze and visualize the data of a data frame. We will see how to install and work with the following packages: dplyr, ggplot2, raster, and leaflet.

All the below pieces of code are available on Blackboard in the R files sections_31_37.R and section_38.R.

3.1. Importing a dataset

We go to data.gouv.fr and download data on COVID-related hospitalizations, reanimations, and deaths in France.

We want to download the file: “donnees-hospitalieres-covid19-2021-09-02-19h05.csv”. The link is the following.

Using R, we can download the file directly into a data frame and then explore it. We use the read.csv() function of R.

new_data_frame <- read.csv(file="https://www.data.gouv.fr/fr/datasets/r/63352e38-d353-4b54-bfd1-f1b3ee1cabd7", header=TRUE, sep=";")

Alternatively, we can first download the file to our machine, rename it for instance, and then create a data frame from this csv file. This is a better practice, since the online file may be updated or moved while a local copy stays the same:

new_data_frame <- read.csv("covid_data.csv", header=TRUE, sep=";")

You can find the csv data file on Blackboard, under the name covid_data.csv.

Information about the variables of this dataset is available in the readme file attached to the data, available here. As you can read, the dataset includes, among others, the following variables:

  1. dep is the department code (read by R as a character string, as the str() output below shows).
  2. sexe is an integer that codes for whether the individuals are males or females.
  3. jour is the date of notice.
  4. hosp is the number of people currently hospitalized.
  5. rea is the number of people currently in resuscitation or critical care.
  6. rad is the total number of patients who returned home.
  7. dc is the total number of deaths at the hospital.

Notice that, for each date, the file reports either a count at that point in time (hosp, rea) or a cumulative total up to that date (rad, dc). Hence, it is okay to sum values across departments, for instance, but it does not make sense to sum them across dates. We first check what the dataset looks like using the two functions introduced previously: head() and str():

head(new_data_frame)
##   dep sexe       jour hosp rea rad dc
## 1   1    0 18/03/2020    2   0   1  0
## 2   1    1 18/03/2020    1   0   1  0
## 3   1    2 18/03/2020    1   0   0  0
## 4   2    0 18/03/2020   41  10  18 11
## 5   2    1 18/03/2020   19   4  11  6
## 6   2    2 18/03/2020   22   6   7  5
str(new_data_frame)
## 'data.frame':    172630 obs. of  7 variables:
##  $ dep : chr  "1" "1" "1" "2" ...
##  $ sexe: int  0 1 2 0 1 2 0 1 2 0 ...
##  $ jour: chr  "18/03/2020" "18/03/2020" "18/03/2020" "18/03/2020" ...
##  $ hosp: int  2 1 1 41 19 22 4 1 3 3 ...
##  $ rea : int  0 0 0 10 4 6 0 0 0 1 ...
##  $ rad : int  1 1 0 18 11 7 1 0 1 2 ...
##  $ dc  : int  0 0 0 11 6 5 0 0 0 0 ...

3.2. Formatting the data

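We create a date out of the time information provided. The jour column is read as text, and R only treats it as a date once it has been converted to the Date class. To do so, we use the as.Date() function and indicate the format of the input (day/month/year), see documentation here.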

new_data_frame$date<-as.Date(new_data_frame$jour, tryFormats = c("%d/%m/%Y"))
head(new_data_frame$date)
## [1] "2020-03-18" "2020-03-18" "2020-03-18" "2020-03-18" "2020-03-18"
## [6] "2020-03-18"
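
To illustrate why the format matters, the sketch below parses day/month/year strings with the matching format and then with a format that does not match. The variable names raw_dates and parsed are only for illustration.

```r
# Dates written as day/month/year, as in the jour column
raw_dates <- c("18/03/2020", "04/10/2021")

# The format string must describe the input:
# %d = day, %m = month, %Y = year with four digits
parsed <- as.Date(raw_dates, format = "%d/%m/%Y")
print(parsed)
print(class(parsed))

# Once parsed, dates can be compared and subtracted
print(parsed[2] > parsed[1])
print(parsed[2] - parsed[1])

# With a format that does not match the input, as.Date() returns NA
print(as.Date("18/03/2020", format = "%Y-%m-%d"))
```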

We drop the variables we no longer need (here jour, now replaced by date). To do so, we use the built-in function subset() with its select argument.

new_data_frame <- subset(new_data_frame, select = c (dep,sexe,hosp,rea,rad,dc,date))
head(new_data_frame)
##   dep sexe hosp rea rad dc       date
## 1   1    0    2   0   1  0 2020-03-18
## 2   1    1    1   0   1  0 2020-03-18
## 3   1    2    1   0   0  0 2020-03-18
## 4   2    0   41  10  18 11 2020-03-18
## 5   2    1   19   4  11  6 2020-03-18
## 6   2    2   22   6   7  5 2020-03-18

We change the column names (the names of the variables of the data frame) so that they are more explicit:

colnames(new_data_frame)<-c("department","gender","nb_hospitalizations","nb_reanimations","nb_returned_home","nb_deaths","date")
head(new_data_frame)
##   department gender nb_hospitalizations nb_reanimations nb_returned_home
## 1          1      0                   2               0                1
## 2          1      1                   1               0                1
## 3          1      2                   1               0                0
## 4          2      0                  41              10               18
## 5          2      1                  19               4               11
## 6          2      2                  22               6                7
##   nb_deaths       date
## 1         0 2020-03-18
## 2         0 2020-03-18
## 3         0 2020-03-18
## 4        11 2020-03-18
## 5         6 2020-03-18
## 6         5 2020-03-18

We want to indicate that the gender variable is a factor variable, because we do not want to treat its values as integers (numerical values). Recall that a factor variable is a vector taking a small number of distinct values; it is used to categorize the data. We know from the information file on data.gouv.fr that 0 stands for males & females, 1 stands for males only, and 2 for females only.

new_data_frame$gender<-factor(new_data_frame$gender,labels=c("males & females","males","females"))
head(new_data_frame)
##   department          gender nb_hospitalizations nb_reanimations
## 1          1 males & females                   2               0
## 2          1           males                   1               0
## 3          1         females                   1               0
## 4          2 males & females                  41              10
## 5          2           males                  19               4
## 6          2         females                  22               6
##   nb_returned_home nb_deaths       date
## 1                1         0 2020-03-18
## 2                1         0 2020-03-18
## 3                0         0 2020-03-18
## 4               18        11 2020-03-18
## 5               11         6 2020-03-18
## 6                7         5 2020-03-18
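
When only labels is given, factor() pairs the labels with the sorted distinct values of the variable. Spelling out levels as well makes the mapping explicit. A minimal sketch on a made-up vector of codes (codes is not part of the dataset):

```r
# Made-up gender codes using the same coding as the dataset (0, 1, 2)
codes <- c(0, 1, 2, 2, 1, 0, 0)

# levels lists the raw values, labels gives the name shown for each of them
gender <- factor(codes,
                 levels = c(0, 1, 2),
                 labels = c("males & females", "males", "females"))

print(gender)
# table() counts the number of observations in each category
print(table(gender))
```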

3.3. Summary statistics

We can generate summary statistics of the data by calling the built-in function summary():

summary(new_data_frame)
##   department                    gender      nb_hospitalizations
##  Length:172630      males & females:57732   Min.   :   0       
##  Class :character   males          :57732   1st Qu.:  15       
##  Mode  :character   females        :57166   Median :  48       
##                                             Mean   : 112       
##                                             3rd Qu.: 128       
##                                             Max.   :3281       
##  nb_reanimations  nb_returned_home   nb_deaths           date           
##  Min.   :  0.00   Min.   :    0    Min.   :   0.0   Min.   :2020-03-18  
##  1st Qu.:  1.00   1st Qu.:  197    1st Qu.:  43.0   1st Qu.:2020-08-06  
##  Median :  5.00   Median :  581    Median : 132.0   Median :2020-12-25  
##  Mean   : 16.96   Mean   : 1365    Mean   : 311.3   Mean   :2020-12-25  
##  3rd Qu.: 17.00   3rd Qu.: 1501    3rd Qu.: 365.0   3rd Qu.:2021-05-16  
##  Max.   :855.00   Max.   :25253    Max.   :4759.0   Max.   :2021-10-04

3.4. Analyzing the data with dplyr

Now that we have formatted the dataset in a data frame, we can start exploring it in more detail. Let us say that we want to know and show the total number of hospitalizations, deaths, or people in critical care by department or at a specific date. To do so, we are going to use an R package that has been created to analyze data frames. The package is dplyr, you can find more information about it here. If the package is not already installed on your machine, you need to install it.

To do so, use the command:

install.packages("dplyr")

Then, to use the functions of this package, we load dplyr at the beginning of the script by calling the function library().

library(dplyr)

The main functions of the package are:

  1. select() : Select certain columns of data.
  2. filter() : Filter your data to select specific rows.
  3. arrange() : Arrange the rows of your data into an order.
  4. mutate() : Mutate your data frame to contain new columns, useful to create new variables.
  5. summarize() : Summarize chunks of your data in some way.
  6. group_by() : Group data by a variable, ideally a factor variable.

dplyr works with so-called pipes (%>%). A pipe takes the output of one function and feeds it to the first argument of the next function. This way, we can combine the above functions, as we will see shortly.
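
As a small illustration of what a pipe does, the sketch below writes the same two operations first as nested calls and then with pipes. The data frame df is made up for the example.

```r
library(dplyr)

# Small made-up data frame for illustration
df <- data.frame(department = c("1", "2", "75"), nb_deaths = c(10, 25, 40))

# Nested calls: they are read from the inside out
nested <- head(filter(df, nb_deaths > 15), 1)

# Same operations with pipes: they are read from left to right
piped <- df %>% filter(nb_deaths > 15) %>% head(1)

# Check that both ways of writing give the same result
print(identical(nested, piped))
```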

To select some variables of a data frame, we can call the dplyr function select(), and using pipes, we go:

#Here we create a new data frame consisting of the variables nb_hospitalizations, nb_reanimations, nb_deaths, and department only.
new_data_frame_2 <- new_data_frame %>% select(c(nb_hospitalizations,nb_reanimations,nb_deaths,department))
head(new_data_frame_2)
##   nb_hospitalizations nb_reanimations nb_deaths department
## 1                   2               0         0          1
## 2                   1               0         0          1
## 3                   1               0         0          1
## 4                  41              10        11          2
## 5                  19               4         6          2
## 6                  22               6         5          2

If we want to show the data for one department only, let us say the department “75”, we can use filter():

new_data <- new_data_frame %>% filter(department == "75")
head(new_data)
##   department          gender nb_hospitalizations nb_reanimations
## 1         75 males & females                 359             105
## 2         75           males                 217              70
## 3         75         females                 139              35
## 4         75 males & females                 453             122
## 5         75           males                 279              85
## 6         75         females                 170              37
##   nb_returned_home nb_deaths       date
## 1               40        14 2020-03-18
## 2               22        10 2020-03-18
## 3               18         4 2020-03-18
## 4               62        22 2020-03-19
## 5               31        16 2020-03-19
## 6               30         6 2020-03-19

We can summarize the data by department for males & females only at a specific date using the filter() function. For instance, to show the number of hospitalizations for males and females as of 2021-08-08, we write:

new_data <-  new_data_frame %>% filter(date=="2021-08-08" & gender=="males & females") %>% select(c(nb_hospitalizations,department))
head(new_data,10)
##    nb_hospitalizations department
## 1                   68          1
## 2                   37          2
## 3                   15          3
## 4                   43          4
## 5                   23          5
## 6                  249          6
## 7                   26          7
## 8                    2          8
## 9                   31          9
## 10                  26         10

We can achieve about the same result (the difference is caused by unspecified gender values) by filtering out the "males & females" rows and then summing the values for males and females by department for the same date as above. To sum by group, we first group the observations by a variable (department in this case) using the function group_by(), and then we call summarize(). More information on summarize can be found here:

new_data <-  new_data_frame %>%  filter(date=="2021-08-08" & gender!="males & females")  %>% select(c(nb_hospitalizations,department)) %>% group_by(department)  %>% summarize(sum_nb_hostpitalizations_20210808=sum(nb_hospitalizations))
head(new_data,10)
## # A tibble: 10 x 2
##    department sum_nb_hostpitalizations_20210808
##    <chr>                                  <int>
##  1 1                                         68
##  2 10                                        26
##  3 11                                        75
##  4 12                                        25
##  5 13                                       656
##  6 14                                        73
##  7 15                                         4
##  8 16                                        18
##  9 17                                        81
## 10 18                                         9
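
Notice that the departments appear in the order 1, 10, 11, and so on. This is because department is stored as text, and text is sorted character by character. A short sketch of the difference, on made-up codes:

```r
# Department codes stored as character strings
dep_codes <- c("1", "2", "10", "11", "3")

# Text sorting compares character by character, so "10" comes before "2"
print(sort(dep_codes))

# Sorting the numeric values gives the natural order
# (a code that is not a number would become NA in this conversion)
print(sort(as.numeric(dep_codes)))
```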

To show the number of deaths by gender as of a certain date, we combine filter(), select(), group_by(), and summarize():

new_data <- new_data_frame %>%  filter(date=="2021-08-08" & gender!="males & females") %>% select(c(nb_deaths,gender)) %>% group_by(gender) %>% summarize(tot_deaths_by_gender_20210808=sum(nb_deaths))
head(new_data,10)
## # A tibble: 2 x 2
##   gender  tot_deaths_by_gender_20210808
##   <fct>                           <int>
## 1 males                           49185
## 2 females                         35922

Let us say we want to create a new variable that is equal to the ratio between the total number of hospitalizations and the total number of deaths. To create a new variable with dplyr, we can use the function mutate().

new_data <- new_data_frame %>%  filter(nb_deaths>0) %>%  mutate(ratio_hospitalizations_to_deaths=nb_hospitalizations/nb_deaths)
head(new_data,10)
##    department          gender nb_hospitalizations nb_reanimations
## 1           2 males & females                  41              10
## 2           2           males                  19               4
## 3           2         females                  22               6
## 4           6 males & females                  25               1
## 5           6         females                  10               0
## 6          11 males & females                   8               7
## 7          11           males                   6               5
## 8          11         females                   2               2
## 9          13 males & females                  98              11
## 10         13           males                  50               6
##    nb_returned_home nb_deaths       date ratio_hospitalizations_to_deaths
## 1                18        11 2020-03-18                         3.727273
## 2                11         6 2020-03-18                         3.166667
## 3                 7         5 2020-03-18                         4.400000
## 4                47         2 2020-03-18                        12.500000
## 5                28         2 2020-03-18                         5.000000
## 6                 9         3 2020-03-18                         2.666667
## 7                 2         2 2020-03-18                         3.000000
## 8                 7         1 2020-03-18                         2.000000
## 9                59         4 2020-03-18                        24.500000
## 10               35         2 2020-03-18                        25.000000

Next, we see how to show, with plots, some features and insights about our data.

Before that, let us now practice a bit. Show the following information:

  1. Show for females only the number of reanimations as of end of 2020. A possible solution code is hidden below.
exo_1 <- new_data_frame %>%  filter(gender=="females" & date=="2020-12-31") %>% select(c(nb_reanimations)) 
head(exo_1,10)
  2. Show the time series of total deaths for males & females for the department 44, that is, all the dates with the corresponding total. A possible solution code is hidden below.
exo_2 <- new_data_frame %>%  filter(gender=="males & females" & department=="44") %>% select(c(date,nb_deaths))
head(exo_2,10)
  3. By department, over the entire period, show the ratio of the number of male deaths to the number of female deaths. A possible solution code is hidden below.
exo_3 <- new_data_frame %>%  filter(gender!="males & females") %>% select(c(department,nb_deaths,gender)) %>% group_by(department,gender) %>% summarize(mysum=sum(nb_deaths))
exo_3 <- exo_3  %>% group_by(department) %>% summarize(ratio = mysum[gender=="males"]/mysum[gender=="females"])
head(exo_3,10)
  4. Based on the data of point 3, show the ten departments with the lowest ratio.
exo_4 <- exo_3  %>% arrange(ratio)
head(exo_4,10)

3.5. Visualizing the data with ggplot2

We use the package ggplot2 to turn variables into graphics. More information about the package can be found here. We start by installing the package and loading its functions into R.

install.packages("ggplot2")
library(ggplot2)

Now, we can use the package’s functions. Here you can find a great cheatsheet summarizing the key functions, what their arguments are, and how to call them.

Let us start with something simple. We want to show the total number of deaths for males and females together over time on a chart (that is we take the sum across departments). To that end, we use first dplyr to get the data we are after and then we construct a plot with ggplot.

#We retrieve the total number of deaths for the group (males & females) and sum across department for each available date
new_data <- new_data_frame %>%  filter(gender=="males & females") %>% select(c(nb_deaths,date,department)) %>% group_by(date) %>%  summarize(sum_nb_deaths=sum(nb_deaths))
#Then we show it on a graph using the ggplot() function (package ggplot2) - we draw a line using geom_line()
ggplot(data=new_data,aes(x=date,y=sum_nb_deaths))+ geom_line()

Notice the way we add features to a plot with ggplot(): each additional element (a geometry, a theme, labels, a scale) is appended to the plot with the sign “+”.
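
Since a plot is an R object, it can also be stored in a variable and extended step by step. A minimal sketch, on made-up data (toy_data and p are names chosen for illustration):

```r
library(ggplot2)

# Made-up data for illustration only
toy_data <- data.frame(
  date = as.Date("2021-01-01") + 0:4,
  sum_nb_deaths = c(10, 12, 15, 19, 24)
)

# Start with the base plot: data and aesthetic mapping only
p <- ggplot(data = toy_data, aes(x = date, y = sum_nb_deaths))

# Add elements one at a time with +
p <- p + geom_line()
p <- p + labs(x = "Date", y = "Number of deaths")

# Printing the object draws the plot
print(p)
```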

We can change the theme of the plot, for instance we indicate that we want to use the theme classic.

ggplot(data=new_data,aes(x=date,y=sum_nb_deaths))+ geom_line() + theme_classic()

We can change the color and size of the line we draw.

ggplot(data=new_data,aes(x=date,y=sum_nb_deaths))+ geom_line(colour = "blue", size = 1) + theme_classic()

We can add a title, change the axis labels, add more values on the vertical axis, and show the axis labels vertically.

ggplot(data=new_data,aes(x=date,y=sum_nb_deaths)) + geom_line(colour = "blue") + theme_classic() + labs(x = "Date", y="Number of deaths")+ ggtitle("Number of deaths from COVID in France over time")+ theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))+theme(plot.title = element_text(hjust = 0.5)) +scale_x_date(date_breaks = "1 month") +scale_y_continuous(n.breaks=30)

Now, we want to show the same chart but for the three gender groups (males & females, males, and females):

#We retrieve the total number of deaths for the groups (males & females, males, and females) and sum across department for each available date and group.
new_data <- new_data_frame %>% select(c(nb_deaths,date,gender)) %>% group_by(date,gender) %>% summarize(sum_nb_deaths=sum(nb_deaths))
#Then we show it on a graph using the ggplot() function (package ggplot2)
ggplot(data=new_data,aes(x=date,y=sum_nb_deaths,group=gender)) + geom_line(aes(color=gender))+ theme_classic() + labs(x = "Date", y="Number of deaths")+ ggtitle("Number of deaths from COVID in France over time by gender")+ theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))+theme(plot.title = element_text(hjust = 0.5))+scale_x_date(date_breaks = "1 month")+scale_y_continuous(n.breaks=30)+scale_color_discrete("Gender category")

We can also show the total number of deaths by gender (males vs. females) in the form of a bar chart (see the function geom_col()) as of a specific date:

#We retrieve the total number of deaths for the groups (males, and females) and sum across department at a given date
new_data <- new_data_frame %>%  filter(date=="2021-07-08" & gender!="males & females") %>% select(c(nb_deaths,gender)) %>% group_by(gender)  %>% summarize(sum_nb_deaths=sum(nb_deaths))
#we use the geom_col() function of ggplot, see R help
ggplot(data=new_data,aes(x=gender,y=sum_nb_deaths)) + geom_col(aes(fill=gender))+ theme_classic() +  ggtitle("Total number of deaths by gender as of 2021-07-08")+labs(fill = "Gender category",x="", y="Number of deaths")

We can show the observations we have in our data set for the number of hospitalizations over time, for males and females, for the department 44, as points or dots.

new_data <- new_data_frame %>% filter(department=="44" & gender!="males & females")  %>% select(c(nb_hospitalizations,gender,date))
#we use the geom_point() function of ggplot, see R help
ggplot(data=new_data,aes(x=date,y=nb_hospitalizations,fill=gender)) + geom_point(size=1, shape=23)+ theme_classic() + ggtitle("Number of hospitalizations in department 44 over time by gender")+labs(fill = "Gender categories",x="Date", y="Number of hospitalizations")+theme(plot.title = element_text(hjust = 0.5))+scale_x_date(date_breaks = "1 month")+theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))+scale_y_continuous(n.breaks=30)

Finally, we can create a heatmap for the total number of deaths for males or females over a date range:

new_data <- new_data_frame %>% filter(gender!="males & females" & date>="2020-08-08" & date<="2021-08-08") %>%  select(c(nb_deaths,gender,date)) %>% group_by(date,gender) %>% summarize(sum_nb_deaths=sum(nb_deaths))
ggplot(new_data, aes(x = date, y = gender, fill=sum_nb_deaths)) + geom_tile() + theme(plot.title = element_text(hjust = 0.5)) + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) + labs(fill="Nb deaths", x="Date", y="Gender category") + ggtitle("Nb deaths by date and gender")

Notice that more themes are available if we install and load the package ggthemes.

install.packages("ggthemes")
#We load the package
library(ggthemes)

We can now show the observations we have in our data set for the number of hospitalizations over time, for males and females, for the department 44, as points or dots, using the theme The Economist (the magazine):

new_data <- new_data_frame %>% filter(department=="44" & gender!="males & females")  %>% select(c(nb_hospitalizations,gender,date))
#we use the geom_point() function of ggplot, see R help
ggplot(data=new_data,aes(x=date,y=nb_hospitalizations,fill=gender)) + geom_point(size=1, shape=23)+ theme_economist() + ggtitle("Number of hospitalizations in department 44 over time by gender")+labs(fill = "Gender categories",x="Date", y="Number of hospitalizations")+theme(plot.title = element_text(hjust = 0.5))+scale_x_date(date_breaks = "1 month")+theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))+scale_y_continuous(n.breaks=30)

3.6. Merging datasets

We wish to add the population per department to our dataset to be able to scale the number of deaths or hospitalizations. To that end, we download the following xlsx file from INSEE. We only keep the 2021 sheet, rename some columns, and save it under pop_2021.csv (available on Blackboard).

#see a bit later why we need to load the package stringr
library(stringr)
library(dplyr)
#retrieve data
pop_data <- read.csv("pop2021.csv", sep=";")
#change the column names of the data frame
names(pop_data) <- c("department","name_dep","pop")
head(pop_data)
##   department                name_dep       pop
## 1          1                     Ain   662 244
## 2          2                   Aisne   525 503
## 3          3                  Allier   331 745
## 4          4 Alpes-de-Haute-Provence   165 702
## 5          5            Hautes-Alpes   140 022
## 6          6         Alpes-Maritimes 1 089 270
#because the population is stored as a string with spaces between the digits, R cannot interpret it as a number.
#We need to remove these spaces - to do so we load the package stringr and, in particular, use its function str_replace_all().
pop_data$pop<-str_replace_all(pop_data$pop, pattern=" ", replacement="")
#then we convert the strings to numerical values, using the function as.numeric()
pop_data$pop<-as.numeric(pop_data$pop)
head(pop_data)
##   department                name_dep     pop
## 1          1                     Ain  662244
## 2          2                   Aisne  525503
## 3          3                  Allier  331745
## 4          4 Alpes-de-Haute-Provence  165702
## 5          5            Hautes-Alpes  140022
## 6          6         Alpes-Maritimes 1089270
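
The cleaning step above can be tried on a small vector before applying it to the full column. The sketch below uses made-up values written in the same format as the pop column; it assumes that the thousands separators are ordinary spaces (a file that uses non-breaking spaces would need a different pattern).

```r
library(stringr)

# hypothetical values, formatted like the pop column of the INSEE file
pop_raw <- c("662 244", "525 503", "1 089 270")

# as.numeric() cannot parse strings that contain spaces: it returns NA (with a warning)
suppressWarnings(as.numeric(pop_raw))

# we remove every space, then convert the result to numbers
pop_clean <- as.numeric(str_replace_all(pop_raw, pattern = " ", replacement = ""))
pop_clean
```

Checking the result with is.numeric() or str() before merging is a cheap way to catch values that failed to convert.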

Next we merge the data we have: we add the department population to the COVID dataset. To do so, we use the base R function merge():

#we add the columns of pop_data to new_data_frame by matching the rows on the department column
new_data <- merge(new_data_frame, pop_data, by.x = "department", by.y = "department")
head(new_data)
##   department          gender nb_hospitalizations nb_reanimations
## 1          1         females                  13               0
## 2          1           males                  49               4
## 3          1         females                 155               4
## 4          1 males & females                 223              21
## 5          1 males & females                 161              14
## 6          1 males & females                 145              16
##   nb_returned_home nb_deaths       date name_dep    pop
## 1              222        44 2020-08-16      Ain 662244
## 2             1418       364 2021-06-20      Ain 662244
## 3              716       147 2020-12-24      Ain 662244
## 4             1677       411 2021-01-15      Ain 662244
## 5             2282       547 2021-03-19      Ain 662244
## 6             2770       608 2021-05-23      Ain 662244
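
By default, merge() keeps only the rows whose key appears in both data frames, so a department that is missing from the population file would silently disappear from the result. The minimal sketch below, on made-up data (the toy data frames and the department "999" are hypothetical), contrasts the default behaviour with all.x = TRUE, which keeps the unmatched rows of the first data frame:

```r
# hypothetical toy data: department "999" has no population entry
covid_toy <- data.frame(department = c("1", "2", "999"),
                        nb_deaths  = c(10, 20, 5))
pop_toy   <- data.frame(department = c("1", "2"),
                        pop        = c(1000, 2000))

# default: only departments present in both data frames are kept
inner <- merge(covid_toy, pop_toy, by = "department")

# all.x = TRUE: every row of covid_toy is kept, pop is NA when there is no match
left <- merge(covid_toy, pop_toy, by = "department", all.x = TRUE)

nrow(inner)
left
```

Comparing nrow() before and after a merge is a simple way to detect rows lost to unmatched keys.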

We can now compute the ratio of COVID deaths to population (expressed as a percentage) by department at a given date, which makes the figures comparable across departments. We sort departments by this ratio in descending order, using the arrange() function, and display the ten departments with the highest ratios:

new_data2<- new_data %>% filter(date=="2021-07-08" & gender=="males & females") %>% mutate(deaths_pop_ratio=(nb_deaths/pop)*100) %>% select(c(department,name_dep,deaths_pop_ratio))
new_data2<- new_data2  %>%  arrange(-deaths_pop_ratio)
head(new_data2,10)
##    department              name_dep deaths_pop_ratio
## 1          90 Territoire de Belfort        0.4329348
## 2          88                Vosges        0.2393938
## 3          57               Moselle        0.2321752
## 4          52           Haute-Marne        0.2211245
## 5          55                 Meuse        0.2133852
## 6          75                 Paris        0.2128488
## 7           2                 Aisne        0.2121777
## 8          94          Val-de-Marne        0.2105639
## 9          71        Saône-et-Loire        0.2021269
## 10         68             Haut-Rhin        0.1991356
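
The same mutate() and arrange() logic can be checked on a few made-up rows where the expected ranking is easy to compute by hand. This is only a sketch: the toy data frame is hypothetical, and desc() is used as an equivalent, slightly more explicit way of sorting in descending order.

```r
library(dplyr)

# hypothetical toy data: three departments with deaths and population
toy <- data.frame(name_dep  = c("A", "B", "C"),
                  nb_deaths = c(50, 10, 30),
                  pop       = c(10000, 1000, 30000))

ranked <- toy %>%
  mutate(deaths_pop_ratio = (nb_deaths / pop) * 100) %>%  # ratio as a percentage
  arrange(desc(deaths_pop_ratio))                         # highest ratio first

head(ranked, 2)
```

Here B comes first (1%), then A (0.5%), then C (0.1%), even though A has the largest number of deaths.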

Finally, we can create a pie chart out of the data we constructed:

#Basic pie chart
ggplot(data=new_data2[1:10,], aes(x="", y=deaths_pop_ratio, fill=name_dep)) +
  geom_bar(stat="identity", width=1, color="white") +
  coord_polar("y", start=0) +
  theme_void()

3.7. Exercises

Now that we know how to import data into a data frame, prepare our dataset, and produce graphical outputs, let us practice. Using the same dataset, complete the following tasks (a possible solution is hidden; click the Code button to reveal it).

Exercise 1

Create a bar plot showing the number of deaths by gender (male or female) as of the end of March 2021.

new_data <- new_data_frame %>% filter(date=='2021-03-31'& gender!="males & females") %>% select(c(nb_deaths,gender)) %>% group_by(gender) %>% summarize(sum_nb_deaths=sum(nb_deaths))
ggplot(data=new_data,aes(gender,sum_nb_deaths)) + geom_col(aes(fill=gender))+ theme_classic() +  ggtitle("Total number of deaths from COVID as of 2021/03/31")+labs(fill = "Gender categories",x="", y="Number of deaths")

Exercise 2

Show the number of patients in intensive care (reanimations) for males & females, as a % of the population, over time in the form of a line plot. Hint: we need to summarize across departments.

new_data <- merge(new_data_frame, pop_data, by.x = "department", by.y = "department")
#summarize() returns one row per date (mutate() would repeat the national total on every department row)
new_data2 <- new_data %>% filter(gender=="males & females") %>%  group_by(date) %>% summarize(sum_reanimations=sum(nb_reanimations), sum_pop=sum(pop)) %>% mutate(pct_reanimations=(sum_reanimations/sum_pop)*100) %>%  select(c(pct_reanimations,date))

ggplot(data=new_data2,aes(x=date,y=pct_reanimations)) + 
  geom_line(size=1) + theme_economist() + 
  ggtitle("% French population in intensive care over time")+
  theme(plot.title = element_text(hjust = 0.5))+
  scale_x_date(date_breaks = "1 month")+
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))+
  scale_y_continuous(n.breaks=20)+
  theme(axis.title.x = element_blank(),axis.title.y = element_blank())
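
To make the hint of this exercise concrete, here is a minimal sketch of summarizing across departments on made-up data (two hypothetical departments observed on two dates). After group_by(date), summarize() collapses the departments into a single row per date:

```r
library(dplyr)

# hypothetical toy data: two departments observed on two dates
toy <- data.frame(
  date            = as.Date(c("2021-01-01", "2021-01-01", "2021-01-02", "2021-01-02")),
  nb_reanimations = c(5, 15, 10, 30),
  pop             = c(1000, 3000, 1000, 3000)
)

daily <- toy %>%
  group_by(date) %>%
  summarize(sum_reanimations = sum(nb_reanimations),  # total across departments
            sum_pop          = sum(pop)) %>%
  mutate(pct_reanimations = (sum_reanimations / sum_pop) * 100)

daily
```

The result has two rows (one per date), with 0.5% on the first date and 1% on the second.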

Exercise 3

Go to Kaggle, register, and download a dataset (.csv), load it into a data frame with R, and explore it with dplyr & ggplot (i.e., generate summary statistics and plots). For instance, you can find the Forbes list of the world's billionaires in 2018 there, along with some information on each billionaire in the list, and then produce summary statistics by country, gender, or industry.

3.8. Visualizing data on a map

Because we have the information by department, representing it on a map can be a quick way to convey the gist of the data and share some insights. The package raster contains a collection of functions that help us design a map and customize it based on the values per department found in our dataset. We need to import raster to get the coordinates of regions and departments.

install.packages("raster")
library(raster)

We can use the package raster to create a map of the percentage of a department’s population that died from COVID-19 as of March 2021:

#We restart from the step in which we add population data
new_data <- merge(new_data_frame, pop_data, by.x = "department", by.y = "department")
#we extract data at the department level as of end of March 2021
covid_death_per_dep <- new_data %>% filter(date=="2021-03-31" & gender=="males & females") %>% mutate(deaths_pop_ratio=(nb_deaths/pop)*100) %>% select(c(department,name_dep,deaths_pop_ratio))

#We load the map data of France - at the department level, see getData(). The object we get is a geospatial data frame.
formes <- getData(name="GADM", country="FRA", level=2)
#looking at the data, we see that NAME_2 gives the list of the departments in formes
head(formes)
##   GID_0 NAME_0   GID_1               NAME_1 NL_NAME_1     GID_2      NAME_2
## 1   FRA France FRA.1_1 Auvergne-Rhône-Alpes      <NA> FRA.1.1_1         Ain
## 5   FRA France FRA.1_1 Auvergne-Rhône-Alpes      <NA> FRA.1.2_1      Allier
## 6   FRA France FRA.1_1 Auvergne-Rhône-Alpes      <NA> FRA.1.3_1     Ardèche
## 7   FRA France FRA.1_1 Auvergne-Rhône-Alpes      <NA> FRA.1.4_1      Cantal
## 8   FRA France FRA.1_1 Auvergne-Rhône-Alpes      <NA> FRA.1.5_1       Drôme
## 9   FRA France FRA.1_1 Auvergne-Rhône-Alpes      <NA> FRA.1.6_1 Haute-Loire
##      VARNAME_2 NL_NAME_2      TYPE_2  ENGTYPE_2 CC_2 HASC_2
## 1         <NA>      <NA> Département Department   01  FR.AI
## 5 Basses-Alpes      <NA> Département Department   03  FR.AL
## 6         <NA>      <NA> Département Department   07  FR.AH
## 7         <NA>      <NA> Département Department   15  FR.CL
## 8         <NA>      <NA> Département Department   26  FR.DM
## 9         <NA>      <NA> Département Department   43  FR.HL
#our COVID dataset and our map data both contain department names, which we can use to match them
#We cannot use the merge function here because formes is a geospatial data frame.
#Hence, we use the following procedure to add the COVID data to the map data:
#we create an index linking the departments of both datasets.
#we make use of the match function of R: https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/match
# match() returns a vector of the positions of (first) matches of its first argument in its second.
idx <- match(formes$NAME_2, covid_death_per_dep$name_dep)
#Do head(idx): idx tells us, for instance, that the first name of formes$NAME_2 has to be linked to the first name of covid_death_per_dep$name_dep,
#but the second name of formes$NAME_2 to the third name of covid_death_per_dep$name_dep
#so idx gives us, for each department name of formes, its row position in covid_death_per_dep
#we extract deaths_pop_ratio from covid_death_per_dep for the positions given by idx
concordance <- covid_death_per_dep[idx, "deaths_pop_ratio"]
#and can now add them in the right order to the geospatial data frame
formes$deaths_pop_ratio <- concordance

#we select the map colors
couleurs <- colorRampPalette(c('white', 'red'))

#we call spplot to draw a map that shows deaths_pop_ratio by department, using a color scale to indicate the severity of the COVID effect.
spplot(formes, "deaths_pop_ratio",col.regions=couleurs(100),  main=list(label="% population of department that died from COVID-19 as of 2021-03-31",cex=.8))
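
The match() step used above is easier to follow on a tiny example. The sketch below uses made-up department names and ratios (all values are hypothetical) and only base R. It also shows what happens when a name of the map has no counterpart in the data: match() returns NA for it, and the extracted value is NA as well.

```r
# names in the order of the (hypothetical) map data
map_names <- c("Ain", "Allier", "Cantal")

# hypothetical data to attach, in a different order and without Cantal
covid_toy <- data.frame(name_dep         = c("Ain", "Aisne", "Allier"),
                        deaths_pop_ratio = c(0.10, 0.20, 0.30))

# position of each map name in covid_toy$name_dep
idx <- match(map_names, covid_toy$name_dep)
idx

# values re-ordered to follow the map order
concordance_toy <- covid_toy[idx, "deaths_pop_ratio"]
concordance_toy
```

Here idx is 1, 3, NA: Ain is on row 1 of the toy data, Allier on row 3, and Cantal is absent. Running sum(is.na(idx)) after a real match is a quick way to count the departments whose names are spelled differently in the two sources.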

Now we will use the package leaflet to create a similar map. It offers more features; more on leaflet for R at https://rstudio.github.io/leaflet/. It works with map data from websites such as Google Maps or MapBox. To access the map data from MapBox, we need to use the MapBox API. API stands for Application Programming Interface; it is a way to access data shared by websites such as Facebook, Twitter, Spotify, Google Maps, or, in this case, MapBox. Access to the MapBox API, as is often the case for APIs, requires a key. Make sure you create a (free) MapBox account there and retrieve your own API key before working on the code below (the API key I use is hidden; you must supply your own key where MY_API_KEY appears in the code for it to work as intended):

install.packages("leaflet")
library(leaflet)
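#we use the same formes object as we created above (in the raster example)
#we create a MapBox account first and retrieve the token so that we can use the API
#Sys.getenv() reads the token from the environment variable named MY_API_KEY,
#which you can set beforehand with Sys.setenv(MY_API_KEY = "your own token")
m <- leaflet(formes) %>%
 addProviderTiles("MapBox", options = providerTileOptions(
    id = "mapbox.light",
    accessToken = Sys.getenv("MY_API_KEY")))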

#create colors
pal <- colorBin("YlOrRd", domain = formes$deaths_pop_ratio, bins = 8)

#create labels
formes$label <- paste(formes$NAME_2,": ",round(formes$deaths_pop_ratio,2),"% deaths from COVID in the pop",sep="")
labels <- formes$label

#add polygons with colors and labels
m<- m %>% addPolygons(
  fillColor = ~pal(deaths_pop_ratio),
  weight = 2,
  opacity = 1,
  color = "white",
  dashArray = "3",
  fillOpacity = 0.7,
  highlightOptions = highlightOptions(
    weight = 5,
    color = "#666",
    dashArray = "",
    fillOpacity = 0.7,
    bringToFront = TRUE), label = labels, labelOptions = labelOptions(style = list("font-weight" = "normal", padding = "3px 8px"), textsize = "15px",direction = "auto"))

#add legends
m <- m %>% addLegend(pal = pal, values = ~deaths_pop_ratio, opacity = 0.7,position = "bottomright",title = "% of deaths to pop")

#show leaflet map
m
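
A common way to keep an API key out of the code we share is to store it in an environment variable and read it back with Sys.getenv(). The sketch below uses only base R; the variable name MY_API_KEY follows the code above, and the token value is a placeholder that you must replace with your own.

```r
# placeholder value: replace it with the token from your own MapBox account
Sys.setenv(MY_API_KEY = "your-own-token")

# read the token back; the name of the variable is passed as a string
token <- Sys.getenv("MY_API_KEY")

# Sys.getenv() returns an empty string when the variable does not exist,
# so we stop early with a clear message instead of sending an empty key
if (!nzchar(token)) {
  stop("MY_API_KEY is not set: create a MapBox account and store your token first")
}

# a variable that was never set (hypothetical name) comes back as ""
missing_token <- Sys.getenv("SOME_VARIABLE_THAT_IS_NOT_SET_123")
nzchar(missing_token)
```

Instead of calling Sys.setenv() in every session, one can also write the line MY_API_KEY=your-own-token in a .Renviron file, which R reads at startup.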

One can also use the default map provider of leaflet (OpenStreetMap), which does not require an API key. It is nevertheless important to know how to use a package that requires an API key, because this is common among packages retrieving data from online sources, for instance from Facebook or Twitter:

#we use the same formes object as we created above (in the raster example)
#we use the default map provider (OpenStreetMap) that does not require an API key: addTiles() adds its tiles as the base map
m <- leaflet(formes) %>% addTiles()

#create colors
pal <- colorBin("Blues", domain = formes$deaths_pop_ratio, bins = 8)

#create labels
formes$label <- paste(formes$NAME_2,": ",round(formes$deaths_pop_ratio,2),"% deaths from COVID in the pop",sep="")
labels <- formes$label

#add polygons with colors and labels
m<- m %>% addPolygons(
  fillColor = ~pal(deaths_pop_ratio),
  weight = 2,
  opacity = 1,
  color = "white",
  dashArray = "3",
  fillOpacity = 0.7,
  highlightOptions = highlightOptions(
    weight = 5,
    color = "#666",
    dashArray = "",
    fillOpacity = 0.7,
    bringToFront = TRUE), label = labels, labelOptions = labelOptions(style = list("font-weight" = "normal", padding = "3px 8px"), textsize = "15px",direction = "auto"))

#add legends
m <- m %>% addLegend(pal = pal, values = ~deaths_pop_ratio, opacity = 0.7,position = "bottomright",title = "% of deaths to pop")

#show leaflet map
m
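
To see what colorBin() does in the code above, we can call the palette function it returns on a few made-up ratios. This is a sketch on hypothetical values; the domain c(0, 0.5) and the number of bins are assumptions chosen for the example.

```r
library(leaflet)

# hypothetical ratios, including a missing value (e.g., a department without a match)
ratios <- c(0.05, 0.12, 0.21, 0.43, NA)

# colorBin() returns a function that maps numbers to colors, bin by bin
pal <- colorBin("YlOrRd", domain = c(0, 0.5), bins = 4)

# one color code per value; missing values get a separate "NA" color
cols <- pal(ratios)
cols
```

Values that fall in the same bin receive the same color, which is why the legend of the map shows a few discrete classes rather than a continuous gradient.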

3.9. Application

Find a dataset on Kaggle related to COVID-19 from which you can display data on a map. To guide you, we provide below another application with a dataset from Kaggle (a world map this time). We want to show information about vaccination progress across countries in the world. We download this dataset, also available on Blackboard under world_vac.csv (data as of 2021-10-04). The R file to implement this application is available on Blackboard under world_vac_case.R.

world_vac<-read.csv("world_vac.csv",sep=",")
world_vac<- world_vac %>% select(-c(Doses.Administered,Doses.per.1000,Vaccine.being.used.in.a.country))
#we use the rename() function of dplyr to change the name of some columns in our data frame (new name = old name)
world_vac<- world_vac %>% rename(pct_pop_vaccinated=Fully.Vaccinated.Population....)
world_vac<- world_vac %>% rename(country=Country)
head(world_vac)
##         country pct_pop_vaccinated
## 1         World               35.3
## 2         China               75.2
## 3         India               18.2
## 4 United States               56.7
## 5        Brazil               44.9
## 6         Japan               61.0
str(world_vac)
## 'data.frame':    207 obs. of  2 variables:
##  $ country           : chr  "World" "China" "India" "United States" ...
##  $ pct_pop_vaccinated: num  35.3 75.2 18.2 56.7 44.9 61 19.7 54.7 64.8 36.3 ...

Now we draw the map. We use the package rworldmap this time because it suits our purpose.

library(rworldmap)
#Then, we join the vaccination data to the world map of the library by calling the joinCountryData2Map() function
joinData <- joinCountryData2Map( world_vac,
                                 joinCode = "NAME",
                                 nameJoinColumn = "country")
## 199 codes from your data successfully matched countries in the map
## 8 codes from your data failed to match with a country code in the map
## 44 codes from the map weren't represented in your data
joinData <- subset(joinData, continent != "Antarctica")
#we plot the map
theMap <- mapCountryData(joinData, nameColumnToPlot="pct_pop_vaccinated",mapTitle="Percentage of the population vaccinated by country as of 2021-10-04")

Now, it is your turn.

4. Introduction to Shiny

You can find a good tutorial to prepare this part of the course here. For a more complete introduction to Shiny, please refer to this link; I recommend studying it before the class.

First, we install and load Shiny.

install.packages("shiny")
library(shiny)

4.1. A first app with Shiny

To create a Shiny app, a good practice is to create a new directory for your app and put a single R file in it (often called app.R). To create a new directory, use the file explorer in the bottom-right pane of RStudio. All the files (images, csv files) your app needs must be stored in the same folder. We write the following code in app_ex_1.R; the file is available on Blackboard. We report the code below, but its outcome cannot be shown directly in this document: we must run it in RStudio.

#Load the shiny package
library(shiny)
#This is the user interface (ui) part: we set up what the user will see - we define the HTML page users will interact with.
#currently, it shows Hello, world!
ui <- fluidPage(
  "Hello, world!"
)
#This is the server part: we set up the computations we do and how the app reacts to inputs from a user
#it is currently empty, so the app does not do anything
server <- function(input, output, session) {
}
#we create the app - this starts the app on your local machine - later on we will see how to publish this app online.
shinyApp(ui, server)

Run the code in RStudio to see the app it creates.

The code tells Shiny both how our app should look and how it should behave. It defines the user interface and specifies the behavior of our app by defining a server function. Finally, it executes shinyApp(ui, server) to construct and start a Shiny application from ui and server. The app runs locally (http://127.0.0.1), as reported in the console. We will see later on how to publish it online.
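
To illustrate how the ui and the server talk to each other, here is a minimal sketch of an app with one input and one output. It is not one of the course files: the ids name and greeting are arbitrary names chosen for the example.

```r
library(shiny)

# ui: one text box (input id "name") and one text placeholder (output id "greeting")
ui <- fluidPage(
  textInput("name", label = "Your name", value = "world"),
  textOutput("greeting")
)

# server: the output is recomputed every time input$name changes
server <- function(input, output, session) {
  output$greeting <- renderText({
    paste0("Hello, ", input$name, "!")
  })
}

# we build the app object; printing it (or calling runApp(app)) starts the app
app <- shinyApp(ui, server)
```

Typing in the text box changes input$name, which triggers renderText() again and updates the greeting on the page without reloading it.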

Next, we want to add some inputs and outputs to our user interface (ui). We are going to make a very simple app that shows the user all the variables included in a data frame. The user can then pick the variable she wants information on. We will use the following Shiny functions:

  1. fluidPage() is a layout function that sets up the basic visual structure of the page.
  2. varSelectInput() is an input control that lets the user interact with the app by providing a value. In this case, the user sees a select box with the label “Please choose a variable” and chooses one of the variables of the dataset we preload.
  3. verbatimTextOutput() and tableOutput() are output controls that tell Shiny where to put rendered output. verbatimTextOutput() displays text and tableOutput() displays tables.

The steps are the following: we preload the dataset; we preload the variable names in the select box; the user chooses a variable in the select box; based on this choice, the app displays some information about the variable.

We use the same COVID dataset as before: covid_data.csv.

We create app_ex_2.R as follows, also available on Blackboard:

#Load the shiny and dplyr packages
library(shiny)
library(dplyr)

#we get COVID data from our usual source and format it the way we did before
new_data_frame <- read.csv(file="covid_data.csv",header=TRUE,sep=";")
new_data_frame <- new_data_frame %>% select(c(dep,sexe,jour,hosp,rea,rad,dc))
new_data_frame$date<-as.Date(new_data_frame$jour, tryFormats = c("%d/%m/%Y"))
new_data_frame <- new_data_frame %>% select(-jour)
colnames(new_data_frame)<-c("department","gender","Number of hospitalizations","Number of reanimations","Number of returned home","Number of deaths","date")
new_data_frame$gender<-factor(new_data_frame$gender,labels=c("males & females","males","females"))

#create the user interface (ui), this time we specify more things:
ui <- fluidPage(
  #we create a select box containing all the variables of the data frame new_data_frame. The name of this select box is variable.
  varSelectInput("variable", label = "Please choose a variable", new_data_frame),
  #we show the text output called summary that is created by our server function upon an action of the user
  verbatimTextOutput("summary"),
  #we show the table called table that is created by our server function upon an action of the user
  tableOutput("table")
)

#we store the server functions here - where we perform calculations based on the user's choices
server <- function(input, output, session) {
  
  #we define a reactive expression: something that is recomputed when the user changes an input. Here we select the column of the data frame that corresponds to the value chosen by the user in the select box.
  dataset <- reactive({
    new_data_frame  %>% select(c(!!input$variable))
  })
  
  #we define a first output that we want to show: the built-in summary() function of R applied to the column of the data frame selected by the user. We use the renderPrint() function from Shiny so that it is turned into a piece of text to show to the user.
  output$summary <- renderPrint({
    summary(dataset())
  })
  
  #we define a second output that we want to show: the first six rows of the data frame, obtained with head(). We use the renderTable() function from Shiny so that it is turned into a table to show to the user.
  output$table <- renderTable({
    head(dataset())
  })
}

#we create the app
shinyApp(ui, server)

Each render{Type} function is designed to produce a particular type of output (e.g., text, tables, or plots), and is often paired with a {type}Output function. For example, in the above app, renderPrint() is paired with verbatimTextOutput() to display a statistical summary with fixed-width (verbatim) text, and renderTable() is paired with tableOutput() to show the input data in a table.

Run the code in RStudio to see the app it creates.
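
The same pairing logic applies to plots: renderPlot() goes with plotOutput(). As a sketch (not one of the course files), the app below uses the built-in mtcars dataset so that it runs without any csv file; the ids variable and hist are arbitrary. It also shows the !! operator, which lets us use the variable picked in varSelectInput() inside a ggplot2 or dplyr call.

```r
library(shiny)
library(ggplot2)

ui <- fluidPage(
  # select box listing the columns of mtcars (a dataset shipped with R)
  varSelectInput("variable", label = "Please choose a variable", mtcars),
  # placeholder for the plot created by the server under the id "hist"
  plotOutput("hist")
)

server <- function(input, output, session) {
  # renderPlot() is paired with plotOutput(); the plot is redrawn when the choice changes
  output$hist <- renderPlot({
    # input$variable is a symbol: !! injects it as a column name into aes()
    ggplot(mtcars, aes(x = !!input$variable)) +
      geom_histogram(bins = 10)
  })
}

app <- shinyApp(ui, server)
```

Printing app or calling runApp(app) in RStudio starts the app locally.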

4.2. Adding other functionalities and outputs

So far, the user can browse the variables of the data frame and obtain summary information about the data. Now we let the user choose the date range of her choice. To do so, we need to add another select box. Based on the dates and the variable selected, we will show the total by gender in the form of a bar chart. We also restrict the variables the user can explore. We create app_ex_3.R, available on Blackboard (with comments). Run the code in RStudio to see the app it creates.
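
One possible way to let the user pick a range of dates is Shiny's dateRangeInput(), which returns a vector of two dates (start and end). The sketch below is not the content of app_ex_3.R: it uses a made-up data frame, and the helper filter_by_dates() is a hypothetical function written for this example.

```r
library(shiny)

# hypothetical toy data: ten consecutive days
toy <- data.frame(date = as.Date("2021-01-01") + 0:9, nb_deaths = 1:10)

# hypothetical helper: keep the rows whose date lies within the chosen range
filter_by_dates <- function(df, date_range) {
  df[df$date >= date_range[1] & df$date <= date_range[2], ]
}

ui <- fluidPage(
  dateRangeInput("dates", label = "Date range",
                 start = min(toy$date), end = max(toy$date),
                 min = min(toy$date), max = max(toy$date)),
  textOutput("total")
)

server <- function(input, output, session) {
  # input$dates is a vector of two dates: c(start, end)
  output$total <- renderText({
    selected <- filter_by_dates(toy, input$dates)
    paste("Total number of deaths in the period:", sum(selected$nb_deaths))
  })
}

app <- shinyApp(ui, server)
```

Keeping the filtering in a plain function makes it possible to test it in the console, outside of the app, for instance with filter_by_dates(toy, as.Date(c("2021-01-03", "2021-01-05"))).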

Next, we add another plot that shows a time series based on the variable choice. We create app_ex_4.R, available on Blackboard (with comments). Run the code in RStudio to see the app it creates.

4.3. Layout, information, titles, sections, fonts, and colors

We further customize our app. There are many layout options that change the way we present inputs and outputs.

We introduce a menu to navigate the outputs and we add some text to indicate what the dashboard is and what it allows the user to do. We use a panel layout to do so. We create app_ex_5.R, available on Blackboard (with comments). Run the code in RStudio to see the app it creates.

Next, we add a sidebar where the user will specify their inputs, so that they are gathered in one place. Moreover, we play with the theme of the dashboard using the standard themes available from Shiny. We will also let the user exclude department 44 using a checkbox, and restrict our dataset to a total number of deaths lower than a threshold that the user defines. We create app_ex_6.R, available on Blackboard (with comments). Run the code in RStudio to see the app it creates.
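
As an illustration of a sidebar layout, here is a minimal sketch with a checkbox and a numeric threshold. It is not the content of app_ex_6.R: the data frame is made up, and the helper apply_filters() and the input ids are hypothetical names chosen for the example.

```r
library(shiny)

# hypothetical toy data: total number of deaths for three departments
toy <- data.frame(department = c("44", "75", "13"),
                  nb_deaths  = c(120, 900, 400))

# hypothetical helper: apply the two choices of the user to the data
apply_filters <- function(df, exclude_44, max_deaths) {
  if (exclude_44) {
    df <- df[df$department != "44", ]
  }
  df[df$nb_deaths < max_deaths, ]
}

ui <- fluidPage(
  titlePanel("Toy dashboard"),
  sidebarLayout(
    # all the inputs are gathered in the sidebar
    sidebarPanel(
      checkboxInput("exclude_44", "Exclude department 44", value = FALSE),
      numericInput("max_deaths", "Show departments with fewer deaths than", value = 1000, min = 0)
    ),
    # the outputs are shown in the main panel
    mainPanel(tableOutput("table"))
  )
)

server <- function(input, output, session) {
  output$table <- renderTable({
    apply_filters(toy, input$exclude_44, input$max_deaths)
  })
}

app <- shinyApp(ui, server)
```

With the box ticked and a threshold of 500, only department 13 remains in the table.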

4.4. Adding a map

Finally, we add a map to our app, building on what we have learned. We create app_ex_7.R, available on Blackboard (with comments). To see it online, go there.
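
A leaflet map is added to a Shiny app with the pair leafletOutput() and renderLeaflet(), both provided by the leaflet package. The sketch below is not the content of app_ex_7.R: it only shows an OpenStreetMap base map, and the coordinates (roughly the center of mainland France) and the zoom level are assumptions chosen for the example.

```r
library(shiny)
library(leaflet)

ui <- fluidPage(
  # placeholder for the map created by the server under the id "map"
  leafletOutput("map")
)

server <- function(input, output, session) {
  output$map <- renderLeaflet({
    leaflet() %>%
      addTiles() %>%                               # default OpenStreetMap tiles
      setView(lng = 2.35, lat = 46.6, zoom = 5)    # approximate center of France
  })
}

app <- shinyApp(ui, server)
```

The polygons, colors, and legend built in section 3.8 can then be added inside renderLeaflet() in the same way as outside of Shiny.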

4.5. Publishing our app

The easiest way to turn your Shiny app into a web page is to use shinyapps.io, RStudio’s hosting service for Shiny apps. shinyapps.io lets us upload our app straight from our R session to a server hosted by RStudio. We have complete control over our app including server administration tools. To find out more about shinyapps.io, please visit the above link. We need to register there, then log in.

Then we connect Shiny to the freshly created web space by going to our account and then retrieving the token information from there. We use the token information as explained here.

Once our publishing account is properly configured, we do a test: we publish the application app_ex_7.R on our web space. Open app_ex_7.R in RStudio, then click the Publish Application button. Select app_ex_7.R and the files required for the data analysis (they should be selected by default), then validate.

Pick My_App_7 as the title of the application. Once it is successfully published, you should be able to access the webpage at the following URL: https://YOURUSERNAME.shinyapps.io/My_App_7/. Of course, replace YOURUSERNAME with the user name you chose when you created your own account on shinyapps.io.

To see the one hosted on my shinyapps server, click here.
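
The publishing steps can also be scripted with the rsconnect package instead of the Publish Application button. The sketch below wraps the two calls in a hypothetical helper, publish_app(); the account name, token, and secret are placeholders to be copied from your own shinyapps.io account, and nothing is uploaded until the helper is actually called.

```r
library(rsconnect)

# hypothetical helper: register the account, then upload the app directory
publish_app <- function(app_dir, app_name, account, token, secret) {
  # links this R installation to the shinyapps.io account
  setAccountInfo(name = account, token = token, secret = secret)
  # uploads the content of app_dir and deploys it under the name app_name
  deployApp(appDir = app_dir, appName = app_name)
}

# example call (commented out: it needs real credentials and an app directory)
# publish_app("path/to/my_app", "My_App_7", "YOURUSERNAME", "MY_TOKEN", "MY_SECRET")
```

As with the MapBox key, it is safer not to write the token and the secret directly in a file that you share.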

We now know how to create and publish an interactive dashboard with R.

A final touch: say we want to generate and share a QR code for our online application. We can easily do that with R too, using the code below (after installing the packages qrcode and imager):

#create qrcode
library("qrcode")
png("qrplot.png")
qrcode_gen("https://agarel86.shinyapps.io/My_App_7/")
dev.off()
#show it
library("imager")
im<-load.image("qrplot.png")
plot(im)
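
Depending on the version of the qrcode package installed, qrcode_gen() may trigger a deprecation warning. To my knowledge, recent versions of the package provide qr_code() together with a plot() method instead; the sketch below assumes such a version and uses a placeholder URL.

```r
library(qrcode)

# placeholder URL: replace YOURUSERNAME with your own shinyapps.io user name
code <- qr_code("https://YOURUSERNAME.shinyapps.io/My_App_7/")

# we save the QR code as a png file in the working directory
png("qrplot.png")
plot(code)
dev.off()
```

The resulting file qrplot.png can then be inserted in a slide or a report.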

5. Cases

Now we will see, step by step, how to create and publish an online interactive dashboard with R (Shiny) from scratch using other datasets. Then, as your assignment for this course, you will have to create your own application using different data, with an appropriate Shiny interface and appropriate Shiny server computations.

Let us consider the three following applications. For each of them, there is a commented R file that creates the associated application and a link to an online version of the app:

  1. We want to show information about the possible side effects from COVID vaccines. The commented R code is available on Blackboard under case_1.R, and an online version is here. It requires the data file covid side effects US.csv available on Blackboard.

  2. We want to look at the emotion and sentiment conveyed by COVID-related tweets. You can find the commented R code on Blackboard under case_2.R, and an online version is here. It requires the data file tweetid_userid_keyword_sentiments_emotions_France.csv available on Blackboard.

  3. We want to show, for a given city, on a map, all the places where you can get vaccinated. You can find the commented R code on Blackboard under case_3.R, and an online version is here. It requires the data file centres_vaccinations.csv available on Blackboard.

6. Assessment

As a group of three, create your own online interactive dashboard with R. The only constraint: it has to be related to COVID-19 data in France or in another country. The dashboard should allow the user to visualize the data in the most complete and relevant way and respond to the user’s inputs. You can use the data source of your choice; feel free to consider data.gouv.fr or Kaggle, and there are many other sources.

The marking criteria (out of 20) are:

  1. Completeness: steps from getting the data to publishing an interactive dashboard online - 6 marks.
  2. Data analysis: data used, data processing, choices given to the user, relevance of the data and of the analyses - 4 marks.
  3. Visuals: quality and appropriateness of the visuals (chart, map) used to render the information - 4 marks.
  4. Originality: ability to depart from the taught material and incorporate new features / analyses / tools - 3 marks.
  5. Readability: readability of the information on the dashboard - 1 mark.
  6. Comments: clarity and relevance of the comments in the code - 1 mark.
  7. Guidance: clarity and relevance of the guidance/instructions given to the user - 1 mark.

The mark is worth 25% of the final grade for the course. The due date is December 25, 2021.

Good luck!

7. Contact information

You can contact your lecturer, Alexandre Garel, at . My personal website is there.