This course aims to teach you the fundamentals of programming with R, and then how to use Shiny to build an online interactive dashboard.
This link shows a well-known example of an elaborate interactive dashboard covering the COVID-19 pandemic in France. Something similar could be built with R and Shiny. Here is another example of what can be produced with R. Another example of a dashboard can be found here.
Dashboards are a great way to display metrics, explore a dataset, and visualize data in an interactive fashion (interactive means that the information presented to the user reacts to the user’s inputs).
Shiny provides a web framework for building interactive web applications with R. It works well on computers, tablets, and mobile devices, and it lets you develop and publish a data analysis app.
In this course, you will learn how to create and publish an online interactive dashboard with R (Shiny). In 12 hours, we will go through all the steps required to build a basic online interactive dashboard. Your assessment will consist of developing your own dashboard and publishing it (as a team of 3).
Please also consider this course as an opportunity to discover the R programming language and to open up to all the things you can create with it. For instance, the website supporting this course has been developed with R (using the Markdown package).
Finally, note that this is not a course where you can stay idle until a couple of days before the final exam. It requires continuous preparation, dedication, and practice. Trial and error is key. You must try coding by yourself. Most likely your code will not work at first, but you will also learn how to find help and solve coding problems when you are stuck.
There are three sessions of 4 hours. The first session covers sections 0 to 2 (November 4, 2021). The second session covers section 3 (November 18, 2021). The third session covers sections 4 to 5 (November 25, 2021). Section 6 provides details regarding the group project expected for this course.
Computers have their own language called machine code, which tells them what to do. A programming language provides an interface between a programmer and the machine language. When you write code in the R programming language, you write instructions that tell the machine what to do. R is an interpreted language: the R interpreter reads your code and executes it directly, without a separate compilation step.
R is a programming language and free software. It offers an extensive catalog of statistical and graphical methods, including machine learning algorithms, linear regression, time series, and statistical inference, to name a few. R is one of the dominant languages for data analysis in the business and finance industries.
To code with R, we will use RStudio. RStudio is an integrated development environment for R, with a console, a syntax-highlighting editor that supports direct code execution, and tools for plotting, history, debugging, and workspace management. RStudio should already be installed on all the machines; if not, please contact the IT service. For more information about RStudio please visit this link. Note that you can easily install R and RStudio on your own device; to do so, please check this link.
A great thing about R is that many users have already created so-called packages, which are collections of functions you can use to perform specific tasks. For instance, if you want to create a plot, you do not have to write a function from scratch: you can use a package, in this case ggplot2, which already has functions designed to create nice plots. The only thing you have to learn is which arguments these functions require, i.e., how to call them properly.
Below is a simple illustration of what can be achieved with a few lines of R code. We want to generate random numbers and then create a histogram of them. As you will see, some lines start with a #. These are comments, there to help you understand what the code is expected to do (R ignores these lines when running the code).
library(ggplot2)
#create a data frame with 50 rows and 1 column
df <- data.frame(matrix(ncol = 1, nrow = 50))
#provide the column name
colnames(df) <- c('random_number')
#populate the column with random numbers
df$random_number <- runif(n = 50, min = 1, max = 10)
#plot a histogram of the random numbers with ggplot
ggplot(data = df, aes(x = random_number)) +
  geom_histogram(binwidth = 0.5, color = "black", fill = "white") +
  labs(title = "Random numbers histogram plot", x = "Random numbers", y = "Count") +
  theme_dark()
Note that, on this website, you can see the output of the R code, it appears below the code. However, it is strongly recommended to also run it on your own machine, using RStudio, because it is on your machine that you can modify the code.
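Because runif() draws new random numbers at each run, the histogram you obtain on your machine will differ from the one shown on this page. As a small sketch (the seed value 42 is an arbitrary choice and is not part of the example above), fixing the seed with set.seed() makes the draws reproducible:

```r
# Fix the seed of the random number generator (42 is an arbitrary choice)
set.seed(42)
first_draw <- runif(n = 5, min = 1, max = 10)

# Resetting the same seed reproduces exactly the same numbers
set.seed(42)
second_draw <- runif(n = 5, min = 1, max = 10)

# TRUE: both draws are identical
print(identical(first_draw, second_draw))
```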
First, let us consider the interface of RStudio. As indicated here, there are four main windows. The script editor is where you type the code. By clicking the run button you can run the code. You can also execute R commands straight in the console window (bottom-left). On the top-right, there is the environment window where all the variables and datasets we create are listed. On the bottom-right, we can navigate files, consult the help, show plots, and explore the list of R packages installed on the machine.
Let us open RStudio and have a first go at programming with R. We want the machine to print “Hello”, that is, we want to see “Hello” in the RStudio console. To do so, we open RStudio, create an empty R script, write the code print("Hello") in the script editor, and then click the “Run” button. We should see the expected outcome in the console window:
print("Hello")
## [1] "Hello"
What we have done is call R's built-in function print(). To learn how to use this function, and to find such a function in the first place, we can consult the RStudio help (bottom-right window). We can search for print() there; once we find it, the help tells us what the function does, how to use it, and which arguments it requires.
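The help can also be reached from code. The sketch below is only one possible way to do it: the two commented commands open a help page when typed in the console, and formals() lists the arguments a function accepts (paste() is used here purely as an illustration).

```r
# In the RStudio console, either of these opens the help page of print():
#   ?print
#   help("print")

# Without opening the help, we can list the arguments a function accepts
paste_arguments <- names(formals(paste))
print(paste_arguments)
```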
A basic concept in programming is called a variable. A variable allows us to store a value (e.g., 5 or “soccer”) or an object (such as a function). We can then later use this variable’s name to easily access the value or the object that is stored within this variable. To assign a value to a variable, we use <-. Consider the below examples:
my_variable <- 5
print(my_variable)
## [1] 5
my_variable <- "soccer"
print(my_variable)
## [1] "soccer"
At any time, you can save your script in RStudio: open the File menu, choose Save As, then select the folder and type the name of the file. This will create an R file: xxx.R.
To print two variables together, we can use the built-in function paste(), which concatenates its arguments after converting them to characters. The paste() function takes the elements we want to combine and, optionally, the character string used to separate them (the argument sep, a single space by default). See the example below:
my_variable_A <- 5
my_variable_B <- "soccer"
variables_A_and_B <- paste(my_variable_A, my_variable_B, sep=" ")
print(variables_A_and_B)
## [1] "5 soccer"
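To illustrate the role of the separator, here is a short sketch with arbitrary strings. It relies on the default value of sep (a single space) and on paste0(), a built-in variant of paste() that uses no separator at all:

```r
# Default separator: a single space
label_default <- paste("Shiny", "dashboard")
print(label_default)   # "Shiny dashboard"

# Custom separator
label_dash <- paste("Shiny", "dashboard", sep = "-")
print(label_dash)      # "Shiny-dashboard"

# paste0() concatenates without any separator
label_none <- paste0("Shiny", "dashboard")
print(label_none)      # "Shinydashboard"
```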
R has several basic data types:
#Decimal values like 4.5 are called numerics (they are stored as floating-point numbers).
dec <- 4.5
#Whole numbers like 9 are also stored as numerics by default. To store a true integer, add the suffix L (e.g., 9L).
whole <- 9
#Boolean values (TRUE or FALSE) are called logical.
my_value <- TRUE
#Text (or string) values are called characters.
my_string <- "This is a string"
We can check the type of a variable by using the function class(), for instance:
my_value <- TRUE
class(my_value)
## [1] "logical"
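The same function helps to see the difference between a whole number stored as a numeric and a true integer. The following sketch uses illustrative variable names and also shows how as.numeric() converts a character value into a number:

```r
whole_numeric <- 9    # stored as a numeric (double) by default
whole_integer <- 9L   # the suffix L requests an integer

print(class(whole_numeric))        # "numeric"
print(class(whole_integer))        # "integer"
print(is.numeric(whole_integer))   # TRUE: integers are numbers too

# Convert text to a number before doing arithmetic with it
converted <- as.numeric("4.5")
print(converted + 1)               # 5.5
```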
We can easily perform some arithmetic calculations with R.
my_variable1 <- 5+6
print(paste("The result of the addition is", my_variable1,sep=" "))
## [1] "The result of the addition is 11"
my_variable2 <- 5-6
print(paste("The result of the subtraction is", my_variable2,sep=" "))
## [1] "The result of the subtraction is -1"
my_variable3 <- 5*6
print(paste("The result of the multiplication is", my_variable3,sep=" "))
## [1] "The result of the multiplication is 30"
my_variable4 <- 5/6
print(paste("The result of the division is", my_variable4,sep=" "))
## [1] "The result of the division is 0.833333333333333"
#Return on equity (ROE) calculation
firm_earnings <- 150
firm_equity <- 1000
firm_ROE <- firm_earnings/firm_equity
print(firm_ROE)
## [1] 0.15
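Beyond the four basic operations, R also provides operators for powers, remainders, and integer division. A short sketch with arbitrary numbers:

```r
power <- 5^2            # exponentiation
remainder <- 17 %% 5    # remainder of the division of 17 by 5
quotient <- 17 %/% 5    # integer division of 17 by 5

print(power)       # 25
print(remainder)   # 2
print(quotient)    # 3
```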
Relational operators (comparators such as ==, !=, >, <, >=, <=) help us see how one R object relates (compares) to another (are they equal or unequal, for instance). The comparison returns TRUE or FALSE:
#equality ==
print (1==1)
## [1] TRUE
#equality == (comparing two variables)
var1 <- 1
var2 <- 2
print(var1 == var2)
## [1] FALSE
#inequality !=
print (1!=1)
## [1] FALSE
#inequality !=
print (1!=2)
## [1] TRUE
#less than <
print(1 < 3)
## [1] TRUE
#greater than >
print(1 > 0)
## [1] TRUE
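One caveat when comparing decimal numbers with ==: they are stored with finite precision, so two values that look equal may differ by a tiny amount. A minimal sketch of the issue, using the built-in function all.equal(), which compares numbers up to a small tolerance:

```r
# Exact comparison of decimal numbers can be surprising
exact_comparison <- (0.1 + 0.2 == 0.3)
print(exact_comparison)      # FALSE

# all.equal() tolerates tiny rounding differences
tolerant_comparison <- isTRUE(all.equal(0.1 + 0.2, 0.3))
print(tolerant_comparison)   # TRUE
```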
A logical operator is a symbol or word used to combine two or more logical expressions (AND, OR, etc.):
#AND(&)
print (15 == 15 & 15> 13)
## [1] TRUE
#AND(&)
print (15 == 15 & 15> 16)
## [1] FALSE
#OR(|)
print (15 == 15 | "A"=="A")
## [1] TRUE
#OR(|)
print (15 == 17 | "A"=="A")
## [1] TRUE
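Besides AND and OR, R has a NOT operator, written !, which turns TRUE into FALSE and vice versa. A short sketch (the variable names and values are illustrative):

```r
num_views <- 12

# TRUE when the number of views is above 10 and at most 15
in_middle_range <- num_views > 10 & num_views <= 15
print(in_middle_range)    # TRUE

# NOT (!) reverses a logical value
print(!in_middle_range)   # FALSE
```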
A conditional statement gives an algorithm the ability to make a decision (for instance, if it rains, then show me the TV program). Below is an if statement stating that if the number of views (Instagram views, for instance) exceeds 15, then the user can be considered popular:
num_views <- 16
if (num_views > 15) {
  print("You are popular!")
}
## [1] "You are popular!"
We can add an else statement to explicitly consider the case where the number of views does not exceed 15:
num_views <- 10
if (num_views > 15) {
  print("You are popular!")
} else {
  print("You are not popular!")
}
## [1] "You are not popular!"
If we want to consider several alternatives, we can use else if statements:
num_views <- 16
if (num_views <= 10) {
  print("You are really not popular!")
} else if (num_views > 10 & num_views <= 15) {
  print("You are not popular!")
} else if (num_views > 15 & num_views <= 20) {
  print("You are popular!")
} else if (num_views > 20) {
  print("You are really popular!")
}
## [1] "You are popular!"
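Since the conditions of an if / else if chain are tested in order and the first one that holds wins, the lower bounds can be dropped. The sketch below is one possible rewriting of the same logic; storing the result in a variable (here called label, an illustrative name) rather than printing inside each branch is a design choice, not a requirement:

```r
num_views <- 16

# Each branch is reached only if all the previous conditions were FALSE
if (num_views <= 10) {
  label <- "You are really not popular!"
} else if (num_views <= 15) {
  label <- "You are not popular!"
} else if (num_views <= 20) {
  label <- "You are popular!"
} else {
  label <- "You are really popular!"
}

print(label)   # "You are popular!"
```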
With loops, we can have the algorithm repeat an action while a condition holds. In the example below, we use a while loop that reduces the value of the variable speed as long as the condition (speed > 30) is met:
speed <- 64
while (speed > 30) {
  print(paste("Slow down! Your speed is:", speed, sep=" "))
  speed <- speed - 7
}
## [1] "Slow down! Your speed is: 64"
## [1] "Slow down! Your speed is: 57"
## [1] "Slow down! Your speed is: 50"
## [1] "Slow down! Your speed is: 43"
## [1] "Slow down! Your speed is: 36"
We can combine a while loop with a break. In the example below, we use a while loop that increases speed as long as speed is greater than 30. Since speed only increases, the condition would always hold and the loop would never end on its own. We therefore add a break command so that, when speed exceeds 80, we exit the loop and stop the algorithm.
speed <- 35
while (speed > 30) {
  speed <- speed + 3
  print(paste("Your speed is", speed))
  # Break the while loop when speed exceeds 80
  if (speed > 80) {
    print("speed is too high, we stop and break the loop")
    break
  }
}
## [1] "Your speed is 38"
## [1] "Your speed is 41"
## [1] "Your speed is 44"
## [1] "Your speed is 47"
## [1] "Your speed is 50"
## [1] "Your speed is 53"
## [1] "Your speed is 56"
## [1] "Your speed is 59"
## [1] "Your speed is 62"
## [1] "Your speed is 65"
## [1] "Your speed is 68"
## [1] "Your speed is 71"
## [1] "Your speed is 74"
## [1] "Your speed is 77"
## [1] "Your speed is 80"
## [1] "Your speed is 83"
## [1] "speed is too high, we stop and break the loop"
There is also a for loop, which performs an action for each value of a sequence of values. For instance, below, we create a vector of prime numbers, and we then use a for loop to print each prime number in that vector. Note that there are several ways to make that happen. We are going to see how to create vectors very shortly.
primes <- c(2, 3, 5, 7, 11, 13)
for (p in primes) {
  print(p)
}
## [1] 2
## [1] 3
## [1] 5
## [1] 7
## [1] 11
## [1] 13
for (i in 1:length(primes)) {
  print(primes[i])
}
## [1] 2
## [1] 3
## [1] 5
## [1] 7
## [1] 11
## [1] 13
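A loop can also accumulate a result across iterations. The sketch below sums the primes by hand. It uses seq_along(primes) instead of 1:length(primes), a built-in alternative that also behaves correctly when the vector is empty (the variable name total is illustrative):

```r
primes <- c(2, 3, 5, 7, 11, 13)

# Start from zero and add each element in turn
total <- 0
for (i in seq_along(primes)) {
  total <- total + primes[i]
}

print(total)   # 41
```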
Vectors are one-dimensional arrays that can hold numeric data, character data, or logical data. A vector is a simple tool to store data. For example, you can store your daily gains and losses in casinos. To create a vector containing A, B, and C, we use c(A, B, C). Informally, such vectors are sometimes referred to as lists, but note that, in R, a list is a distinct data structure (created with list()) that can hold elements of different types.
numeric_vector <- c(1, 2, 3)
print(numeric_vector)
## [1] 1 2 3
character_vector <- c("a", "b", "c", "d")
print(character_vector)
## [1] "a" "b" "c" "d"
It is important to have a clear view of the data that you are using. Understanding what each element refers to is therefore essential. To help, we can name the elements of a vector.
character_vector <- c("a", "b", "c", "d")
names(character_vector)<- c("First letter", "Second letter", "Third Letter", "Fourth Letter")
print(character_vector)
## First letter Second letter Third Letter Fourth Letter
## "a" "b" "c" "d"
We can also perform calculations with vectors:
numeric_vectorA <- c(1, 2, 3)
numeric_vectorB <- c(1, 2, 3)
vector_A_plus_B<-numeric_vectorA+numeric_vectorB
print(vector_A_plus_B)
## [1] 2 4 6
We can use the built-in R function sum() to compute the sum of all the values in a vector. This function returns the value of the sum.
numeric_vectorA <- c(1, 2, 3)
sum_element_vector_A <- sum(numeric_vectorA)
print(sum_element_vector_A)
## [1] 6
By the same token, we can compute the mean value of the elements of a vector, by using the built-in R function mean().
numeric_vectorA <- c(1, 2, 3)
mean_element_vector_A <- mean(numeric_vectorA)
print(mean_element_vector_A)
## [1] 2
We can also select only some elements of a vector. To do so, we use square brackets and refer to the position of the elements in the vector (in R, positions start at 1).
character_vectorX <- c("X", "XX", "ZZZ")
character_vectorX23 <- character_vectorX[2:3]
print(character_vectorX23)
## [1] "XX" "ZZZ"
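Positions are not the only way to select elements. The following sketch, on a small named vector invented for the illustration, shows selection by several positions, by exclusion (negative index), by name, and by condition:

```r
scores <- c(first = 10, second = 20, third = 30)

first_and_third <- scores[c(1, 3)]   # positions 1 and 3
all_but_first <- scores[-1]          # everything except position 1
second_only <- scores["second"]      # selection by name
above_15 <- scores[scores > 15]      # elements meeting a condition

print(first_and_third)
print(all_but_first)
print(second_only)
print(above_15)
```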
We can go through the elements of a vector using a loop, as we have just seen. In the example below, we create a vector linkedin that stores the number of views of different users. We then name each element of the vector after the corresponding user. First, we check, for each user, whether the number of views is greater than 5:
#create the vector of number of views
linkedin<-c(10,20,5,0,10,20,30,54)
#create the vector of names (of users) corresponding to the number of views
names(linkedin)<- c("Jean Paul", "Mark", "Helmut", "Brigitte", "Larz", "Claire", "Toto", "Cici")
#test, for each element of the vector, whether the number of views is > 5
print(linkedin > 5)
## Jean Paul Mark Helmut Brigitte Larz Claire Toto Cici
## TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE
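The comparison above returns TRUE or FALSE for each user. To obtain the names of the users themselves, one possible approach is to use that logical vector to select from names(linkedin), as in this sketch (popular_users is an illustrative name):

```r
linkedin <- c(10, 20, 5, 0, 10, 20, 30, 54)
names(linkedin) <- c("Jean Paul", "Mark", "Helmut", "Brigitte",
                     "Larz", "Claire", "Toto", "Cici")

# Keep only the names for which the condition is TRUE
popular_users <- names(linkedin)[linkedin > 5]
print(popular_users)
```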
Second, for each element (number of views) of the vector, we want to check whether the user can be considered popular, that is, in this case, whether the number of views is greater than 10. This time we use a loop:
linkedin <- c(10, 20, 5, 0, 10, 20, 30, 54)
names(linkedin) <- c("Jean Paul", "Mark", "Helmut", "Brigitte", "Larz", "Claire", "Toto", "Cici")
#loop through all the elements of the vector linkedin
for (i in 1:length(linkedin)) {
  print(linkedin[i])
  if (linkedin[i] > 10) {
    print("You're popular!")
  } else {
    print("Be more visible!")
  }
}
## Jean Paul
## 10
## [1] "Be more visible!"
## Mark
## 20
## [1] "You're popular!"
## Helmut
## 5
## [1] "Be more visible!"
## Brigitte
## 0
## [1] "Be more visible!"
## Larz
## 10
## [1] "Be more visible!"
## Claire
## 20
## [1] "You're popular!"
## Toto
## 30
## [1] "You're popular!"
## Cici
## 54
## [1] "You're popular!"
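The same classification can be obtained without a loop. As a sketch of an alternative (not the approach used above), the built-in function ifelse() applies the test to every element of the vector at once and returns one message per user:

```r
linkedin <- c(10, 20, 5, 0, 10, 20, 30, 54)
names(linkedin) <- c("Jean Paul", "Mark", "Helmut", "Brigitte",
                     "Larz", "Claire", "Toto", "Cici")

# ifelse(test, value_if_true, value_if_false) works element by element
messages <- ifelse(linkedin > 10, "You're popular!", "Be more visible!")
print(messages)
```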
A matrix is a collection of elements of the same data type (numeric, character, or logical) arranged into a fixed number of rows and columns. Assume we want to analyze the box office numbers of the Star Wars franchise. We store the revenues of each movie in a vector: the first element is the US box office revenue and the second element is the non-US box office revenue.
#We create three vectors of revenues.
#In each vector, the first number corresponds to the revenue for the US box office, the second number the revenue for the non-US box office.
new_hope <- c(461,314)
empire_strikes <- c(290, 248)
return_jedi <- c(309, 165)
To create a matrix, we call the built-in function matrix() and indicate the number of rows we want. We also set the option byrow to TRUE to fill the matrix by row. That is, the vector new_hope will become the first row of the matrix, the vector empire_strikes the second row, and so on. If we want to fill a matrix by column, we must set byrow to FALSE (the default).
# we create a matrix out of the three vectors we already created
star_wars_matrix <- matrix(c(new_hope, empire_strikes, return_jedi), nrow = 3, byrow = TRUE)
print(star_wars_matrix)
## [,1] [,2]
## [1,] 461 314
## [2,] 290 248
## [3,] 309 165
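To make the effect of byrow concrete, here is a small sketch that fills two matrices with the same values 1 to 6 (arbitrary numbers chosen so that the filling order is easy to read):

```r
# Filled row by row: the first row is 1 2, the second 3 4, the third 5 6
by_row <- matrix(1:6, nrow = 3, byrow = TRUE)
print(by_row)

# Filled column by column: the first column is 1 2 3, the second 4 5 6
by_col <- matrix(1:6, nrow = 3, byrow = FALSE)
print(by_col)
```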
We then create vectors of regions and titles, used to name the matrix’s columns and rows. It helps to better see what the numbers are referring to.
regions <- c("US", "non-US")
titles <- c("A New Hope", "The Empire Strikes Back", "Return of the Jedi")
colnames(star_wars_matrix)<-regions
rownames(star_wars_matrix)<-titles
print(star_wars_matrix)
## US non-US
## A New Hope 461 314
## The Empire Strikes Back 290 248
## Return of the Jedi 309 165
Alternatively, we can construct the matrix with the right names for columns and rows from the beginning:
star_wars_matrix <- matrix(c(new_hope, empire_strikes, return_jedi),
nrow = 3, byrow = TRUE,
dimnames = list(titles, regions))
print(star_wars_matrix)
## US non-US
## A New Hope 461 314
## The Empire Strikes Back 290 248
## Return of the Jedi 309 165
We can perform standard operations on the elements of the matrix, like summing up the values stored in rows or/and columns. Below, we compute the total revenue for US and non-US. To do so, we call the built-in function colSums(), which computes the sum of each column’s values.
total_rev_region=colSums(star_wars_matrix)
print(total_rev_region)
## US non-US
## 1060 727
We then compute the total revenue per movie. To do so, we call the built-in function rowSums(), which computes the sum of each row’s values:
total_rev_mov=rowSums(star_wars_matrix)
print(total_rev_mov)
## A New Hope The Empire Strikes Back Return of the Jedi
## 775 538 474
We finally compute the sum of the revenues for all movies and regions, using the function sum():
total_rev=sum(star_wars_matrix)
print(total_rev)
## [1] 1787
To add an extra row or column to the matrix, we use the function rbind() or cbind(), respectively. To add an extra column summing US and non-US revenues for each movie, we bind this column with the rest of the matrix by calling the function cbind().
final_matrix_starwars<-cbind(star_wars_matrix,total_rev_mov)
print(final_matrix_starwars)
## US non-US total_rev_mov
## A New Hope 461 314 775
## The Empire Strikes Back 290 248 538
## Return of the Jedi 309 165 474
To add the extra row summing revenues for each region, we bind this row with the rest of the matrix, by calling the built-in function rbind().
final_matrix_starwars<-rbind(star_wars_matrix,total_rev_region)
print(final_matrix_starwars)
## US non-US
## A New Hope 461 314
## The Empire Strikes Back 290 248
## Return of the Jedi 309 165
## total_rev_region 1060 727
Similar to vectors, we can use square brackets to select one or multiple elements from a matrix. We use a comma to separate the rows we want to select from the columns. The general syntax is:
my_matrix[row_we_want , column_we_want]
To select non-US revenues only, we use [,2], that is all the rows of the original matrix but only the column number two:
print(star_wars_matrix[,2])
## A New Hope The Empire Strikes Back Return of the Jedi
## 314 248 165
To select revenues in both regions for the second movie only (The Empire Strikes Back), we use [2,]:
print(star_wars_matrix[2,])
## US non-US
## 290 248
To select US revenues for the first two movies only, we use [1:2,1]:
print(star_wars_matrix[1:2,1])
## A New Hope The Empire Strikes Back
## 461 290
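When a matrix has row and column names, these names can be used inside the square brackets instead of positions, which often makes the code easier to read. A self-contained sketch that rebuilds the same matrix (us_new_hope and non_us_all are illustrative names):

```r
star_wars_matrix <- matrix(
  c(461, 314, 290, 248, 309, 165),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    c("A New Hope", "The Empire Strikes Back", "Return of the Jedi"),
    c("US", "non-US")
  )
)

# One cell, selected by row name and column name
us_new_hope <- star_wars_matrix["A New Hope", "US"]
print(us_new_hope)   # 461

# A whole column, selected by its name
non_us_all <- star_wars_matrix[, "non-US"]
print(non_us_all)
```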
The standard operators like +, -, /, * work in an element-wise way on matrices. We can divide by 10 all the values stored in the star wars matrix for instance:
print(star_wars_matrix/10)
## US non-US
## A New Hope 46.1 31.4
## The Empire Strikes Back 29.0 24.8
## Return of the Jedi 30.9 16.5
We can also multiply two matrices element by element. Let us assume that we have two matrices, one with the ticket prices for each movie and region and another one with the number of visitors for each movie and region. Multiplying them with * gives the revenue per region of each movie (note that * works element by element; the matrix product in the linear algebra sense is written %*% in R):
# We first create the ticket price matrix
new_hope <- c(5.0, 5.0)
empire_strikes <- c(6.0, 6.0)
return_jedi <- c(7.0, 7.0)
star_wars_matrix_ticket <- matrix(c(new_hope, empire_strikes, return_jedi),
nrow = 3, byrow = TRUE,
dimnames = list(titles, regions))
print(star_wars_matrix_ticket)
## US non-US
## A New Hope 5 5
## The Empire Strikes Back 6 6
## Return of the Jedi 7 7
# We then create the number of visitors matrix
new_hope <- c(1000000, 2000000)
empire_strikes <- c(15000000, 5260000)
return_jedi <- c(42100000, 7000000)
star_wars_matrix_vis <- matrix(c(new_hope, empire_strikes, return_jedi),
nrow = 3, byrow = TRUE,
dimnames = list(titles, regions))
print(star_wars_matrix_vis)
## US non-US
## A New Hope 1000000 2000000
## The Empire Strikes Back 15000000 5260000
## Return of the Jedi 42100000 7000000
# Then, we multiply the matrices element by element
star_wars_matrix_rev<-star_wars_matrix_ticket*star_wars_matrix_vis
print(star_wars_matrix_rev)
## US non-US
## A New Hope 5000000 10000000
## The Empire Strikes Back 90000000 31560000
## Return of the Jedi 294700000 49000000
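The operator * used above multiplies matrices element by element. It should not be confused with %*%, which computes the matrix product of linear algebra. A minimal sketch on two small matrices invented for the illustration:

```r
A <- matrix(1:4, nrow = 2)             # rows: (1, 3) and (2, 4)
B <- matrix(c(2, 0, 0, 2), nrow = 2)   # rows: (2, 0) and (0, 2)

# Element-wise product: each cell of A times the matching cell of B
elementwise <- A * B
print(elementwise)      # rows: (2, 0) and (0, 8)

# Matrix product: rows of A combined with columns of B
matrix_product <- A %*% B
print(matrix_product)   # rows: (2, 6) and (4, 8)
```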
Data frames are key for any data analysis. A data frame has the variables of a dataset as columns and the observations as rows. The advantage of a data frame over a matrix is that whereas all the elements of a matrix must have the same type, only the elements within a column are required to have the same type for a data frame. Different columns can be of different data types. We can mix strings, booleans, and decimal numbers, for instance. Most of the functions we have seen for matrices apply to data frames too.
For illustration purposes, we will work with a data frame that is built into R: mtcars. The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models). For more information about the mtcars dataset please refer to this link. Let us first print the data frame mtcars to see what it looks like:
print(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
It is often useful to show only a small part of the dataset to see its structure. This is what the head() function does (by default, it shows the first six rows):
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Another method that is often used to get a rapid overview of the data in a data frame is the function str(). It shows you the structure of the data set:
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
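A few other built-in functions give a quick overview of a data frame. The sketch below applies dim(), names(), and summary() to mtcars; which of them is most useful depends on what you want to check:

```r
# Number of rows and columns
mtcars_dimensions <- dim(mtcars)
print(mtcars_dimensions)   # 32 11

# Names of the variables (columns)
mtcars_variables <- names(mtcars)
print(mtcars_variables)

# Minimum, quartiles, mean, and maximum of one variable
print(summary(mtcars$mpg))
```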
We now cover how to create a data frame from scratch. We want to create a data frame of planets, giving some information for each planet of the dataset. We will create a data frame out of vectors. We define a series of vectors and then combine them into a data frame:
# We define the vectors of planet names and planet types
name <- c("Mercury", "Venus", "Earth",
"Mars", "Jupiter", "Saturn",
"Uranus", "Neptune")
type <- c("Terrestrial planet",
"Terrestrial planet",
"Terrestrial planet",
"Terrestrial planet", "Gas giant",
"Gas giant", "Gas giant", "Gas giant")
# Importantly, we define the vector type as a factor, that is as a vector taking a small number of distinct values. A factor variable is a variable used to categorize and store the data.
type<-factor(type)
diameter <- c(0.382, 0.949, 1, 0.532,
11.209, 9.449, 4.007, 3.883)
rotation <- c(58.64, -243.02, 1, 1.03,
0.41, 0.43, -0.72, 0.67)
# We create a data frame from the vectors
planets_df <- data.frame(name, type, diameter, rotation)
We explore our data frame:
head(planets_df)
## name type diameter rotation
## 1 Mercury Terrestrial planet 0.382 58.64
## 2 Venus Terrestrial planet 0.949 -243.02
## 3 Earth Terrestrial planet 1.000 1.00
## 4 Mars Terrestrial planet 0.532 1.03
## 5 Jupiter Gas giant 11.209 0.41
## 6 Saturn Gas giant 9.449 0.43
str(planets_df)
## 'data.frame': 8 obs. of 4 variables:
## $ name : chr "Mercury" "Venus" "Earth" "Mars" ...
## $ type : Factor w/ 2 levels "Gas giant","Terrestrial planet": 2 2 2 2 1 1 1 1
## $ diameter: num 0.382 0.949 1 0.532 11.209 ...
## $ rotation: num 58.64 -243.02 1 1.03 0.41 ...
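The str() output shows the factor type as a set of levels plus integer codes. The following sketch, on a shortened vector of three planet types, illustrates what that means. The levels are sorted alphabetically by default, which is why "Gas giant" comes first:

```r
type <- factor(c("Terrestrial planet", "Gas giant", "Gas giant"))

# The distinct categories, in alphabetical order
type_levels <- levels(type)
print(type_levels)   # "Gas giant" "Terrestrial planet"

# Each value is stored as the position of its level
type_codes <- as.integer(type)
print(type_codes)    # 2 1 1

# Number of observations per category
print(table(type))
```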
We can add a column or a row using the functions cbind() and rbind(), respectively. For instance, to add a column about planet rings:
rings_vector<-c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)
planets_df<-cbind(planets_df,rings_vector)
print(planets_df)
## name type diameter rotation rings_vector
## 1 Mercury Terrestrial planet 0.382 58.64 FALSE
## 2 Venus Terrestrial planet 0.949 -243.02 FALSE
## 3 Earth Terrestrial planet 1.000 1.00 FALSE
## 4 Mars Terrestrial planet 0.532 1.03 FALSE
## 5 Jupiter Gas giant 11.209 0.41 TRUE
## 6 Saturn Gas giant 9.449 0.43 TRUE
## 7 Uranus Gas giant 4.007 -0.72 TRUE
## 8 Neptune Gas giant 3.883 0.67 TRUE
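A column can also be added by assigning to a new name with the $ operator, which lets us choose the column name directly. A sketch on a small two-row data frame built for the illustration (has_rings and larger_than_earth are hypothetical column names):

```r
small_df <- data.frame(name = c("Earth", "Saturn"),
                       diameter = c(1, 9.449))

# Create a new column from a vector
small_df$has_rings <- c(FALSE, TRUE)

# Create a new column computed from an existing one
small_df$larger_than_earth <- small_df$diameter > 1

print(small_df)
```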
As for vectors and matrices, we can select some elements from a data frame with the help of square brackets. For instance, print out the diameter of Mercury (row 1, column 3):
print(planets_df[1,3])
## [1] 0.382
Print out data for Mars (entire fourth row):
print(planets_df[4,])
## name type diameter rotation rings_vector
## 4 Mars Terrestrial planet 0.532 1.03 FALSE
To select a specific column, we can also use its name:
print(planets_df[1:3,"type"])
## [1] Terrestrial planet Terrestrial planet Terrestrial planet
## Levels: Gas giant Terrestrial planet
We can also put a restriction on the rows of the data frame we select. For instance, below we show the data frame for planets that have a diameter greater than 1 (and select all the columns):
print(planets_df[planets_df$diameter>1,])
## name type diameter rotation rings_vector
## 5 Jupiter Gas giant 11.209 0.41 TRUE
## 6 Saturn Gas giant 9.449 0.43 TRUE
## 7 Uranus Gas giant 4.007 -0.72 TRUE
## 8 Neptune Gas giant 3.883 0.67 TRUE
Show planet names with rings:
planets_df[planets_df$rings_vector==TRUE, "name"]
## [1] "Jupiter" "Saturn" "Uranus" "Neptune"
Notice that R has a built-in function to select a subset of a data frame called subset(). See R help for the required arguments and options. For instance, we show below the data only for the planets that have rings:
subset_df<-subset(planets_df, subset = planets_df$rings_vector==TRUE)
print(subset_df)
## name type diameter rotation rings_vector
## 5 Jupiter Gas giant 11.209 0.41 TRUE
## 6 Saturn Gas giant 9.449 0.43 TRUE
## 7 Uranus Gas giant 4.007 -0.72 TRUE
## 8 Neptune Gas giant 3.883 0.67 TRUE
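Inside subset(), columns can be referred to by name directly, several conditions can be combined, and the argument select keeps only some columns. The sketch below rebuilds a reduced version of the planets data frame to stay self-contained (large_positive_rotation is an illustrative name):

```r
planets_df <- data.frame(
  name = c("Mercury", "Venus", "Earth", "Mars",
           "Jupiter", "Saturn", "Uranus", "Neptune"),
  diameter = c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883),
  rotation = c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67)
)

# Planets with a diameter above 1 and a positive rotation value,
# keeping only the columns name and diameter
large_positive_rotation <- subset(planets_df,
                                  subset = diameter > 1 & rotation > 0,
                                  select = c(name, diameter))
print(large_positive_rotation)
```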
We can order the rows of a data frame by the value of a variable. To do so, we use the built-in function order(). For instance, in the example below, we sort planets by diameter.
position<-order(planets_df$diameter)
planets_df[position,]
## name type diameter rotation rings_vector
## 1 Mercury Terrestrial planet 0.382 58.64 FALSE
## 4 Mars Terrestrial planet 0.532 1.03 FALSE
## 2 Venus Terrestrial planet 0.949 -243.02 FALSE
## 3 Earth Terrestrial planet 1.000 1.00 FALSE
## 8 Neptune Gas giant 3.883 0.67 TRUE
## 7 Uranus Gas giant 4.007 -0.72 TRUE
## 6 Saturn Gas giant 9.449 0.43 TRUE
## 5 Jupiter Gas giant 11.209 0.41 TRUE
The order is increasing by default, but we can also specify a decreasing order:
position<-order(planets_df$diameter, decreasing=TRUE)
planets_df[position,]
## name type diameter rotation rings_vector
## 5 Jupiter Gas giant 11.209 0.41 TRUE
## 6 Saturn Gas giant 9.449 0.43 TRUE
## 7 Uranus Gas giant 4.007 -0.72 TRUE
## 8 Neptune Gas giant 3.883 0.67 TRUE
## 3 Earth Terrestrial planet 1.000 1.00 FALSE
## 2 Venus Terrestrial planet 0.949 -243.02 FALSE
## 4 Mars Terrestrial planet 0.532 1.03 FALSE
## 1 Mercury Terrestrial planet 0.382 58.64 FALSE
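order() also accepts several sorting keys: ties on the first key are broken by the second. As a sketch, the code below sorts a reduced planets data frame by type and then, within each type, by decreasing diameter (the minus sign in front of a numeric variable reverses its order):

```r
planets_df <- data.frame(
  name = c("Mercury", "Venus", "Earth", "Mars",
           "Jupiter", "Saturn", "Uranus", "Neptune"),
  # rep() repeats a value a given number of times
  type = factor(c(rep("Terrestrial planet", 4), rep("Gas giant", 4))),
  diameter = c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883)
)

# First key: type (factor levels in alphabetical order)
# Second key: diameter, from the largest to the smallest
position <- order(planets_df$type, -planets_df$diameter)
sorted_df <- planets_df[position, ]
print(sorted_df)
```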
Note that, at any time, to see the whole content of a data frame, we can go to the top-right window in RStudio and select the Environment tab. If we double click on the name of our data frame there, RStudio will open a spreadsheet-like window in which we can see the whole content of the data frame (all the rows and columns).
A function is a set of statements organized together to perform a specific task. So far, we have relied on pre-existing functions (built in R already), but we can also create our own ones. Let us create a function that checks whether a letter is contained in a string (a word).
To create a function we need to follow a specific syntax. We must name the function and indicate the arguments (variables) it receives between the parentheses, in our case:
my_function <- function(word, letter)
We must also indicate what the function does within curly brackets. The function checks whether the word provided by the user has the desired letter in it and lets the user know by printing the information. We rely on the built-in function grepl() to know whether the letter is present in the word. For more information, refer to the help on grepl() in R or here.
my_function <- function(word,letter) {
test<-grepl(letter, word)
if (test == TRUE) {
print(paste("The word",word, "has the letter",letter,"in it.",sep=" "))
}
else
{
print(paste("The word",word, "does not have the letter",letter,"in it.",sep=" "))
}
}
We now call our function to verify it works as expected:
my_function("alphabet","a")
## [1] "The word alphabet has the letter a in it."
my_function("barbecue","z")
## [1] "The word barbecue does not have the letter z in it."
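A possible variant, shown here only as a sketch, returns a logical value instead of printing a message, which makes the result reusable in other code. The name has_letter is hypothetical. The argument fixed = TRUE of grepl() asks for a literal match, since grepl() otherwise interprets its first argument as a regular expression:

```r
# Hypothetical variant: returns TRUE/FALSE instead of printing a message
has_letter <- function(word, letter) {
  # fixed = TRUE makes grepl() look for the literal character,
  # so special characters such as "." are not treated as patterns
  grepl(letter, word, fixed = TRUE)
}

has_letter("alphabet", "a")
has_letter("barbecue", ".")

# Without fixed = TRUE, "." means "any character" and matches every non-empty word
grepl(".", "barbecue")
```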
You are about to tackle a series of exercises. When coding on your own, you will make mistakes and get stuck; that is how one learns programming. When this happens, I recommend going through these steps:
General way to go: turn a problem into small logical steps and, for each step, do some research on the best way to achieve it with R. For instance, if we want a function that compares two strings to know whether they match, we search online for the right package or function, and then use R help to learn how to use it.
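To illustrate this way of working, here is a minimal sketch of the string comparison example, broken into two small steps. The function name same_string and the choice to ignore case and surrounding spaces are assumptions made for the illustration; only base R functions are used:

```r
# Step 1: normalise both strings (lower case, no surrounding spaces)
# Step 2: compare the cleaned strings with identical()
same_string <- function(a, b) {
  a_clean <- trimws(tolower(a))
  b_clean <- trimws(tolower(b))
  identical(a_clean, b_clean)
}

same_string("Shiny", " shiny ")
same_string("Shiny", "dplyr")
```

Each step corresponds to one function found through a search (tolower(), trimws(), identical()), whose help page can then be opened with ?tolower and so on.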
A sustainable and efficient way to learn coding is to develop the ability to find help on the internet and correct your mistakes by yourself. When you work on your own application or develop one for a company, you cannot expect someone to help you every time you are stuck. This ability is key for you to keep improving your programming skills once you know the basics and are done with this course.
To practice, please consider the following exercises. You should find your own way to solve the problems but a solution is provided for guidance - to see it you have to click the code button below each exercise.
You are going to work with the data frame mtcars. Recall that it presents, in a data frame, data extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models). You are asked to rank the automobiles by the variable mpg from the highest to the lowest. Then, find the names of the automobiles that have six cylinders or more (variable cyl). Finally, compute and print the mean value of each numeric variable of the data frame.
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Exercise 1 - Possible solution
#order the automobiles by mpg, highest values first
print(mtcars[order(-mtcars$mpg),])
#show automobiles with cylinder >= 6
print(mtcars[mtcars$cyl>=6,])
#compute mean value for each variable and print it
means_of_col <- colMeans(mtcars)
print(means_of_col)
Create a function that shows the result of a multiplication by 10 of the number you give as an argument.
Exercise 2 - Possible solution
my_function <- function(number_to_multiply) {
result <- number_to_multiply*10
print(result)
}
my_function(50)
Find the R function that converts all of the characters in a string to upper case and create a function that takes as argument a word and print it in uppercase letters.
Exercise 3 - Possible solution
my_function <- function(word) {
uppercase_word <- toupper(word)
print(uppercase_word)
}
my_function("test")
Create a function that prints 5 letters (randomly chosen) out of the word you give as argument.
Exercise 4 - Possible solution
my_function <- function(word) {
#we have done some research: we use the built-in function sample.int(). It draws five random integer positions between 1 and the length of the word we give to the function, each position being equally likely (replace=TRUE allows the same position to be drawn more than once).
vector_random_number <- sample.int(nchar(word), 5, replace=TRUE)
print("The five random letters out of the word provided are:")
for (i in 1:5) {
# we use the function substr() to extract the letter for the random position corresponding to the element i of the vector of random numbers.
letter_to_show <- substr(word,vector_random_number[i], vector_random_number[i])
print(letter_to_show)
}
}
my_function("abcdefghijklmnopqrstuvwxyz")
From now on, the functions we use are simply going to be named; you are then free to use R help or the Internet to find out what exactly these functions do and how to use them. It is a good habit to develop. Also, we are mostly going to work with data frames.
Let us assume we want to work on COVID data for France. We usually have two options. We can download the data first on our local computer, usually in .csv format and then import it to R, or download them directly with R from the Internet source.
In terms of standard data sources, when it comes to COVID data for France, but also more generally, we have plenty of choices, we can use for instance:
Whatever our starting point is, once we have the raw data, we have to store them in a data frame to be able to process, explore, and visualize them.
To do so, we are going to use packages created by R users. They include functions that are not built into R and that can be very useful to analyze and visualize the data of a data frame. We will see how to install and work with the following packages: dplyr, ggplot2, raster, and leaflet.
All the below pieces of code are available on Blackboard in the R files sections_31_37.R and section_38.R.
We go to data.gouv.fr and download data on COVID-related hospitalizations, intensive care admissions (réanimations in French), and deaths in France.
We want to download the file: “donnees-hospitalieres-covid19-2021-09-02-19h05.csv”. The link is the following.
Using R, we can download the file directly into a data frame and then explore it. We use the read.csv() function of R.
new_data_frame <- read.csv(file="https://www.data.gouv.fr/fr/datasets/r/63352e38-d353-4b54-bfd1-f1b3ee1cabd7", header=T, sep=";")
Alternatively, we can first download the file on our machine and rename it for instance, and then create a data frame using this csv file. This is a better practice:
new_data_frame <- read.csv("covid_data.csv", header=T, sep=";")
You can find the csv data file on Blackboard, under the name covid_data.csv.
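To experiment with read.csv() without the course file, the sketch below writes a tiny semicolon-separated file to a temporary location and reads it back. The file content is made up, and the use of colClasses to keep department codes as text is an optional choice for this illustration, not something the course code does:

```r
# Write a small semicolon-separated file to a temporary location
# (the content is made up for illustration)
tmp_file <- tempfile(fileext = ".csv")
writeLines(c("dep;sexe;jour;hosp",
             "01;0;18/03/2020;2",
             "2A;0;18/03/2020;5"), tmp_file)

# colClasses keeps the department code as text, so "01" is not turned into 1
toy_data <- read.csv(tmp_file, header = TRUE, sep = ";",
                     colClasses = c(dep = "character"))
str(toy_data)
```

Reading codes as text matters for identifiers such as "2A", which cannot be stored as numbers.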
Information about the variables of this dataset is available in the readme file attached to the data, available here. As you can read, the dataset includes, among others, the following variables:
Notice that the file reports cumulative sums at a given point in time. Hence, it is okay to sum values across departments, for instance, but it does not make sense to sum them across dates. We first check what the dataset looks like using the two functions introduced previously: head() and str():
head(new_data_frame)
## dep sexe jour hosp rea rad dc
## 1 1 0 18/03/2020 2 0 1 0
## 2 1 1 18/03/2020 1 0 1 0
## 3 1 2 18/03/2020 1 0 0 0
## 4 2 0 18/03/2020 41 10 18 11
## 5 2 1 18/03/2020 19 4 11 6
## 6 2 2 18/03/2020 22 6 7 5
str(new_data_frame)
## 'data.frame': 172630 obs. of 7 variables:
## $ dep : chr "1" "1" "1" "2" ...
## $ sexe: int 0 1 2 0 1 2 0 1 2 0 ...
## $ jour: chr "18/03/2020" "18/03/2020" "18/03/2020" "18/03/2020" ...
## $ hosp: int 2 1 1 41 19 22 4 1 3 3 ...
## $ rea : int 0 0 0 10 4 6 0 0 0 1 ...
## $ rad : int 1 1 0 18 11 7 1 0 1 2 ...
## $ dc : int 0 0 0 11 6 5 0 0 0 0 ...
We convert the text stored in the variable jour into a proper date. R only treats a value as a date once it has the class Date, which is what the as.Date() function creates; see the documentation here.
new_data_frame$date<-as.Date(new_data_frame$jour, tryFormats = c("%d/%m/%Y"))
head(new_data_frame$date)
## [1] "2020-03-18" "2020-03-18" "2020-03-18" "2020-03-18" "2020-03-18"
## [6] "2020-03-18"
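Once values have the class Date, they can be compared, subtracted, and reformatted. Here is a short sketch on two made-up dates written in the same day/month/year format as the dataset:

```r
# Convert day/month/year strings to Date objects
dates <- as.Date(c("18/03/2020", "04/10/2021"), format = "%d/%m/%Y")

# Dates can be compared, subtracted, and reformatted
dates[2] > dates[1]
as.numeric(dates[2] - dates[1])   # number of days between the two dates
format(dates, "%Y-%m")            # keep only the year and the month
```

This is what later allows us to filter on a date range and to use a date axis in plots.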
We drop the variables we are not interested in. To do so, we use the built-in function subset().
new_data_frame <- subset(new_data_frame, select = c (dep,sexe,hosp,rea,rad,dc,date))
head(new_data_frame)
## dep sexe hosp rea rad dc date
## 1 1 0 2 0 1 0 2020-03-18
## 2 1 1 1 0 1 0 2020-03-18
## 3 1 2 1 0 0 0 2020-03-18
## 4 2 0 41 10 18 11 2020-03-18
## 5 2 1 19 4 11 6 2020-03-18
## 6 2 2 22 6 7 5 2020-03-18
We change some column names so that they are more telling (the names of the variables of the data frame):
colnames(new_data_frame)<-c("department","gender","nb_hospitalizations","nb_reanimations","nb_returned_home","nb_deaths","date")
head(new_data_frame)
## department gender nb_hospitalizations nb_reanimations nb_returned_home
## 1 1 0 2 0 1
## 2 1 1 1 0 1
## 3 1 2 1 0 0
## 4 2 0 41 10 18
## 5 2 1 19 4 11
## 6 2 2 22 6 7
## nb_deaths date
## 1 0 2020-03-18
## 2 0 2020-03-18
## 3 0 2020-03-18
## 4 11 2020-03-18
## 5 6 2020-03-18
## 6 5 2020-03-18
We want to indicate that the gender variable is a factor variable, because we do not want to treat its values as integers (numerical values). Recall that a factor variable is a vector taking a small number of distinct values. A factor variable is a variable used to categorize and store the data. We know from the information file on data.gouv.fr that 0 stands for males & females, 1 stands for males only, and 2 for females only.
new_data_frame$gender<-factor(new_data_frame$gender,labels=c("males & females","males","females"))
head(new_data_frame)
## department gender nb_hospitalizations nb_reanimations
## 1 1 males & females 2 0
## 2 1 males 1 0
## 3 1 females 1 0
## 4 2 males & females 41 10
## 5 2 males 19 4
## 6 2 females 22 6
## nb_returned_home nb_deaths date
## 1 1 0 2020-03-18
## 2 1 0 2020-03-18
## 3 0 0 2020-03-18
## 4 18 11 2020-03-18
## 5 11 6 2020-03-18
## 6 7 5 2020-03-18
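The call above relies on the codes 0, 1, and 2 all being present, so that the labels are assigned in the right order. A slightly more defensive sketch, on a made-up vector of codes, states the levels explicitly:

```r
# Made-up vector of gender codes (0, 1, 2 as in the dataset)
codes <- c(0, 1, 2, 1, 0)

# levels lists the codes, labels gives the text attached to each code, in the same order
gender <- factor(codes,
                 levels = c(0, 1, 2),
                 labels = c("males & females", "males", "females"))

levels(gender)
table(gender)   # number of observations per category
```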
We can generate summary statistics of the data by calling the built-in function summary():
summary(new_data_frame)
## department gender nb_hospitalizations
## Length:172630 males & females:57732 Min. : 0
## Class :character males :57732 1st Qu.: 15
## Mode :character females :57166 Median : 48
## Mean : 112
## 3rd Qu.: 128
## Max. :3281
## nb_reanimations nb_returned_home nb_deaths date
## Min. : 0.00 Min. : 0 Min. : 0.0 Min. :2020-03-18
## 1st Qu.: 1.00 1st Qu.: 197 1st Qu.: 43.0 1st Qu.:2020-08-06
## Median : 5.00 Median : 581 Median : 132.0 Median :2020-12-25
## Mean : 16.96 Mean : 1365 Mean : 311.3 Mean :2020-12-25
## 3rd Qu.: 17.00 3rd Qu.: 1501 3rd Qu.: 365.0 3rd Qu.:2021-05-16
## Max. :855.00 Max. :25253 Max. :4759.0 Max. :2021-10-04
Now that we have formatted the dataset in a data frame, we can start exploring it in more detail. Let us say that we want to know and show the total number of hospitalizations, deaths, or people in intensive care by department or at a specific date. To do so, we are going to use an R package that has been created to analyze data frames. The package is dplyr; you can find more information about it here. If the package is not already installed on your machine, you need to install it.
To do so, use the command:
install.packages("dplyr")
Then, to use the functions in this package, we load dplyr at the beginning of the script by calling the function library().
library(dplyr)
The main functions of the package are:
dplyr works with so-called pipes (%>%). Pipes take the output from one function and feed it to the first argument of the next function. This way, we can combine the above functions, as we will see shortly.
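To make the role of the pipe concrete, the sketch below performs the same computation twice on a made-up data frame (toy_df is hypothetical): once with nested function calls, once with pipes. It assumes that dplyr is installed:

```r
library(dplyr)

# Made-up data: two rows for department "1", one for department "2"
toy_df <- data.frame(department = c("1", "1", "2"),
                     nb_deaths  = c(3, 4, 10))

# Nested calls: we have to read from the inside out
nested <- summarize(filter(toy_df, department == "1"), total = sum(nb_deaths))

# Same computation with pipes: we read from left to right
piped <- toy_df %>%
  filter(department == "1") %>%
  summarize(total = sum(nb_deaths))

all.equal(nested, piped)
```

Both versions give the same one-row result; the piped version is usually easier to read once several steps are chained.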
To select some variables of a data frame, we can call the dplyr function select(), and using pipes, we go:
#Here we create a new data frame consisting of the variables nb_hospitalizations, nb_reanimations, nb_deaths, and department only.
new_data_frame_2 <- new_data_frame %>% select(c(nb_hospitalizations,nb_reanimations,nb_deaths,department))
head(new_data_frame_2)
## nb_hospitalizations nb_reanimations nb_deaths department
## 1 2 0 0 1
## 2 1 0 0 1
## 3 1 0 0 1
## 4 41 10 11 2
## 5 19 4 6 2
## 6 22 6 5 2
If we want to show the data for one department only, let us say the department “75”, we can use filter():
new_data <- new_data_frame %>% filter(department == "75")
head(new_data)
## department gender nb_hospitalizations nb_reanimations
## 1 75 males & females 359 105
## 2 75 males 217 70
## 3 75 females 139 35
## 4 75 males & females 453 122
## 5 75 males 279 85
## 6 75 females 170 37
## nb_returned_home nb_deaths date
## 1 40 14 2020-03-18
## 2 22 10 2020-03-18
## 3 18 4 2020-03-18
## 4 62 22 2020-03-19
## 5 31 16 2020-03-19
## 6 30 6 2020-03-19
We can summarize the data by department for males & females only at a specific date using the filter() function. For instance, to show the number of hospitalizations for males and females as of 2021-08-08, we write:
new_data <- new_data_frame %>% filter(date=="2021-08-08" & gender=="males & females") %>% select(c(nb_hospitalizations,department))
head(new_data,10)
## nb_hospitalizations department
## 1 68 1
## 2 37 2
## 3 15 3
## 4 43 4
## 5 23 5
## 6 249 6
## 7 26 7
## 8 2 8
## 9 31 9
## 10 26 10
We can achieve about the same result (the difference is caused by unspecified gender values) by keeping the rows for males and for females only and then summing the values by department for the same date as above. To sum by group, we first group the observations by a variable (department in this case) using the function group_by(), and then we call summarize(). More information on summarize() can be found here:
new_data <- new_data_frame %>% filter(date=="2021-08-08" & gender!="males & females") %>% select(c(nb_hospitalizations,department)) %>% group_by(department) %>% summarize(sum_nb_hostpitalizations_20210808=sum(nb_hospitalizations))
head(new_data,10)
## # A tibble: 10 x 2
## department sum_nb_hostpitalizations_20210808
## <chr> <int>
## 1 1 68
## 2 10 26
## 3 11 75
## 4 12 25
## 5 13 656
## 6 14 73
## 7 15 4
## 8 16 18
## 9 17 81
## 10 18 9
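Notice that the departments appear in the order 1, 10, 11, and so on. This is because department is a character variable, so the groups are sorted as text rather than as numbers. A small base R sketch on made-up codes shows the difference:

```r
dep <- c("1", "2", "10", "11", "2A")

# Text order: compared character by character, so "10" comes before "2"
sorted_dep <- sort(dep)
sorted_dep

# Numeric conversion works for purely numeric codes only;
# "2A" cannot be converted and becomes NA (the warning is silenced here)
dep_num <- suppressWarnings(as.numeric(dep))
dep_num
```

Because codes such as "2A" exist, keeping the variable as text is a reasonable choice, even though the display order looks unusual.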
To show the number of deaths by gender as of a certain date, we combine filter(), select(), group_by(), and summarize():
new_data <- new_data_frame %>% filter(date=="2021-08-08" & gender!="males & females") %>% select(c(nb_deaths,gender)) %>% group_by(gender) %>% summarize(tot_deaths_by_gender_20210808=sum(nb_deaths))
head(new_data,10)
## # A tibble: 2 x 2
## gender tot_deaths_by_gender_20210808
## <fct> <int>
## 1 males 49185
## 2 females 35922
Let us say we want to create a new variable that is equal to the ratio between the total number of hospitalizations and the total number of deaths. To create a new variable with dplyr, we can use the function mutate().
new_data <- new_data_frame %>% filter(nb_deaths>0) %>% mutate(ratio_hospitalizations_to_deaths=nb_hospitalizations/nb_deaths)
head(new_data,10)
## department gender nb_hospitalizations nb_reanimations
## 1 2 males & females 41 10
## 2 2 males 19 4
## 3 2 females 22 6
## 4 6 males & females 25 1
## 5 6 females 10 0
## 6 11 males & females 8 7
## 7 11 males 6 5
## 8 11 females 2 2
## 9 13 males & females 98 11
## 10 13 males 50 6
## nb_returned_home nb_deaths date ratio_hospitalizations_to_deaths
## 1 18 11 2020-03-18 3.727273
## 2 11 6 2020-03-18 3.166667
## 3 7 5 2020-03-18 4.400000
## 4 47 2 2020-03-18 12.500000
## 5 28 2 2020-03-18 5.000000
## 6 9 3 2020-03-18 2.666667
## 7 2 2 2020-03-18 3.000000
## 8 7 1 2020-03-18 2.000000
## 9 59 4 2020-03-18 24.500000
## 10 35 2 2020-03-18 25.000000
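The example above removes the rows with zero deaths before computing the ratio. An alternative, sketched below on a made-up data frame and assuming dplyr is installed, keeps all rows and stores NA where the ratio is not defined:

```r
library(dplyr)

# Made-up data, including one row with zero deaths
toy_df <- data.frame(nb_hospitalizations = c(10, 5, 8),
                     nb_deaths           = c(2, 0, 4))

# ifelse() returns the ratio when nb_deaths is positive and NA otherwise
toy_df <- toy_df %>%
  mutate(ratio = ifelse(nb_deaths > 0, nb_hospitalizations / nb_deaths, NA))

toy_df
```

Which of the two approaches is preferable depends on whether the rows without deaths are needed later in the analysis.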
Next, we see how to show, with plots, some features and insights about our data.
Before that, let us now practice a bit. You have to show the following information:
exo_1 <- new_data_frame %>% filter(gender=="females" & date=="2020-12-31") %>% select(c(nb_reanimations))
head(exo_1,10)
exo_2 <- new_data_frame %>% filter(gender=="males & females" & department=="44") %>% select(c(date,nb_deaths))
head(exo_2,10)
exo_3 <- new_data_frame %>% filter(gender!="males & females") %>% select(c(department,nb_deaths,gender)) %>% group_by(department,gender) %>% summarize(mysum=sum(nb_deaths))
exo_3 <- exo_3 %>% group_by(department) %>% summarize(ratio = mysum[gender=="males"]/mysum[gender=="females"])
head(exo_3,10)
exo_4 <- exo_3 %>% arrange(ratio)
head(exo_4,10)
We use the package ggplot2 to map variables to nice graphics. More information about the package can be found here. We start by installing and loading the package’s functions to R.
install.packages("ggplot2")
library(ggplot2)
Now, we can use the package’s functions. Here you can find a great cheatsheet summarizing the key functions, their arguments, and how to call them.
Let us start with something simple. We want to show the total number of deaths for males and females together over time on a chart (that is we take the sum across departments). To that end, we use first dplyr to get the data we are after and then we construct a plot with ggplot.
#We retrieve the total number of deaths for the group (males & females) and sum across department for each available date
new_data <- new_data_frame %>% filter(gender=="males & females") %>% select(c(nb_deaths,date,department)) %>% group_by(date) %>% summarize(sum_nb_deaths=sum(nb_deaths))
#Then we show it on a graph using the ggplot function (package ggplot) - we draw a line using geom_line()
ggplot(data=new_data,aes(x=date,y=sum_nb_deaths))+ geom_line()
Notice the way we add features to a plot with ggplot(). We add a feature (a geometry, a theme, labels, a scale) to the plot by appending an additional function call with the sign “+”.
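Because a plot is an ordinary R object, it can also be stored in a variable and extended step by step. The sketch below uses a made-up data frame (toy_df is hypothetical) and assumes ggplot2 is installed:

```r
library(ggplot2)

# Made-up data: five consecutive days with arbitrary values
toy_df <- data.frame(day   = as.Date("2021-01-01") + 0:4,
                     value = c(1, 3, 2, 5, 4))

# Start with the data and the mapping of variables to axes (no geometry yet)
p <- ggplot(data = toy_df, aes(x = day, y = value))

# Add a line, then axis labels, each with "+"
p <- p + geom_line()
p <- p + labs(x = "Date", y = "Value")

# Printing the object draws the plot
print(p)
```

Building the plot in steps is convenient when the same base plot is reused with different themes or labels.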
We can change the theme of the plot, for instance we indicate that we want to use the theme classic.
ggplot(data=new_data,aes(x=date,y=sum_nb_deaths))+ geom_line() + theme_classic()
We can change the color and size of the line we draw.
ggplot(data=new_data,aes(x=date,y=sum_nb_deaths))+ geom_line(colour = "blue", size = 1) + theme_classic()
We can add a title, change the axis labels, add more values on the vertical axis, and show the axis labels vertically.
ggplot(data=new_data,aes(x=date,y=sum_nb_deaths)) + geom_line(colour = "blue") + theme_classic() + labs(x = "Date", y="Number of deaths")+ ggtitle("Number of deaths from COVID in France over time")+ theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))+theme(plot.title = element_text(hjust = 0.5)) +scale_x_date(date_breaks = "1 month") +scale_y_continuous(n.breaks=30)
Now, we want to show the same chart but for the three gender groups (males & females, males, and females):
#We retrieve the total number of deaths for the groups (males & females, males, and females) and sum across department for each available date and group.
new_data <- new_data_frame %>% select(c(nb_deaths,date,gender)) %>% group_by(date,gender) %>% summarize(sum_nb_deaths=sum(nb_deaths))
#Then we show it on a graph using the ggplot function (package ggplot)
ggplot(data=new_data,aes(x=date,y=sum_nb_deaths,group=gender)) + geom_line(aes(color=gender))+ theme_classic() + labs(x = "Date", y="Number of deaths")+ ggtitle("Number of deaths from COVID in France over time by gender")+ theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))+theme(plot.title = element_text(hjust = 0.5))+scale_x_date(date_breaks = "1 month")+scale_y_continuous(n.breaks=30)+scale_color_discrete("Gender category")
We can also show the total number of deaths by gender (males vs. females) in the form of a bar chart (see the function geom_col()) as of a specific date:
#We retrieve the total number of deaths for the groups (males, and females) and sum across department at a given date
new_data <- new_data_frame %>% filter(date=="2021-07-08" & gender!="males & females") %>% select(c(nb_deaths,gender)) %>% group_by(gender) %>% summarize(sum_nb_deaths=sum(nb_deaths))
#we use the geom_col() function of ggplot, see R help
ggplot(data=new_data,aes(x=gender,y=sum_nb_deaths)) + geom_col(aes(fill=gender))+ theme_classic() + ggtitle("Total number of deaths by gender as of 2021-07-08")+labs(fill = "Gender category",x="", y="Number of deaths")
We can show the observations we have in our data set for the number of hospitalizations over time, for males and females, for the department 44, as points or dots.
new_data <- new_data_frame %>% filter(department=="44" & gender!="males & females") %>% select(c(nb_hospitalizations,gender,date))
#we use the geom_point() function of ggplot, see R help
ggplot(data=new_data,aes(x=date,y=nb_hospitalizations,fill=gender)) + geom_point(size=1, shape=23)+ theme_classic() + ggtitle("Number of hospitalizations in department 44 over time by gender")+labs(fill = "Gender categories",x="Date", y="Number of hospitalizations")+theme(plot.title = element_text(hjust = 0.5))+scale_x_date(date_breaks = "1 month")+theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))+scale_y_continuous(n.breaks=30)
Finally, we can create a heatmap of the total number of deaths for males and for females over a date range:
new_data <- new_data_frame %>% filter(gender!="males & females" & date>="2020-08-08" & date<="2021-08-08") %>% select(c(nb_deaths,gender,date)) %>% group_by(date,gender) %>% summarize(sum_nb_deaths=sum(nb_deaths))
ggplot(new_data, aes(x = date, y = gender, fill=sum_nb_deaths)) + geom_tile() + theme(plot.title = element_text(hjust = 0.5)) + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) + labs(fill="Nb deaths", x="Date", y="Gender category") + ggtitle("Nb deaths by date and gender")
Notice that more themes are available if we install and load the package ggthemes.
install.packages("ggthemes")
#We load the package
library(ggthemes)
We can now show the observations we have in our data set for the number of hospitalizations over time, for males and females, for the department 44, as points or dots, using the theme The Economist (the magazine):
new_data <- new_data_frame %>% filter(department=="44" & gender!="males & females") %>% select(c(nb_hospitalizations,gender,date))
#we use the geom_point() function of ggplot, see R help
ggplot(data=new_data,aes(x=date,y=nb_hospitalizations,fill=gender)) + geom_point(size=1, shape=23)+ theme_economist() + ggtitle("Number of hospitalizations in department 44 over time by gender")+labs(fill = "Gender categories",x="Date", y="Number of hospitalizations")+theme(plot.title = element_text(hjust = 0.5))+scale_x_date(date_breaks = "1 month")+theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))+scale_y_continuous(n.breaks=30)
We wish to add the population per department to our dataset to be able to scale the number of deaths or hospitalizations. To that end, we download the following xlsx file from INSEE. We only keep the 2021 sheet, rename some columns, and save it as pop2021.csv (available on Blackboard).
#see a bit later why we need to import the package stringr
library(stringr)
library(dplyr)
#retrieve data
pop_data <- read.csv("pop2021.csv", sep=";")
#change the column names of the data frame
names(pop_data) <- c("department","name_dep","pop")
head(pop_data)
## department name_dep pop
## 1 1 Ain 662 244
## 2 2 Aisne 525 503
## 3 3 Allier 331 745
## 4 4 Alpes-de-Haute-Provence 165 702
## 5 5 Hautes-Alpes 140 022
## 6 6 Alpes-Maritimes 1 089 270
#because the population is coded as a string with spaces between digits, R cannot interpret it as a number.
#We need to remove the spaces between digits - to do so we need to load the package stringr and, in particular, use the function str_replace_all().
pop_data$pop<-str_replace_all(pop_data$pop, pattern=" ", replacement="")
#then we convert the strings to numerical values, using the function as.numeric()
pop_data$pop<-as.numeric(pop_data$pop)
head(pop_data)
## department name_dep pop
## 1 1 Ain 662244
## 2 2 Aisne 525503
## 3 3 Allier 331745
## 4 4 Alpes-de-Haute-Provence 165702
## 5 5 Hautes-Alpes 140022
## 6 6 Alpes-Maritimes 1089270
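The same cleaning step can also be done without an additional package. As a sketch, the base R function gsub() removes the spaces from two made-up population strings before the conversion to numbers:

```r
# Population figures stored as text with spaces as thousands separators
pop_strings <- c("662 244", "1 089 270")

# gsub() replaces every space with an empty string; fixed = TRUE matches the space literally
pop_clean <- gsub(" ", "", pop_strings, fixed = TRUE)

# The cleaned strings can now be converted to numbers
pop_numbers <- as.numeric(pop_clean)
pop_numbers
```

If the conversion still returns NA on real data, the separator may be another character than a regular space (for instance a non-breaking space), which then has to be removed in the same way.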
Next we merge the data we have: we add the department population to the COVID dataset. To do so, we use the base R function merge():
#we add the name and population of each department to every row of the COVID data, matching on the department code
new_data <- merge(new_data_frame, pop_data, by.x = "department", by.y = "department")
head(new_data)
## department gender nb_hospitalizations nb_reanimations
## 1 1 females 13 0
## 2 1 males 49 4
## 3 1 females 155 4
## 4 1 males & females 223 21
## 5 1 males & females 161 14
## 6 1 males & females 145 16
## nb_returned_home nb_deaths date name_dep pop
## 1 222 44 2020-08-16 Ain 662244
## 2 1418 364 2021-06-20 Ain 662244
## 3 716 147 2020-12-24 Ain 662244
## 4 1677 411 2021-01-15 Ain 662244
## 5 2282 547 2021-03-19 Ain 662244
## 6 2770 608 2021-05-23 Ain 662244
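By default, merge() only keeps the rows whose key is found in both data frames. The sketch below, on two made-up data frames, shows this behaviour and how the argument all.x = TRUE keeps the unmatched rows of the first data frame:

```r
# Made-up data: department "99" has no population, department "3" has no COVID data
covid_toy <- data.frame(department = c("1", "2", "99"),
                        nb_deaths  = c(10, 20, 30))
pop_toy   <- data.frame(department = c("1", "2", "3"),
                        pop        = c(1000, 2000, 3000))

# Default: only departments present in both data frames are kept
inner <- merge(covid_toy, pop_toy, by = "department")

# all.x = TRUE: every row of covid_toy is kept, with NA where no population is found
left <- merge(covid_toy, pop_toy, by = "department", all.x = TRUE)

inner
left
```

Comparing the number of rows before and after a merge is a quick way to check whether some departments were dropped.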
We can now create and show the ratio of COVID deaths to population by department at a given date to increase the comparability of the information we provide across departments. We sort departments according to the ratio, using the arrange() function. We display the ten departments with the highest ratios:
new_data2<- new_data %>% filter(date=="2021-07-08" & gender=="males & females") %>% mutate(deaths_pop_ratio=(nb_deaths/pop)*100) %>% select(c(department,name_dep,deaths_pop_ratio))
new_data2<- new_data2 %>% arrange(-deaths_pop_ratio)
head(new_data2,10)
## department name_dep deaths_pop_ratio
## 1 90 Territoire de Belfort 0.4329348
## 2 88 Vosges 0.2393938
## 3 57 Moselle 0.2321752
## 4 52 Haute-Marne 0.2211245
## 5 55 Meuse 0.2133852
## 6 75 Paris 0.2128488
## 7 2 Aisne 0.2121777
## 8 94 Val-de-Marne 0.2105639
## 9 71 Saône-et-Loire 0.2021269
## 10 68 Haut-Rhin 0.1991356
Finally, we can create a pie chart out of the data we constructed:
#Basic pie chart
ggplot(data=new_data2[1:10,], aes(x="", y=deaths_pop_ratio, fill=name_dep)) +
geom_bar(stat="identity", width=1, color="white") +
coord_polar("y", start=0) +
theme_void()
Now that we know how to import data into a data frame, prepare our dataset, and show graphical outputs, let us practice. Based on the same set of data, you are asked to do the following tasks (a possible solution is hidden, you can click the button Code to unhide it).
Create a bar plot showing the number of deaths by gender (male or female) as of end of March 2021.
new_data <- new_data_frame %>% filter(date=='2021-03-31'& gender!="males & females") %>% select(c(nb_deaths,gender)) %>% group_by(gender) %>% summarize(sum_nb_deaths=sum(nb_deaths))
ggplot(data=new_data,aes(gender,sum_nb_deaths)) + geom_col(aes(fill=gender))+ theme_classic() + ggtitle("Total number of deaths from COVID as of 2021/03/31")+labs(fill = "Gender categories",x="", y="Number of deaths")
Show the number of people in intensive care (reanimations), as a % of the population, for males & females over time in the form of a line plot. Hint: we need to summarize across departments.
new_data <- merge(new_data_frame, pop_data, by.x = "department", by.y = "department")
#summarize() returns one row per date (mutate() alone would repeat the national total on every department row)
new_data2 <- new_data %>% filter(gender=="males & females") %>% group_by(date) %>% summarize(sum_reanimations=sum(nb_reanimations), sum_pop=sum(pop)) %>% mutate(pct_reanimations=(sum_reanimations/sum_pop)*100) %>% select(c(pct_reanimations,date))
ggplot(data=new_data2,aes(x=date,y=pct_reanimations)) +
geom_line(size=1) + theme_economist() +
ggtitle("% French population in intensive care over time")+
theme(plot.title = element_text(hjust = 0.5))+
scale_x_date(date_breaks = "1 month")+
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))+
scale_y_continuous(n.breaks=20)+
theme(axis.title.x = element_blank(),axis.title.y = element_blank())
Go to Kaggle, register, and download a dataset (.csv), load it into a data frame with R, and explore it with dplyr & ggplot (i.e., generate summary statistics and plots). For instance, you can find there the Forbes list of the world's billionaires in 2018, as well as some information on each billionaire in the list, and then produce some summary statistics by country, gender, or industry.
Because we have the information by department, representing it on a map can be a quick way to convey the gist of the data and share some insights. A first R package contains a collection of functions that help us design a map and customize it based on values per department found in our dataset. We need to import the package raster to get the coordinates of regions and departments.
install.packages("raster")
library(raster)
We can use the package raster to create a map of the percentage of a department’s population that died from COVID-19 as of March 2021:
#We restart from the step in which we add population data
new_data <- merge(new_data_frame, pop_data, by.x = "department", by.y = "department")
#we extract data at the department level as of end of March 2021
covid_death_per_dep <- new_data %>% filter(date=="2021-03-31" & gender=="males & females") %>% mutate(deaths_pop_ratio=(nb_deaths/pop)*100) %>% select(c(department,name_dep,deaths_pop_ratio))
#We load the map data of France - at the department level, see getData(). The object we get is a geospatial data frame.
formes <- getData(name="GADM", country="FRA", level=2)
#looking at the data, we see that NAME_2 gives the list of the departments in our formes object
head(formes)
## GID_0 NAME_0 GID_1 NAME_1 NL_NAME_1 GID_2 NAME_2
## 1 FRA France FRA.1_1 Auvergne-Rhône-Alpes <NA> FRA.1.1_1 Ain
## 5 FRA France FRA.1_1 Auvergne-Rhône-Alpes <NA> FRA.1.2_1 Allier
## 6 FRA France FRA.1_1 Auvergne-Rhône-Alpes <NA> FRA.1.3_1 Ardèche
## 7 FRA France FRA.1_1 Auvergne-Rhône-Alpes <NA> FRA.1.4_1 Cantal
## 8 FRA France FRA.1_1 Auvergne-Rhône-Alpes <NA> FRA.1.5_1 Drôme
## 9 FRA France FRA.1_1 Auvergne-Rhône-Alpes <NA> FRA.1.6_1 Haute-Loire
## VARNAME_2 NL_NAME_2 TYPE_2 ENGTYPE_2 CC_2 HASC_2
## 1 <NA> <NA> Département Department 01 FR.AI
## 5 Basses-Alpes <NA> Département Department 03 FR.AL
## 6 <NA> <NA> Département Department 07 FR.AH
## 7 <NA> <NA> Département Department 15 FR.CL
## 8 <NA> <NA> Département Department 26 FR.DM
## 9 <NA> <NA> Département Department 43 FR.HL
#in our COVID dataset and in our map data, there are department names we can use to do a matching
#We cannot use the merge function here because formes is a geospatial data frame,
#Hence, we use the following procedure to add the COVID data to the map data:
#we create an index linking the department of both datasets.
#we make use of the match function of R: https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/match
# match() returns a vector of the positions of (first) matches of its first argument in its second.
idx <- match(formes$NAME_2, covid_death_per_dep$name_dep)
#Do head(idx): each element of idx is the row of covid_death_per_dep whose name_dep matches the corresponding name in formes$NAME_2
#(NA means that the department name was not found in covid_death_per_dep)
#so idx gives us, for each department name of formes, its row position in covid_death_per_dep
#we extract the deaths_pop_ratio from covid_death_per_dep for the positions given by idx
concordance <- covid_death_per_dep[idx, "deaths_pop_ratio"]
#and can now add them in the right order to the geospatial data frame
formes$deaths_pop_ratio <- concordance
#we select the map colors
couleurs <- colorRampPalette(c('white', 'red'))
#we call spplot to draw a map that shows the death_per_pop by department, using a color code to indicate the severity of the COVID effect.
spplot(formes, "deaths_pop_ratio",col.regions=couleurs(100), main=list(label="% population of department that died from COVID-19 as of 2021-03-31",cex=.8))
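The index-based matching used in the code above can be illustrated on a toy example. The three vectors below are made up; map_names plays the role of formes$NAME_2, and data_names and data_ratio play the role of the columns of covid_death_per_dep:

```r
# Names in the order of the map data, and names/values in the order of the COVID data
map_names  <- c("Ain", "Allier", "Cantal")
data_names <- c("Ain", "Aisne", "Allier")
data_ratio <- c(0.10, 0.20, 0.30)

# For each map name, match() returns its position in data_names (NA if absent)
idx <- match(map_names, data_names)
idx

# Indexing with idx reorders the values so that they line up with map_names
ratio_in_map_order <- data_ratio[idx]
ratio_in_map_order
```

A department missing from the COVID data ends up with NA, which typically appears as an uncoloured area on the map.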
Now we will use the package leaflet to create a similar map. It offers more features; more on leaflet for R can be found there (https://rstudio.github.io/leaflet/). It works with map data from websites such as Google Maps or MapBox. To access the map data from MapBox, we need to use the MapBox API. API stands for Application Programming Interface; it is a way to access public data shared by websites such as Facebook, Twitter, Spotify, Google Maps, or, in this case, MapBox. Access to the MapBox API, as is often the case for APIs, requires a key. Make sure you create a MapBox account (free) there and retrieve your own API key for MapBox before working on the code below (the API key I use is hidden; you must supply your own API key in place of MY_API_KEY for the code to work as intended):
install.packages("leaflet")
library(leaflet)
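#we use the same formes object as we created above (in the raster example)
#we create a MapBox account first and retrieve the token so that we can use the API
#MY_API_KEY stands for the hidden key: either store your own key in an environment variable with that name,
#or replace the whole Sys.getenv(...) call with your key as a quoted string
m <- leaflet(formes) %>%
addProviderTiles("MapBox", options = providerTileOptions(
id = "mapbox.light",
accessToken = Sys.getenv("MY_API_KEY")))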
#create colors
pal <- colorBin("YlOrRd", domain = formes$deaths_pop_ratio, bins = 8)
#create labels
formes$label <- paste(formes$NAME_2,": ",round(formes$deaths_pop_ratio,2),"% deaths from COVID in the pop",sep="")
labels <- formes$label
#add polygons with colors and labels
m<- m %>% addPolygons(
fillColor = ~pal(deaths_pop_ratio),
weight = 2,
opacity = 1,
color = "white",
dashArray = "3",
fillOpacity = 0.7,
highlight = highlightOptions(
weight = 5,
color = "#666",
dashArray = "",
fillOpacity = 0.7,
bringToFront = TRUE), label = labels, labelOptions = labelOptions(style = list("font-weight" = "normal", padding = "3px 8px"), textsize = "15px",direction = "auto"))
#add legends
m <- m %>% addLegend(pal = pal, values = ~deaths_pop_ratio, opacity = 0.7,position = "bottomright",title = "% of deaths to pop")
#show leaflet map
m
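The MapBox example above reads the key with Sys.getenv() rather than writing it in the script. Below is a minimal sketch of how that mechanism works, assuming the key is stored in an environment variable called MY_API_KEY (the variable name and the placeholder value are assumptions for illustration, not a real key).

```r
# Hypothetical variable name and placeholder value: replace with your own key.
# Set this way, the variable only lasts for the current R session.
Sys.setenv(MY_API_KEY = "replace-with-your-own-key")

# Read the key back; Sys.getenv() takes the variable name as a character string
token <- Sys.getenv("MY_API_KEY")

# Sys.getenv() returns an empty string when the variable is not set, so we check for that
if (!nzchar(token)) {
  stop("MY_API_KEY is not set")
}
```

To keep the key across sessions without putting it in a script you share, one option is to add a line of the form MY_API_KEY=your-key to your .Renviron file, which R reads at startup.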
One can also use the default map provider of leaflet (OpenStreetMap), which does not require an API key. Still, it is important to know how to use a package that requires an API key, because this is common among packages that retrieve public data from online sources, for instance from Facebook or Twitter:
#we use the same formes object as we created above (in the raster example)
#we use the default map provider (OpenStreetMap) that does not require an API key; addTiles() adds its tiles
m <- leaflet(formes) %>% addTiles()
#create colors
pal <- colorBin("Blues", domain = formes$deaths_pop_ratio, bins = 8)
#create labels
formes$label <- paste(formes$NAME_2,": ",round(formes$deaths_pop_ratio,2),"% deaths from COVID in the pop",sep="")
labels <- formes$label
#add polygons with colors and labels
m<- m %>% addPolygons(
fillColor = ~pal(deaths_pop_ratio),
weight = 2,
opacity = 1,
color = "white",
dashArray = "3",
fillOpacity = 0.7,
highlight = highlightOptions(
weight = 5,
color = "#666",
dashArray = "",
fillOpacity = 0.7,
bringToFront = TRUE), label = labels, labelOptions = labelOptions(style = list("font-weight" = "normal", padding = "3px 8px"), textsize = "15px",direction = "auto"))
#add legends
m <- m %>% addLegend(pal = pal, values = ~deaths_pop_ratio, opacity = 0.7,position = "bottomright",title = "% of deaths to pop")
#show leaflet map
m
Find a dataset on Kaggle related to COVID-19 from which you can display data on a map. To guide you, we provide below another application with a dataset from Kaggle (a world map this time). We want to show information about vaccination progress across countries in the world. We download this dataset, also available on Blackboard under world_vac.csv (data as of 2021-10-04). The R file to implement this application is available on Blackboard under world_vac_case.R.
world_vac<-read.csv("world_vac.csv",sep=",")
world_vac<- world_vac %>% select(-c(Doses.Administered,Doses.per.1000,Vaccine.being.used.in.a.country))
#we use the rename() function of dplyr to change the name of some columns in our data frame (new name = old name)
world_vac<- world_vac %>% rename(pct_pop_vaccinated=Fully.Vaccinated.Population....)
world_vac<- world_vac %>% rename(country=Country)
head(world_vac)
## country pct_pop_vaccinated
## 1 World 35.3
## 2 China 75.2
## 3 India 18.2
## 4 United States 56.7
## 5 Brazil 44.9
## 6 Japan 61.0
str(world_vac)
## 'data.frame': 207 obs. of 2 variables:
## $ country : chr "World" "China" "India" "United States" ...
## $ pct_pop_vaccinated: num 35.3 75.2 18.2 56.7 44.9 61 19.7 54.7 64.8 36.3 ...
Now we draw the map. We use the package rworldmap this time because it suits our purpose.
library(rworldmap)
#Then, we join the vaccination data to the world map of the library calling the joinCountryData2Map() function
joinData <- joinCountryData2Map( world_vac,
joinCode = "NAME",
nameJoinColumn = "country")
## 199 codes from your data successfully matched countries in the map
## 8 codes from your data failed to match with a country code in the map
## 44 codes from the map weren't represented in your data
joinData <- subset(joinData, continent != "Antarctica")
#we plot the map
theMap <- mapCountryData(joinData, nameColumnToPlot="pct_pop_vaccinated",mapTitle="Percentage of the population vaccinated by country as of 2021-10-04")
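The join above reports that 8 names from the data did not match a country in the map. Here is a minimal sketch of how such mismatches can be inspected and repaired with base R. The vectors data_names and map_names and the lookup fixes are made up for illustration; the names that actually need fixing depend on your dataset.

```r
# Made-up country names as they might appear in the data and in the map
data_names <- c("France", "United States", "Czechia", "World")
map_names <- c("France", "United States", "Czech Republic")

# Names present in the data but absent from the map
unmatched <- setdiff(data_names, map_names)
print(unmatched)  # "Czechia" "World"

# Hypothetical lookup: name used in the data = name used in the map
fixes <- c("Czechia" = "Czech Republic")

# Replace the names that have an entry in the lookup
needs_fix <- data_names %in% names(fixes)
data_names[needs_fix] <- unname(fixes[data_names[needs_fix]])

# What is still unmatched is not a country (here the aggregate row "World")
still_unmatched <- setdiff(data_names, map_names)
print(still_unmatched)  # "World"
```

The same replacement can be applied to the country column of the data frame before calling joinCountryData2Map().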
Now, it is your turn.
You can find a good tutorial to prepare this part of the course here. For a more complete introduction to Shiny, please refer to this link. I recommend studying it in advance, before the class.
First, we install and load Shiny.
install.packages("shiny")
library(shiny)
To create a Shiny app, a good practice is to create a new directory for your app and put a single file in it, called app.R for instance. To create a new directory, use the file explorer in the bottom right of RStudio. All the files (images, csv files) that your app needs must be stored in the same folder. We write the following code in app_ex_1.R; the file is available on Blackboard. We report the code below, but the outcome of the code cannot be shown directly in this document: we must run it in RStudio.
#Load the shiny package
library(shiny)
#There is the user interface (ui) part: we set up what the user will see - we define the HTML page users will interact with.
#currently, it shows Hello, world!
ui <- fluidPage(
"Hello, world!"
)
#There is the server part: we set up the computations we do and how the app reacts to inputs from a user
#it is currently empty, so the app does not do anything
server <- function(input, output, session) {
}
#we create the APP - it will start the app on your local machine - later on we will see how to publish this app online.
shinyApp(ui, server)
Run the code in RStudio to see the app it creates.
The code tells Shiny both how our app should look, and how it should behave. The code defines the user interface and specifies the behavior of our app by defining a server function. Finally, it executes shinyApp(ui, server) to construct and start a Shiny application from ui and server. The app runs locally (http://127.0.0.1) as it is reported in the console. We will see later on how to publish it online.
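The advice above about giving each app its own directory can also be followed from code. The sketch below is only an illustration: the folder name my_first_app is hypothetical, and it is created inside R's temporary directory so that nothing is written to your project.

```r
# Hypothetical location: a folder called my_first_app inside R's temporary directory
app_dir <- file.path(tempdir(), "my_first_app")
dir.create(app_dir, showWarnings = FALSE)

# Write the Hello World app into a file called app.R inside that folder
app_code <- c(
  "library(shiny)",
  "ui <- fluidPage(\"Hello, world!\")",
  "server <- function(input, output, session) {}",
  "shinyApp(ui, server)"
)
writeLines(app_code, file.path(app_dir, "app.R"))

# Check what the folder contains
print(list.files(app_dir))

# Launch the app from its folder (only in an interactive session with shiny installed)
if (interactive() && requireNamespace("shiny", quietly = TRUE)) {
  shiny::runApp(app_dir)
}
```

Data files and images used by the app would be saved in the same folder, next to app.R.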
Next, we want to add some inputs and outputs to our user interface (ui). We are going to make a very simple app that shows the user all the variables included in a data frame. The user can then pick the variable she wants information on. We will use the following Shiny functions: varSelectInput(), verbatimTextOutput(), tableOutput(), reactive(), renderPrint(), and renderTable().
The steps are the following: we pre-load the dataset, we pre-load the variable names in the select box, the user chooses a variable in the select box, and, based on the user’s choice, the app displays some information about the variable.
We use the same COVID dataset as before: covid_data.csv.
We create app_ex_2.R as follows, also available on Blackboard:
#Load the shiny and dplyr packages
library(shiny)
library(dplyr)
#we get COVID data from our usual source and format it the way we did before
new_data_frame <- read.csv(file="covid_data.csv",header=T,sep=";")
new_data_frame <- new_data_frame %>% select(c(dep,sexe,jour,hosp,rea,rad,dc))
new_data_frame$date<-as.Date(new_data_frame$jour, tryFormats = c("%d/%m/%Y"))
new_data_frame <- new_data_frame %>% select(c(-jour))
colnames(new_data_frame)<-c("department","gender","Number of hospitalizations","Number of reanimations","Number of returned home","Number of deaths","date")
new_data_frame$gender<-factor(new_data_frame$gender,labels=c("males & females","males","females"))
#create the user interface (ui), this time we specify more things:
ui <- fluidPage(
#we create a select box containing all the variables of the data frame new_data_frame. The name of this select box is variable.
varSelectInput("variable", label = "Please choose a variable", new_data_frame),
#we show the text output called summary that is created by our server function upon an action of the user
verbatimTextOutput("summary"),
#we show the table called table that is created by our server function upon an action of the user
tableOutput("table")
)
#we store the server functions here - where we perform calculations based on the user's choices
server <- function(input, output, session) {
#we define a reactive, something that reacts to the user's actions, when the user changes something. We select the column of the dataframe based on the value selected in the select box by the user.
dataset <- reactive({
new_data_frame %>% select(c(!!input$variable))
})
#we define a first output that we want to show: the built-in summary() function of R applied to the column of the data frame selected by the user in the select box called variable. We use the renderPrint() function from Shiny to turn it into a piece of text to show to the user.
output$summary <- renderPrint({
summary(dataset())
})
#we define a second output that we want to show: the first six rows of the selected column, obtained with head(). We use the renderTable() function from Shiny to turn it into a table to show to the user.
output$table <- renderTable({
head(dataset())
})
}
#we create the app
shinyApp(ui, server)
Each render{Type} function is designed to produce a particular type of output (e.g., text, tables, or plots), and is often paired with a {type}Output function. For example, in the above app, renderPrint() is paired with verbatimTextOutput() to display a statistical summary with fixed-width (verbatim) text, and renderTable() is paired with tableOutput() to show the input data in a table.
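As a further illustration of this pairing, the sketch below pairs renderPlot() with plotOutput() and renderText() with textOutput(). It is not one of the course apps: it uses faithful, a dataset that ships with R, purely as a stand-in, and the output names hist and n_obs are arbitrary.

```r
library(shiny)

ui <- fluidPage(
  # plotOutput("hist") reserves a place in the page for the output named hist
  plotOutput("hist"),
  # textOutput("n_obs") reserves a place for the output named n_obs
  textOutput("n_obs")
)

server <- function(input, output, session) {
  # renderPlot() fills the place reserved by plotOutput("hist")
  output$hist <- renderPlot({
    hist(faithful$waiting,
         main = "Waiting time between eruptions",
         xlab = "Minutes")
  })
  # renderText() fills the place reserved by textOutput("n_obs")
  output$n_obs <- renderText({
    paste("Number of observations:", nrow(faithful))
  })
}

# Build the app object; start it only in an interactive session
app <- shinyApp(ui, server)
if (interactive()) runApp(app)
```

The link between the two sides is the name: output$hist in the server must use the same name as plotOutput("hist") in the ui.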
Run the code in RStudio to see the app it creates.
So far, the user can browse the variables of the data frame and obtain summary information about the data. Now we let the user choose a date range. To do so, we need to add another select box. Based on the dates and the variable selected, we will show the total by gender in the form of a bar chart. We also restrict the variables the user can explore. We create app_ex_3.R, available on Blackboard (with comments). Run the code in RStudio to see the app it creates.
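The idea described above can be sketched as follows. This is not the code of app_ex_3.R: it is a minimal sketch under stated assumptions, using a made-up data frame toy, a hypothetical helper total_by_gender(), and Shiny's dateRangeInput() as one possible way to let the user pick dates.

```r
library(shiny)

# Made-up toy data with the same kind of columns as the course data frame
toy <- data.frame(
  gender = c("males", "females", "males", "females"),
  date = as.Date(c("2021-01-01", "2021-01-01", "2021-02-01", "2021-02-01")),
  deaths = c(3, 2, 5, 4)
)

# Hypothetical helper: sum one variable by gender, keeping only rows inside the date range
total_by_gender <- function(df, variable, from, to) {
  in_range <- df$date >= from & df$date <= to
  tapply(df[[variable]][in_range], df$gender[in_range], sum)
}

ui <- fluidPage(
  dateRangeInput("dates", label = "Please choose a date range",
                 start = min(toy$date), end = max(toy$date)),
  plotOutput("bars")
)

server <- function(input, output, session) {
  output$bars <- renderPlot({
    # input$dates is a vector of two dates: the start and the end of the range
    totals <- total_by_gender(toy, "deaths", input$dates[1], input$dates[2])
    # Draw nothing if no row falls inside the chosen range
    req(length(totals) > 0)
    barplot(totals, main = "Total deaths by gender")
  })
}

app <- shinyApp(ui, server)
if (interactive()) runApp(app)
```

Keeping the calculation in a small function outside the server makes it possible to check it in the console, for example with total_by_gender(toy, "deaths", as.Date("2021-01-01"), as.Date("2021-01-31")), before wiring it into the app.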
Next, we add another plot that shows a time series based on the variable chosen. We create app_ex_4.R, available on Blackboard (with comments). Run the code in RStudio to see the app it creates.
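A time series plot of this kind needs one value per date. The sketch below shows one possible way to prepare and draw it with base R; it is not the code of app_ex_4.R, and the data frame toy and the helper daily_total() are hypothetical.

```r
# Made-up toy data: two rows (one per gender) for each of two days
toy <- data.frame(
  date = as.Date(c("2021-01-01", "2021-01-01", "2021-01-02", "2021-01-02")),
  gender = c("males", "females", "males", "females"),
  hospitalizations = c(3, 2, 5, 4)
)

# Hypothetical helper: total of one variable per date, sorted by date
daily_total <- function(df, variable) {
  out <- aggregate(df[[variable]], by = list(date = df$date), FUN = sum)
  names(out) <- c("date", "total")
  out
}

series <- daily_total(toy, "hospitalizations")
print(series)

# Line plot of the series; inside a Shiny app this call would go in renderPlot()
plot(series$date, series$total, type = "l",
     xlab = "Date", ylab = "Total hospitalizations")
```

In an app, the name of the variable passed to daily_total() would come from the user's choice in the select box.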
We further customize our app. There are many layout options for us to change the way we show options.
We introduce a menu to navigate the outputs and we add some text to indicate what the dashboard is and what it allows the user to do. We use a panel layout to do so. We create app_ex_5.R, available on Blackboard (with comments). Run the code in RStudio to see the app it creates.
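One possible way to build such a menu is Shiny's tabsetPanel(), sketched below. This is not the code of app_ex_5.R: the tab titles and the text are placeholders, and the built-in faithful dataset stands in for the COVID data.

```r
library(shiny)

ui <- fluidPage(
  titlePanel("COVID-19 dashboard (sketch)"),
  # Each tabPanel() is one entry of the menu
  tabsetPanel(
    tabPanel("About",
             p("This dashboard lets you explore the variables of a dataset.")),
    tabPanel("Summary", verbatimTextOutput("summary")),
    tabPanel("Table", tableOutput("table"))
  )
)

server <- function(input, output, session) {
  # faithful is a dataset that ships with R, used here only as a placeholder
  output$summary <- renderPrint(summary(faithful))
  output$table <- renderTable(head(faithful))
}

app <- shinyApp(ui, server)
if (interactive()) runApp(app)
```

Shiny also offers navbarPage(), which places the same kind of tabs in a navigation bar at the top of the page.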
Next, we add a sidebar where the user specifies her inputs, so that they are all in the same place. Moreover, we play with the theme of the dashboard using the standard themes available for Shiny. We will also let the user exclude department 44 using a checkbox, and restrict our dataset to a total number of deaths lower than a threshold that the user defines. We create app_ex_6.R, available on Blackboard (with comments). Run the code in RStudio to see the app it creates.
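The sidebar, the checkbox, and the threshold can be sketched as follows. This is not the code of app_ex_6.R: the data frame toy and the helper filter_departments() are hypothetical, the numbers are invented, and the theme is left out to keep the example short.

```r
library(shiny)

# Made-up toy data: total deaths for three departments
toy <- data.frame(
  department = c("44", "75", "13"),
  deaths = c(120, 900, 400)
)

# Hypothetical helper: keep rows below the threshold, optionally dropping department 44
filter_departments <- function(df, exclude_44, max_deaths) {
  keep <- df$deaths < max_deaths
  if (exclude_44) {
    keep <- keep & df$department != "44"
  }
  df[keep, , drop = FALSE]
}

ui <- fluidPage(
  sidebarLayout(
    # All the inputs are grouped in the sidebar
    sidebarPanel(
      checkboxInput("exclude_44", "Exclude department 44", value = FALSE),
      numericInput("max_deaths", "Show departments with fewer deaths than:",
                   value = 1000, min = 0)
    ),
    # The outputs go in the main panel
    mainPanel(tableOutput("table"))
  )
)

server <- function(input, output, session) {
  output$table <- renderTable({
    # Wait until the threshold box contains a number
    req(input$max_deaths)
    filter_departments(toy, input$exclude_44, input$max_deaths)
  })
}

app <- shinyApp(ui, server)
if (interactive()) runApp(app)
```

For example, filter_departments(toy, TRUE, 500) keeps only department 13 in this toy data.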
Finally, we add a map to our app, building on what we have learned. We create app_ex_7.R, available on Blackboard (with comments). To see it online, go there.
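A leaflet map is placed in a Shiny app with the pair leafletOutput() and renderLeaflet() from the leaflet package. The sketch below is not the code of app_ex_7.R: it only shows an empty OpenStreetMap background, and the coordinates are an approximate center of mainland France chosen for illustration.

```r
library(shiny)
library(leaflet)

ui <- fluidPage(
  # leafletOutput("map") reserves a place in the page for the map named map
  leafletOutput("map")
)

server <- function(input, output, session) {
  # renderLeaflet() builds the map shown in leafletOutput("map")
  output$map <- renderLeaflet({
    leaflet() %>%
      addTiles() %>%                               # default OpenStreetMap tiles
      setView(lng = 2.35, lat = 46.6, zoom = 5)    # approximate center of France
  })
}

app <- shinyApp(ui, server)
if (interactive()) runApp(app)
```

The polygons, colors, labels, and legend from the earlier leaflet examples would be added inside renderLeaflet(), in the same chain of %>% calls.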
The easiest way to turn your Shiny app into a web page is to use shinyapps.io, RStudio’s hosting service for Shiny apps. shinyapps.io lets us upload our app straight from our R session to a server hosted by RStudio. We have complete control over our app including server administration tools. To find out more about shinyapps.io, please visit the above link. We need to register there, then log in.
Then we connect Shiny to the freshly created web space by going to our account and then retrieving the token information from there. We use the token information as explained here.
Once our publishing account is well configured, we do a test. We will publish on our web space the application app_ex_7.R. Open in RStudio app_ex_7.R, then click the Publish Application button. Select app_ex_7.R and the files required for the data analysis (they should be selected by default), then validate.
Pick as a title for the application My_App_7. Once successfully published, you should be able to access the webpage with the following URL: https://YOURUSERNAME.shinyapps.io/My_App_7/. Of course, replace YOURUSERNAME by the user name you have chosen when you created your own account on shinyapps.io.
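The same publication can be done from the R console with the rsconnect package instead of the Publish Application button. The sketch below wraps the two calls in a hypothetical helper, publish_app(); it assumes the rsconnect package is installed, that the app is saved as app.R in its own folder together with its data files, and that the account name, token, and secret are the ones shown in your shinyapps.io account. All values in the example call are placeholders.

```r
# Hypothetical helper wrapping the two rsconnect calls needed to publish an app
publish_app <- function(app_dir, account, token, secret, app_name = "My_App_7") {
  # Register the shinyapps.io account on this computer (needed only once)
  rsconnect::setAccountInfo(name = account, token = token, secret = secret)
  # Upload the folder that contains the app and the files it needs
  rsconnect::deployApp(appDir = app_dir, appName = app_name)
}

# Example call with placeholder values (not run here):
# publish_app("path/to/app_folder", "YOURUSERNAME", "YOUR_TOKEN", "YOUR_SECRET")
```

Defining the function does not publish anything; the upload only happens when publish_app() is called with real account details.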
To see the one hosted on my shinyapps server, click here.
We now know how to create and publish an interactive dashboard with R.
A final touch: say we want to generate and share a QR code for our online application. We can easily do that with R too, using the code below (after installing the packages qrcode and imager):
#create qrcode
library("qrcode")
png("qrplot.png")
qrcode_gen("https://agarel86.shinyapps.io/My_App_7/")
dev.off()
#show it
library("imager")
im<-load.image("qrplot.png")
plot(im)
Now we will see, step by step, how to create and publish an online interactive dashboard with R (Shiny) from scratch using other sets of data. Then, as your assignment for this course, you will have to create your own application using different data, with adequate app functions, Shiny interface, and Shiny server computations.
Let us consider the three following applications. For each of them, there is a commented R file that creates the associated application and a link to an online version of the app:
We want to show information about the possible side effects from COVID vaccines. The commented R code is available on Blackboard under case_1.R, and an online version is here. It requires the data file covid side effects US.csv available on Blackboard.
We want to look at the emotion and sentiment conveyed by COVID-related tweets. You can find the commented R code on Blackboard under case_2.R, and an online version is here. It requires the data file tweetid_userid_keyword_sentiments_emotions_France.csv available on Blackboard.
We want to show, for a given city, on a map, all the places where you can get vaccinated. You can find the commented R code on Blackboard under case_3.R, and an online version is here. It requires the data file centres_vaccinations.csv available on Blackboard.
As a group of three, create your own online interactive dashboard with R. The only constraint: it has to be related to COVID-19 data in France or in another country. The dashboard should allow the user to visualize the data in the most complete and relevant way and respond to the user’s inputs. You can use the data source of your choice. Feel free to consider data.gouv.fr or Kaggle; there are many other sources.
The marking criteria (out of 20) are:
The mark is worth 25% of the final grade for the course. The due date is December 25, 2021.
Good luck!
You can contact your lecturer, Alexandre Garel, at agarel@audencia.com. My personal website is there.