Introduction to the R Studio Integrated Development Environment

S3729C Wellness and Health Management - Lesson 11

Class Date : 3 September 2022

Learning Outcomes for Lesson 11

At the end of this lesson, students will be able to:

Describe the various components of the R Studio Integrated Development Environment (IDE) and how to input functions, variables and arguments in the console to generate output.
Execute code for the construction of data visualisation features using the ggplot2 nomenclature, including but not limited to scatter plots, boxplots, histograms, density plots, ridgeline plots and heatmaps.
Develop appropriate data visualisation outputs to analyse raw data retrieved from a programme implementation survey, and identify potential action points arising from this data.

Setting Up Your R Studio Cloud Account

Please follow the steps outlined below to set up your R Studio Cloud Account, which will be used for the next 5 lessons in this module.

Step 01 : Go to R Studio Cloud Website https://Rstudio.cloud
Step 02 : Click on GET STARTED FOR FREE
Step 03 : Sign up for the free account.
Step 04 : When you get to the R Studio Cloud Workspace, Click on New Project.
Step 05 : Name Your Project as S3729C Data Analytics Seminar
Step 06 : In the console, type in the following: install.packages("tidyverse")

Congratulations, your R Studio Cloud Account is now ready !

The R Studio IDE Workspace Components

The R Studio IDE Workspace consists of the following key components

Input Source
Console
Environment
Files | Plots | Packages | Help | Viewer

Data Types and Assignment of Variables in R
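
A minimal sketch of assignment and basic data types follows. The variable names and values are illustrative and are not taken from the lesson dataset.

```r
# Assign values to variables with the <- operator
participant_age <- 21L        # integer (the L suffix marks a whole number)
rating <- 4.5                 # numeric (double)
event_name <- "Bagan Race"    # character
attended <- TRUE              # logical

# class() reports the data type of each variable
class(participant_age)
class(rating)
class(event_name)
class(attended)
```

Typing a variable name on its own in the console prints its current value.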

Creating and Managing Vectors in R
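
A minimal sketch of creating and working with a vector follows. The ratings are made up for illustration.

```r
# c() combines values of the same type into a vector
ratings <- c(4, 5, 3, 5, 2)

length(ratings)          # number of elements
ratings[2]               # second element (R indexes from 1)
ratings[ratings >= 4]    # elements that meet a condition
mean(ratings)            # summary of the whole vector

# Arithmetic is applied element by element
ratings_out_of_10 <- ratings * 2
ratings_out_of_10
```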

Loading the Packages Required in R for This Lesson

This code installs pacman, which is used to install and load all the packages required in this lesson.

install.packages("pacman",repos = "http://cran.us.r-project.org")

## package 'pacman' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\aaron_chen_angus\AppData\Local\Temp\RtmpWmF5zm\downloaded_packages

We will then proceed to load the packages required for this section

pacman: for loading/unloading packages
psych: for psychometric functions
rio: for importing data
tidyverse: for data wrangling and visualisation functions
ggplot2: for the use of ggplot2 functions for visualisation
ggridges: for the generation of ridgeline plots
vioplot: for the generation of violin plots
dplyr: for data wrangling purposes
ggwordcloud: for generating word clouds
ggforce: for force fitting word clouds
gridExtra: for using the grid.arrange function

pacman::p_load(pacman, psych, rio, tidyverse, ggplot2, ggridges, devtools, vioplot, dplyr, ggwordcloud, ggforce, gridExtra)

## Error in get(genname, envir = envir) : object 'testthat_print' not found
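
If loading reports an error, as in the output above, one way to narrow it down is to check which packages are actually installed. The sketch below uses base R only. The helper name missing_packages is hypothetical and is not part of pacman.

```r
# Hypothetical helper: return the names of packages that cannot be found
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Packages used in this lesson
lesson_pkgs <- c("pacman", "psych", "rio", "tidyverse", "ggplot2", "ggridges",
                 "devtools", "vioplot", "dplyr", "ggwordcloud", "ggforce",
                 "gridExtra")

# An empty result means every package was found
missing_packages(lesson_pkgs)
```

Any package named in the result can then be installed individually with install.packages().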

Loading Data Files into R Studio Cloud for Analysis

We will illustrate two methods of introducing source data files for subsequent analysis in these five lessons on data science:

1. Uploading files directly into the R Studio Cloud Environment
2. Accessing files via a link in GitHub

Uploading Files Directly into R Studio Cloud Environment

Step 1 :

Download the file located at this link to your local folder in your computer.
File Source : https://raw.githubusercontent.com/aaron-chen-angus/community2campus/main/C2CsurveyAggregated.csv
This file is stored in .csv format
It contains data from surveys conducted with participants of events held under the CommunityᄅCampus programme, which was organised with the students of SDSSW Intake 03 (e.g. NBFA Sports Model, The Next Hitmaker, Bagan Race, Kindred Spirits, Off Centre, What a Wonderful World)
For more details about the CommunityᄅCampus programme, feel free to visit our event website : https://community2campus.com or Instagram : https://www.instagram.com/community2campus

Step 2 :

Click on the Upload button in the Files Tab
Locate the file : C2CsurveyAggregated.csv on your local drive
Upload it using the “Choose File” function

Accessing Files via a Link in GitHub

Step 1 : Go to https://github.com (sign up for an account for free if you have not done so)

Step 2 : Create a new repository

Step 3 : Upload the file into your created repository

Step 4 : Click on the RAW tab

Step 5 : Copy the URL of the Raw file for Referencing

Importing and Reading the Data File

You can now import and read the source data file from the following GitHub link via the read.csv function.

Data Source : https://raw.githubusercontent.com/aaron-chen-angus/community2campus/main/C2CsurveyAggregated.csv

C2Cagg <- read.csv(file = "https://raw.githubusercontent.com/aaron-chen-angus/community2campus/main/C2CsurveyAggregated.csv", header = TRUE, sep = ",")

Check the integrity of the loaded data by reading the column names.

C2Cagg %>% colnames()

##   [1] "UNQ_ID"        "AGE.RANGE"     "GENDER"        "NCSS_B03A"    
##   [5] "NCSS_B03B"     "NCSS_B04"      "NCSS_B05"      "EVENT"        
##   [9] "PARTICIPATION" "PART_ROLE"     "AFFILIATION"   "Process_E_LS" 
##  [13] "Process_E_EX"  "Process_E_CT"  "Pre_MR1"       "Post_MR1"     
##  [17] "X_MR1"         "Pre_MR2"       "Post_MR2"      "X_MR2"        
##  [21] "Pre_MR3"       "Post_MR3"      "X_MR3"         "Pre_MR4"      
##  [25] "Post_MR4"      "X_MR4"         "Pre_MR5"       "Post_MR5"     
##  [29] "X_MR5"         "Pre_AT_PT1"    "Pre_AT_PT2"    "Pre_AT_PT3"   
##  [33] "Pre_AT_WT5"    "Pre_AT_CT1"    "Pre_AT_CT2"    "Pre_AT_CT3"   
##  [37] "Pre_AT_CT4"    "Pre_AT_LA1"    "Pre_AT_LA2"    "Pre_AT_LA3"   
##  [41] "Pre_AT_SC6"    "Pre_ST_SD1"    "Pre_ST_SD2"    "Pre_ST_SD3"   
##  [45] "Pre_ST_SD4"    "Pre_NCSS_P2"   "Pre_AT_WT2"    "Pre_AT_WT3"   
##  [49] "Pre_AT_WT4"    "Pre_AT_LA4"    "Pre_AT_WT1"    "Pre_AT_CT5"   
##  [53] "Pre_AT_LA5"    "Pre_AT_SC2"    "Pre_ST_SD5"    "Pre_ST_SR2"   
##  [57] "Pre_ST_SR1"    "Pre_ST_SR3"    "Pre_ST_SR4"    "Pre_ST_SR5"   
##  [61] "Pre_NCSS_P1"   "Pre_NCSS_P3"   "Pre_NCSS_P4"   "Pre_NCSS_P5"  
##  [65] "Pre_AT_SC1"    "Pre_AT_SC3"    "Pre_AT_SC4"    "Pre_AT_SC5"   
##  [69] "Post_AT_PT1"   "Post_AT_PT2"   "Post_AT_PT3"   "Post_AT_WT5"  
##  [73] "Post_AT_CT1"   "Post_AT_CT2"   "Post_AT_CT3"   "Post_AT_CT4"  
##  [77] "Post_AT_LA1"   "Post_AT_LA2"   "Post_AT_LA3"   "Post_ST_SD1"  
##  [81] "Post_ST_SD2"   "Post_ST_SD3"   "Post_ST_SD4"   "Post_AT_SC6"  
##  [85] "Post_NCSS_P2"  "Post_AT_WT2"   "Post_AT_WT3"   "Post_AT_WT4"  
##  [89] "Post_AT_LA4"   "Post_AT_WT1"   "Post_AT_CT5"   "Post_AT_LA5"  
##  [93] "Post_AT_SC2"   "Post_ST_SD5"   "Post_ST_SR2"   "Post_ST_SR1"  
##  [97] "Post_ST_SR3"   "Post_ST_SR4"   "Post_ST_SR5"   "Post_NCSS_P1" 
## [101] "Post_NCSS_P3"  "Post_NCSS_P4"  "Post_NCSS_P5"  "Post_AT_SC1"  
## [105] "Post_AT_SC3"   "Post_AT_SC4"   "Post_AT_SC5"
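
Beyond column names, a few base R checks help confirm that a file was read as intended. The sketch below uses a small inline CSV with made-up values so that it runs without a network connection. Only the column names are borrowed from the dataset.

```r
# A tiny made-up CSV standing in for the real file (values are illustrative)
csv_text <- "UNQ_ID,GENDER,Process_E_LS
1,Female,5
2,Male,4
3,Female,NA"

toy <- read.csv(text = csv_text, header = TRUE, sep = ",")

dim(toy)               # number of rows and columns
str(toy)               # data type of each column
colSums(is.na(toy))    # count of missing values per column
head(toy, 2)           # first two rows
```

The same four calls can be applied to C2Cagg once it has been loaded.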

Speaking the Language of ggplot2

ggplot2 is based on the grammar of graphics, in which a plot is built up in layers from components such as the data, aesthetic mappings (aes) and geometric objects (geoms).
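
A minimal sketch of how these components combine in code is shown below. It uses R's built-in mtcars dataset rather than the lesson data, an assumption made so that the example is self-contained, and it produces a simple scatter plot.

```r
library(ggplot2)

# 1. Data: the data frame to plot
# 2. Aesthetics: which columns map to x, y, colour and so on
p <- ggplot(data = mtcars, aes(x = wt, y = mpg))

# 3. Geometry: how each observation is drawn (points give a scatter plot)
p <- p + geom_point(colour = "blue", alpha = 0.7)

# 4. Labels and theme: titles, axis text and overall appearance
p <- p +
  labs(title = "Fuel Economy by Weight",
       x = "Weight (1000 lbs)",
       y = "Miles per Gallon") +
  theme_minimal()

# Printing the object draws the plot
p
```

Each + adds one more layer or setting to the same plot object, which is why the plots later in this lesson read as a chain of function calls.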

Data Visualisation with ggplot2

ggplot2 is an R package from the Tidyverse family which is dedicated to data visualisation. In this session, you will learn how to create the following plots, which can help you communicate your data and findings better. This will benefit your CWF graded assignment for both components 2 and 3.

These are the types of plots which we will cover in this session.

Histograms
Boxplots
Density Plots
Ridgeline Plots
Violin Plots
Heatmaps
Word Clouds

Enjoy !

Histograms

Histograms for Process Evaluation

We will generate a histogram to show the distribution of ratings for event logistics (field : Process_E_LS), using code structure based on dplyr pipes.

C2Cagg%>%
ggplot(aes(x=Process_E_LS)) + 
geom_histogram(binwidth=1, fill="green", color="black", alpha=0.9) +
ggtitle("Process Evaluation for Logistics") + ylim(0,1000)

Next, we will generate a histogram to show the distribution of ratings for instructors (field : Process_E_EX).

C2Cagg%>%
ggplot(aes(x=Process_E_EX)) + 
geom_histogram(binwidth=1, fill="blue", color="black", alpha=0.9) +
ggtitle("Process Evaluation for Instructors") + ylim(0,1000)

Next, we will generate a histogram to show the distribution of ratings for programme activities (field : Process_E_CT).

C2Cagg%>%
ggplot(aes(x=Process_E_CT)) + 
geom_histogram(binwidth=1, fill="orange", color="black", alpha=0.9) +
ggtitle("Process Evaluation for Programme") + ylim(0,1000)
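
To see what a histogram counts, the sketch below uses made-up whole-number ratings and compares a frequency table with the bin counts that ggplot2 computes for the bars. The data frame and its values are illustrative.

```r
library(ggplot2)

# Made-up ratings on a 1 to 5 scale (not from the lesson dataset)
toy <- data.frame(rating = c(1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5))

# Frequency of each rating value
table(toy$rating)

# The same distribution drawn as a histogram with bins one unit wide
h <- ggplot(toy, aes(x = rating)) +
  geom_histogram(binwidth = 1, fill = "green", color = "black", alpha = 0.9)

# layer_data() exposes the bins and counts that ggplot2 calculated
bins <- layer_data(h)
bins[, c("xmin", "xmax", "count")]
```

The counts across all bins add up to the number of rows in the data.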

Boxplots

Boxplots for Differentiated Process Evaluation - by Gender

We will generate a boxplot to show the differentiated ratings by gender for event logistics (field : Process_E_LS) using dplyr pipes.

C2Cagg%>%
ggplot(aes(x=GENDER, y=Process_E_LS)) + 
geom_boxplot(color="red", fill="green", alpha=0.2) + 
ggtitle("Process Evaluation for Logistics by Gender") + ylim(0,7)

We will now generate a boxplot to show the differentiated ratings by gender for instructors (field : Process_E_EX).

C2Cagg%>%
ggplot(aes(x=GENDER, y=Process_E_EX)) + 
geom_boxplot(color="red", fill="blue", alpha=0.2) + 
ggtitle("Process Evaluation for Instructors by Gender") + ylim(0,7)

We will now generate a boxplot to show the differentiated ratings by gender for programmes (field : Process_E_CT).

C2Cagg%>%
ggplot(aes(x=GENDER, y=Process_E_CT)) + 
geom_boxplot(color="red", fill="orange", alpha=0.2) + 
ggtitle("Process Evaluation for Programme by Gender") + ylim(0,7)
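
A boxplot summarises each group by its median and quartiles. As a sketch of the numbers behind the boxes, the following uses base R on made-up ratings. The data frame and its values are illustrative only.

```r
# Made-up ratings for two groups (not from the lesson dataset)
toy <- data.frame(
  GENDER = c("Female", "Female", "Female", "Female",
             "Male", "Male", "Male", "Male"),
  rating = c(3, 4, 5, 5, 2, 3, 4, 4)
)

# Median rating per group: the thick line inside each box
group_medians <- tapply(toy$rating, toy$GENDER, median)
group_medians

# Quartiles per group: the 25% and 75% values form the edges of each box
tapply(toy$rating, toy$GENDER, quantile, probs = c(0.25, 0.5, 0.75))
```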

Boxplots for Differentiated Process Evaluation - by Event

We will generate a boxplot to show the differentiated ratings for event logistics by event (field : Process_E_LS), using dplyr pipes.

C2Cagg%>%
ggplot(aes(x=EVENT, y=Process_E_LS)) + 
geom_boxplot(color="red", fill="green", alpha=0.2) + 
ggtitle("Process Evaluation for Logistics by Event") + 
ylim(0,6) + coord_flip()

We will now generate a boxplot to show the differentiated ratings for instructors by event (field : Process_E_EX)

C2Cagg%>%
ggplot(aes(x=EVENT, y=Process_E_EX)) + 
geom_boxplot(color="red", fill="blue", alpha=0.2) + 
ggtitle("Process Evaluation for Instructors by Event") + 
ylim(0,6) + coord_flip()

We will now generate a boxplot to show the differentiated ratings for programme by event (field : Process_E_CT)

C2Cagg%>%
ggplot(aes(x=EVENT, y=Process_E_CT)) + 
geom_boxplot(color="red", fill="orange", alpha=0.2) + 
ggtitle("Process Evaluation for Programme by Event") + 
ylim(0,6) + coord_flip()

Boxplots for Impact Evaluation - Aggregated Factors Compared Across Gender

Comparison of Gender in MR1 Aggregated Factor Impacts

C2Cagg%>%
ggplot(aes(x=GENDER, y=X_MR1, fill=GENDER)) + 
geom_boxplot() + 
ggtitle("Gender x MR1 Gains") + ylim(-3,3)

## Warning: Removed 16 rows containing non-finite values (stat_boxplot).
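
This warning means that some rows were dropped before the boxplot was drawn: ylim() discards values outside the stated limits, and missing values are dropped as well. The sketch below uses made-up numbers to show how such rows can be counted in advance.

```r
# Made-up change scores, including one missing and two out-of-range values
x_change <- c(-3.5, -1, 0, 0.5, NA, 2, 3.2)

lower <- -3
upper <- 3

# Rows that ylim(lower, upper) would remove: missing or outside the limits
dropped <- is.na(x_change) | x_change < lower | x_change > upper
sum(dropped)
```

If the aim is only to zoom in without discarding data, coord_cartesian(ylim = c(-3, 3)) is an alternative that keeps all rows in the boxplot calculation.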

Comparison of Gender in MR2 Aggregated Factor Impacts

C2Cagg%>%
ggplot(aes(x=GENDER, y=X_MR2, fill=GENDER)) + 
geom_boxplot() + 
ggtitle("Gender x MR2 Gains") + ylim(-3,3)

Comparison of Gender in MR3 Aggregated Factor Impacts

C2Cagg%>%
ggplot(aes(x=GENDER, y=X_MR3, fill=GENDER)) + 
geom_boxplot() + 
ggtitle("Gender x MR3 Gains") + ylim(-3,3)

## Warning: Removed 49 rows containing non-finite values (stat_boxplot).

Comparison of Gender in MR4 Aggregated Factor Impacts

C2Cagg%>%
ggplot(aes(x=GENDER, y=X_MR4, fill=GENDER)) + 
geom_boxplot() + 
ggtitle("Gender x MR4 Gains") + ylim(-3,3)

Comparison of Gender in MR5 Aggregated Factor Impacts

C2Cagg%>%
ggplot(aes(x=GENDER, y=X_MR5, fill=GENDER)) + 
geom_boxplot() + 
ggtitle("Gender x MR5 Gains") + ylim(-3,3)

## Warning: Removed 2 rows containing non-finite values (stat_boxplot).

Density Plots

Density Plots for Aggregated Factors Compared Across Experience

The following density plots are constructed to examine whether knowing someone with a mental health condition (field : NCSS_B03A) is associated with the change in specific aggregated factors.

Density Plot for NCSS_B03A and MR1

C2Cagg%>%
ggplot(aes(x=X_MR1, color=NCSS_B03A, fill=NCSS_B03A)) +
geom_density(alpha=0.3,size=1)+ 
labs(x= "Change in MR1 Aggregated Scores Following Event",
subtitle="",
caption="Kruskal-Wallis chi-squared = 148.25, df = 70, p-value = 1.495e-07")
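
The caption reports a Kruskal-Wallis test result. The call that produced those figures is not shown in this lesson, so the following is only a sketch of how such a test can be run with base R's kruskal.test(), using made-up scores and group labels.

```r
# Made-up change scores for three groups (not from the lesson dataset)
toy <- data.frame(
  score = c(0.2, 0.5, 0.1, 0.4, 1.1, 1.3, 0.9, 1.2, -0.3, -0.1, -0.4, 0.0),
  group = rep(c("A", "B", "C"), each = 4)
)

# Kruskal-Wallis rank sum test: does the distribution of score differ by group?
kw <- kruskal.test(score ~ group, data = toy)

kw$statistic    # the chi-squared value
kw$parameter    # degrees of freedom (number of groups minus 1)
kw$p.value      # the p-value
```

These three values correspond to the three figures quoted in the plot caption.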

Density Plot for NCSS_B03A and MR2

C2Cagg%>%
ggplot(aes(x=X_MR2, color=NCSS_B03A, fill=NCSS_B03A)) +
geom_density(alpha=0.3,size=1)+ 
labs(x= "Change in MR2 Aggregated Scores Following Event",
subtitle="",
caption="Kruskal-Wallis chi-squared = 45.786, df = 15, p-value = 5.749e-05")

Density Plot for NCSS_B03A and MR3

C2Cagg%>%
ggplot(aes(x=X_MR3, color=NCSS_B03A, fill=NCSS_B03A)) +
geom_density(alpha=0.3,size=1)+ 
labs(x= "Change in MR3 Aggregated Scores Following Event",
subtitle="",
caption="Kruskal-Wallis chi-squared = 125.34, df = 50, p-value = 2.1e-08")

Density Plot for NCSS_B03A and MR4

C2Cagg%>%
ggplot(aes(x=X_MR4, color=NCSS_B03A, fill=NCSS_B03A)) +
geom_density(alpha=0.3,size=1)+ 
labs(x= "Change in MR4 Aggregated Scores Following Event",
subtitle="",
caption="Kruskal-Wallis chi-squared = 44.57, df = 16, p-value = 0.0001615")

Density Plot for NCSS_B03A and MR5

C2Cagg%>%
ggplot(aes(x=X_MR5, color=NCSS_B03A, fill=NCSS_B03A)) +
geom_density(alpha=0.3,size=1)+ 
labs(x= "Change in MR5 Aggregated Scores Following Event",
subtitle="",
caption="Kruskal-Wallis chi-squared = 42.013, df = 20, p-value = 0.002755")

Ridgeline Plots

Ridgeline Plots for Aggregated Factors Compared Across Education Level

The following ridgeline plots are constructed to examine whether participants’ education level (field : NCSS_B05) is associated with the change in specific aggregated factors.

Ridgeline Plot for NCSS_B05 and MR1

C2Cagg%>%
ggplot(aes(x = X_MR1, y = NCSS_B05, fill = NCSS_B05)) +
geom_density_ridges() +
theme_ridges() + 
theme(legend.position = "none")

## Picking joint bandwidth of 0.355
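
The message above reports the kernel bandwidth that ggridges selected for smoothing the density curves. As an illustration of what a bandwidth is, and not a reproduction of the internal ggridges calculation, the sketch below uses base R's bw.nrd0() and density() on made-up data.

```r
set.seed(1)

# Made-up scores (not from the lesson dataset)
scores <- rnorm(200, mean = 0, sd = 1)

# A rule-of-thumb bandwidth for a kernel density estimate
bw <- bw.nrd0(scores)
bw

# A smaller bandwidth gives a wigglier curve, a larger one a smoother curve
d_narrow <- density(scores, bw = bw / 2)
d_wide <- density(scores, bw = bw * 2)

c(narrow = d_narrow$bw, wide = d_wide$bw)
```

In ggridges, the bandwidth can also be set explicitly through the bandwidth argument of geom_density_ridges() when a fixed amount of smoothing is preferred.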

Ridgeline Plot for NCSS_B05 and MR2

C2Cagg%>%
ggplot(aes(x = X_MR2, y = NCSS_B05, fill = NCSS_B05)) +
geom_density_ridges() +
theme_ridges() + 
theme(legend.position = "none")

## Picking joint bandwidth of 0.211

Ridgeline Plot for NCSS_B05 and MR3

C2Cagg%>%
ggplot(aes(x = X_MR3, y = NCSS_B05, fill = NCSS_B05)) +
geom_density_ridges() +
theme_ridges() + 
theme(legend.position = "none")

## Picking joint bandwidth of 0.393

Ridgeline Plot for NCSS_B05 and MR4

C2Cagg%>%
ggplot(aes(x = X_MR4, y = NCSS_B05, fill = NCSS_B05)) +
geom_density_ridges() +
theme_ridges() + 
theme(legend.position = "none")

## Picking joint bandwidth of 0.233

Ridgeline Plot for NCSS_B05 and MR5

C2Cagg%>%
ggplot(aes(x = X_MR5, y = NCSS_B05, fill = NCSS_B05)) +
geom_density_ridges() +
theme_ridges() + 
theme(legend.position = "none")

## Picking joint bandwidth of 0.259

Violin Plots

Violin Plots for Aggregated Factors Compared Across Engagement Level

The following violin plots are constructed to examine whether participants’ engagement level (field : PARTICIPATION) is associated with the change in specific aggregated factors.

First, we will need to ensure the PARTICIPATION field is a factor.

C2Cagg$PARTICIPATION <- as.factor(C2Cagg$PARTICIPATION)
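
If a grouping variable is stored as a number, ggplot2 treats it as a continuous measurement. Converting it to a factor makes it a set of discrete groups, so that one violin is drawn per group. A small sketch with made-up values:

```r
# Made-up engagement levels stored as numbers (not from the lesson dataset)
participation <- c(1, 3, 2, 3, 1, 2, 3)

is.numeric(participation)     # TRUE: treated as a continuous measurement

participation_f <- as.factor(participation)

is.factor(participation_f)    # TRUE: treated as discrete categories
levels(participation_f)       # the distinct groups, stored as text labels
table(participation_f)        # number of observations in each group
```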

Violin Plot for PARTICIPATION and MR1

C2Cagg%>%
ggplot(aes(x=PARTICIPATION, 
           y=X_MR1, 
           fill=PARTICIPATION)) + 
  geom_violin()

Violin Plot for PARTICIPATION and MR2

C2Cagg%>%
ggplot(aes(x=PARTICIPATION, 
           y=X_MR2, 
           fill=PARTICIPATION)) + 
  geom_violin()

Violin Plot for PARTICIPATION and MR3

C2Cagg%>%
ggplot(aes(x=PARTICIPATION, 
           y=X_MR3, 
           fill=PARTICIPATION)) + 
geom_violin()

Violin Plot for PARTICIPATION and MR4

C2Cagg%>%
ggplot(aes(x=PARTICIPATION, 
           y=X_MR4, 
           fill=PARTICIPATION)) + 
geom_violin()

Violin Plot for PARTICIPATION and MR5

C2Cagg%>%
ggplot(aes(x=PARTICIPATION, 
           y=X_MR5, 
           fill=PARTICIPATION)) + 
geom_violin()

Heatmaps

Building a Heatmap with a Data Subset

We will build a simple heatmap to show the impact of various events on the aggregated factors.

Impact on MR1 Compared Across Various Events

C2Cagg%>%
ggplot(aes(x=X_MR1, y=EVENT, fill = X_MR1)) + 
geom_tile() + 
xlab(label = "Impact on MR1 : Addressing Stigma") +
scale_fill_gradient(name = "MR1 Shift",
                      low = "#FFFFFF",
                      high = "#012345")

Impact on MR2 Compared Across Various Events

C2Cagg%>%
ggplot(aes(x=X_MR2, y=EVENT, fill = X_MR2)) + 
geom_tile() + 
xlab(label = "Impact on MR2 : Mental Health Literacy : Wishful Thinking") +
scale_fill_gradient(name = "MR2 Shift",
                      low = "white",
                      high = "blue")

Impact on MR3 Compared Across Various Events

C2Cagg%>%
ggplot(aes(x=X_MR3, y=EVENT, fill = X_MR3)) + 
geom_tile() + 
xlab(label = "Impact on MR3 : Promoting Mental Health Advocacy") +
scale_fill_distiller(name = "MR3 Shift",
                      palette = "RdPu")

Impact on MR4 Compared Across Various Events

C2Cagg%>%
ggplot(aes(x=X_MR4, y=EVENT, fill = X_MR4)) + 
geom_tile() + 
xlab(label = "Impact on MR4 : Mental Health Literacy : Relationships") +
scale_fill_distiller(name = "MR4 Shift",
                      palette = "Spectral")

Impact on MR5 Compared Across Various Events

C2Cagg%>%
ggplot(aes(x=X_MR5, y=EVENT, fill = X_MR5)) + 
geom_tile() + 
xlab(label = "Impact on MR5 : Mental Health Literacy : Social Constructivism") +
scale_fill_distiller(name = "MR5 Shift",
                      palette = "Pastel1")

Colour Palette

Note that the color palette was varied in the heatmap examples above using two different methods

scale_fill_gradient() to provide extreme colors of the palette
scale_fill_distiller() to provide a ColorBrewer palette
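
To preview the colours that lie between two endpoint colours, base R's colorRampPalette() can be used, as sketched below with the same endpoints as the MR1 heatmap. This is an approximation for illustration only: ggplot2 performs its own interpolation, which may produce slightly different intermediate shades.

```r
# Build a function that interpolates between the two endpoint colours
ramp <- colorRampPalette(c("#FFFFFF", "#012345"))

# Five evenly spaced colours from low to high
shades <- ramp(5)
shades
```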

Word Clouds

Sample Data for Creation of Word Clouds

In order to create a word cloud with ggwordcloud we would need at least a data frame containing the words and optionally a numerical column which will be used to scale the texts. In this section we will use the thankyou_words_small data set from the package for illustration purposes. You will be taught how to create a corpus based on qualitative datasets which can be used to generate word clouds in a subsequent lesson.

# install.packages("ggwordcloud")
library(ggwordcloud)
df <- thankyou_words_small
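
The same two-column structure (a word and a number) can be built from your own text. The following is a minimal base R sketch using made-up responses. The column names word and freq are hypothetical choices for this example.

```r
# Made-up responses (not from the lesson dataset)
responses <- c("great event", "great music", "fun event", "great fun")

# Split each response into words and flatten into a single vector
words <- unlist(strsplit(tolower(responses), " ", fixed = TRUE))

# Count how often each word appears
counts <- table(words)

# Data frame with one row per word, sorted from most to least frequent
word_df <- data.frame(
  word = names(counts),
  freq = as.integer(counts),
  stringsAsFactors = FALSE
)
word_df <- word_df[order(-word_df$freq), ]
word_df
```

A data frame like word_df could then be passed to ggplot() with aes(label = word, size = freq) in place of thankyou_words_small.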

Basic Word Cloud

ggwordcloud provides a ggplot2 geom named geom_text_wordcloud for creating word clouds. Use your data frame and pass the column containing the texts to the label argument of aes and use the geom_text_wordcloud function. Note that we are setting a seed to keep the example reproducible, as the algorithm used for placing the texts involves some randomness.

# load "ggwordcloud"
library(ggwordcloud)

# Data
df <- thankyou_words_small

set.seed(1)
ggplot(df, aes(label = word)) +
  geom_text_wordcloud() +
  theme_minimal()

Size of the Text based on a Variable

So far all the words had the same size. If you want to set the size based on a numerical variable you can pass it to the size argument of aes.

# install.packages("ggwordcloud")
library(ggwordcloud)

# Data
df <- thankyou_words_small

set.seed(1)
ggplot(df, aes(label = word, size = speakers)) +
  geom_text_wordcloud() +
  theme_minimal()

Basic Word Cloud with Base R Syntax

Alternatively, you could use the ggwordcloud function and specify the words and frequencies (which determine the relative size of each text) to create the word cloud with a single function call. Note that this function provides more arguments which you can customize.

# install.packages("ggwordcloud")
library(ggwordcloud)

# Data
df <- thankyou_words_small

set.seed(1)
ggwordcloud(words = df$word, freq = df$speakers)

Scaling (Font Size)

The default text scaling of ggplot2 (square root scaling) makes the word cloud look small relative to the plot area. For this reason, you could use the scale_size_area function as follows to obtain better control over the font size.

# install.packages("ggwordcloud")
library(ggwordcloud)

# Data
df <- thankyou_words_small

set.seed(1)
ggplot(df, aes(label = word, size = speakers)) +
  geom_text_wordcloud() +
  scale_size_area(max_size = 20) +
  theme_minimal()

Removing Texts that Do Not Fit

If you have too many words and a big font size, you can set the rm_outside argument of geom_text_wordcloud to TRUE to remove the overflowing texts, or decrease the font size so that they fit.

# install.packages("ggwordcloud")
library(ggwordcloud)
# install.packages("ggforce")
library(ggforce)

# Data
df <- thankyou_words_small

set.seed(1)
ggplot(df, aes(label = word, size = speakers)) +
  geom_text_wordcloud(rm_outside = TRUE) +
  scale_size_area(max_size = 60) +
  theme_minimal()

## Some words could not fit on page. They have been removed.

Text Rotation

Note that you can also rotate the texts with the angle argument of aes. In the following example we are creating a new column randomly to represent the desired angles to rotate each text.

# install.packages("ggwordcloud")
library(ggwordcloud)

set.seed(1)

# Data
df <- thankyou_words_small
df$angle <- sample(c(0, 45, 60, 90, 120, 180), nrow(df), replace = TRUE)

ggplot(df, aes(label = word, size = speakers, angle = angle)) +
  geom_text_wordcloud() +
  scale_size_area(max_size = 20) +
  theme_minimal()

Shape of the Word Cloud

By default, the shape of the word cloud is circular. However, it is possible to change the shape of the cloud with the shape argument of the geom_text_wordcloud function. Possible shapes are named “circle” (default), “cardioid”, “diamond”, “pentagon”, “star”, “square”, “triangle-forward” and “triangle-upright”. In the following blocks of code you can check a couple of examples.

Diamond Shape

# install.packages("ggwordcloud")
library(ggwordcloud)

# Data
df <- thankyou_words_small

set.seed(1)
ggplot(df, aes(label = word, size = speakers)) +
  geom_text_wordcloud(shape = "diamond") +
  scale_size_area(max_size = 20) +
  theme_minimal()

Star Shape

# install.packages("ggwordcloud")
library(ggwordcloud)

# Data
df <- thankyou_words_small

set.seed(1)
ggplot(df, aes(label = word, size = speakers)) +
  geom_text_wordcloud(shape = "star") +
  scale_size_area(max_size = 20) +
  theme_minimal()

Using a Mask

Alternatively you can use a PNG file to create a mask to place the words within it. Note that the non-transparent pixels of the image will be used as the mask. In the following example we are using a sample PNG file from the ggwordcloud package with the shape of a heart to create the mask.

# install.packages("ggwordcloud")
library(ggwordcloud)

# Data
df <- thankyou_words_small

# Mask
mask_png <- png::readPNG(system.file("extdata/hearth.png",
      package = "ggwordcloud", mustWork = TRUE))

set.seed(1)
ggplot(df, aes(label = word, size = speakers)) +
  geom_text_wordcloud(mask = mask_png) +
  scale_size_area(max_size = 20) +
  theme_minimal()

Color of the Texts

Unique Color

When creating a word cloud with ggwordcloud the color of the texts is black by default. Nevertheless, you can customize the color by passing a color to the color argument of geom_text_wordcloud.

# install.packages("ggwordcloud")
library(ggwordcloud)

# Data
df <- thankyou_words_small

set.seed(1)
ggplot(df, aes(label = word, size = speakers)) +
  geom_text_wordcloud(color = "red") +
  scale_size_area(max_size = 20) +
  theme_minimal()

Color based on a Variable

You can also set the color based on a categorical variable. This will allow you to color the text by groups or set a different color for each text, as in the example below.

# install.packages("ggwordcloud")
library(ggwordcloud)

# Data
df <- thankyou_words_small

set.seed(1)
ggplot(df, aes(label = word, size = speakers, color = name)) +
  geom_text_wordcloud() +
  scale_size_area(max_size = 20) +
  theme_minimal()

Color Gradient

Finally, if you want to create a gradient you can pass a numerical variable to the color argument of aes and use a color scale such as scale_color_gradient to create the gradient color scale.

# install.packages("ggwordcloud")
library(ggwordcloud)

# Data
df <- thankyou_words_small

set.seed(1)
ggplot(df, aes(label = word, size = speakers, color = speakers)) +
  geom_text_wordcloud() +
  scale_size_area(max_size = 20) +
  theme_minimal() +
  scale_color_gradient(low = "darkred", high = "red")

Try It Out !

Developing Visualisations for Process and Impact Evaluation Survey Data

In your teams please attempt questions 1 and 2 below.

Question 1 : Develop data visualisations using ggplot2 code that display how a specific process indicator value (e.g. any input, activity or output indicator) differs between demographic factors (e.g. gender, age range). This may be based on the survey that you have created for CWF Component 2.

Possible Process Indicators
Input : Sound and Video Quality, Talents
Activity : Duration of the Video
Output : No of Views

Possible Graph Types
Boxplot differentiated by demographic factor
Density Plot differentiated by demographic factor

Example of a boxplot showing differentiated ratings by gender for event logistics (field : Process_E_LS)

C2Cagg%>%
ggplot(aes(x=GENDER, y=Process_E_LS)) + 
geom_boxplot(color="red", fill="green", alpha=0.2) + 
ggtitle("Process Evaluation for Logistics by Gender") + ylim(0,7)

Question 2 : Develop a single plot containing multiple histograms representing the distribution of the various impact indicators (knowledge, skills, attitudes, behaviour) in the surveyed population.

Example of a grid plot showing three histograms with distribution of evaluation for logistics, instructors and programme

#load the gridExtra package
library(gridExtra)

#plot the first histogram in column 1 without dplyr pipes
Histogram01 <- ggplot(C2Cagg, aes(x=Process_E_LS)) + 
geom_histogram(binwidth=1, fill="green", color="black", alpha=0.9) +
ggtitle("Logistics") + ylim(0,1000)

#plot the second histogram in column 2 without dplyr pipes
Histogram02 <- ggplot(C2Cagg, aes(x=Process_E_EX)) + 
geom_histogram(binwidth=1, fill="blue", color="black", alpha=0.9) +
ggtitle("Instructors") + ylim(0,1000)

#plot the third histogram in column 3 without dplyr pipes
Histogram03 <- ggplot(C2Cagg, aes(x=Process_E_CT)) + 
geom_histogram(binwidth=1, fill="orange", color="black", alpha=0.9) +
ggtitle("Programme") + ylim(0,1000)

#generate the multi-faceted plot using the grid.arrange function
grid.arrange(Histogram01, Histogram02, Histogram03, ncol = 3)

References

ggplot2: Elegant Graphics for Data Analysis (Use R!)

by Hadley Wickham
online version : https://ggplot2-book.org/

citation : Wickham, H. (2016). ggplot2: Elegant Graphics for Data Analysis (2nd ed.). Springer

R Graphics Cookbook, 2nd edition

by Winston Chang
online version : https://r-graphics.org/

citation : Chang, W. (2018). R Graphics Cookbook (2nd ed.). O’Reilly