S3729C Wellness and Health Management - Lesson 11
Class Date : 3 September 2022
At the end of this lesson, students will be able to:
Please follow the steps outlined below to set up your R Studio Cloud Account, which will be used for the next 5 lessons in this module.
Congratulations, your R Studio Cloud Account is now ready !
The R Studio IDE Workspace consists of the following key components
This is the code for installation of Pacman which is used to unpack all packages required in this lesson.
## package 'pacman' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\aaron_chen_angus\AppData\Local\Temp\RtmpWmF5zm\downloaded_packages
We will then proceed to load the packages required for this section
## Error in get(genname, envir = envir) : object 'testthat_print' not found
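The installation and loading code itself is not reproduced in this handout, only its console output. A minimal sketch of the idea follows, assuming the lesson needs ggplot2, dplyr, ggridges, ggwordcloud and gridExtra (this list is inferred from the functions used in later sections and is not stated in the handout). The helper `missing_pkgs` is a hypothetical name introduced for illustration.

```r
# Packages assumed to be needed in this lesson (inferred from later sections)
required <- c("ggplot2", "dplyr", "ggridges", "ggwordcloud", "gridExtra")

# Hypothetical helper: return the packages that are not installed yet
missing_pkgs <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Report what still needs installing (character(0) means nothing is missing)
print(missing_pkgs(required))

# With pacman, installing and loading can then be done in one step:
# if (!requireNamespace("pacman", quietly = TRUE)) install.packages("pacman")
# pacman::p_load(char = required)
```

The last two lines are left commented out so that nothing is installed until you choose to run them.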
We will illustrate two methods of introducing source data files for subsequent analysis in these five lessons on data science:
1. Uploading files directly into the R Studio Cloud environment
2. Accessing files via a link in GitHub
Step 1 :
Step 2 :
Step 1 : Go to https://github.com (sign up for an account for free if you have not done so)
Step 2 : Create a new repository
Step 3 : Upload the file into your created repository
Step 4 : Click on the RAW tab
Step 5 : Copy the URL of the Raw file for Referencing
You can now import and read the source data file from the following GitHub link using the read.csv function.
Data Source : https://raw.githubusercontent.com/aaron-chen-angus/community2campus/main/C2CsurveyAggregated.csv
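The import step can be sketched as follows. The object name `C2Cagg` is taken from the plotting code later in this lesson. Wrapping the call in `tryCatch` is an addition made here so that the script keeps running if the file cannot be reached.

```r
# URL of the raw CSV file, as given in the Data Source line above
data_url <- "https://raw.githubusercontent.com/aaron-chen-angus/community2campus/main/C2CsurveyAggregated.csv"

# Read the file; return NULL instead of stopping if it cannot be reached
C2Cagg <- tryCatch(
  read.csv(data_url),
  error = function(e) {
    message("Could not read the file: ", conditionMessage(e))
    NULL
  }
)

# Report the number of rows and columns if the download worked
if (!is.null(C2Cagg)) print(dim(C2Cagg))
```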
Check on the output and hence the integrity of the loaded data by reading the column names
## [1] "UNQ_ID" "AGE.RANGE" "GENDER" "NCSS_B03A"
## [5] "NCSS_B03B" "NCSS_B04" "NCSS_B05" "EVENT"
## [9] "PARTICIPATION" "PART_ROLE" "AFFILIATION" "Process_E_LS"
## [13] "Process_E_EX" "Process_E_CT" "Pre_MR1" "Post_MR1"
## [17] "X_MR1" "Pre_MR2" "Post_MR2" "X_MR2"
## [21] "Pre_MR3" "Post_MR3" "X_MR3" "Pre_MR4"
## [25] "Post_MR4" "X_MR4" "Pre_MR5" "Post_MR5"
## [29] "X_MR5" "Pre_AT_PT1" "Pre_AT_PT2" "Pre_AT_PT3"
## [33] "Pre_AT_WT5" "Pre_AT_CT1" "Pre_AT_CT2" "Pre_AT_CT3"
## [37] "Pre_AT_CT4" "Pre_AT_LA1" "Pre_AT_LA2" "Pre_AT_LA3"
## [41] "Pre_AT_SC6" "Pre_ST_SD1" "Pre_ST_SD2" "Pre_ST_SD3"
## [45] "Pre_ST_SD4" "Pre_NCSS_P2" "Pre_AT_WT2" "Pre_AT_WT3"
## [49] "Pre_AT_WT4" "Pre_AT_LA4" "Pre_AT_WT1" "Pre_AT_CT5"
## [53] "Pre_AT_LA5" "Pre_AT_SC2" "Pre_ST_SD5" "Pre_ST_SR2"
## [57] "Pre_ST_SR1" "Pre_ST_SR3" "Pre_ST_SR4" "Pre_ST_SR5"
## [61] "Pre_NCSS_P1" "Pre_NCSS_P3" "Pre_NCSS_P4" "Pre_NCSS_P5"
## [65] "Pre_AT_SC1" "Pre_AT_SC3" "Pre_AT_SC4" "Pre_AT_SC5"
## [69] "Post_AT_PT1" "Post_AT_PT2" "Post_AT_PT3" "Post_AT_WT5"
## [73] "Post_AT_CT1" "Post_AT_CT2" "Post_AT_CT3" "Post_AT_CT4"
## [77] "Post_AT_LA1" "Post_AT_LA2" "Post_AT_LA3" "Post_ST_SD1"
## [81] "Post_ST_SD2" "Post_ST_SD3" "Post_ST_SD4" "Post_AT_SC6"
## [85] "Post_NCSS_P2" "Post_AT_WT2" "Post_AT_WT3" "Post_AT_WT4"
## [89] "Post_AT_LA4" "Post_AT_WT1" "Post_AT_CT5" "Post_AT_LA5"
## [93] "Post_AT_SC2" "Post_ST_SD5" "Post_ST_SR2" "Post_ST_SR1"
## [97] "Post_ST_SR3" "Post_ST_SR4" "Post_ST_SR5" "Post_NCSS_P1"
## [101] "Post_NCSS_P3" "Post_NCSS_P4" "Post_NCSS_P5" "Post_AT_SC1"
## [105] "Post_AT_SC3" "Post_AT_SC4" "Post_AT_SC5"
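Beyond reading the names by eye, the same check can be done programmatically. The sketch below uses a small made-up data frame in place of the survey file, and `missing_columns` is a hypothetical helper name rather than a function from any package.

```r
# Made-up stand-in for the survey data (three of the real column names)
toy <- data.frame(
  UNQ_ID = 1:3,
  GENDER = c("Female", "Male", "Female"),
  Process_E_LS = c(4, 5, 3)
)

# Hypothetical helper: which expected columns are absent from the data?
missing_columns <- function(df, expected) {
  setdiff(expected, colnames(df))
}

# character(0) means every expected column was found
print(missing_columns(toy, c("UNQ_ID", "GENDER", "Process_E_LS")))

# A column that is not present is reported by name
print(missing_columns(toy, c("GENDER", "EVENT")))
```

With the real data you would pass `C2Cagg` and the fields you intend to plot.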
ggplot2 is based on the grammar of graphics, which has the features described below.
ggplot2 is an R package from the Tidyverse family which is dedicated to data visualization. In this session, you will learn how to create plots which can help you communicate your data and findings better. This will benefit your CWF graded assignment for both components 2 and 3.
These are the types of plots which we will cover in this session.
Enjoy !
We will generate a histogram to show the distribution of ratings for event logistics (field : Process_E_LS), using code structure based on dplyr pipes.
C2Cagg%>%
ggplot(aes(x=Process_E_LS)) +
geom_histogram(binwidth=1, fill="green", color="black", alpha=0.9) +
ggtitle("Process Evaluation for Logistics") + ylim(0,1000)
Next, we will generate a histogram to show the distribution of ratings for instructors (field : Process_E_EX).
C2Cagg%>%
ggplot(aes(x=Process_E_EX)) +
geom_histogram(binwidth=1, fill="blue", color="black", alpha=0.9) +
ggtitle("Process Evaluation for Instructors") + ylim(0,1000)
Next, we will generate a histogram to show the distribution of ratings for programme activities (field : Process_E_CT).
C2Cagg%>%
ggplot(aes(x=Process_E_CT)) +
geom_histogram(binwidth=1, fill="orange", color="black", alpha=0.9) +
ggtitle("Process Evaluation for Programme") + ylim(0,1000)
We will generate a boxplot to show the differentiated ratings by gender for event logistics (field : Process_E_LS) using dplyr pipes.
C2Cagg%>%
ggplot(aes(x=GENDER, y=Process_E_LS)) +
geom_boxplot(color="red", fill="green", alpha=0.2) +
ggtitle("Process Evaluation for Logistics by Gender") + ylim(0,7)
We will now generate a boxplot to show the differentiated ratings by gender for instructors (field : Process_E_EX).
C2Cagg%>%
ggplot(aes(x=GENDER, y=Process_E_EX)) +
geom_boxplot(color="red", fill="blue", alpha=0.2) +
ggtitle("Process Evaluation for Instructors by Gender") + ylim(0,7)
We will now generate a boxplot to show the differentiated ratings by gender for programmes (field : Process_E_CT).
C2Cagg%>%
ggplot(aes(x=GENDER, y=Process_E_CT)) +
geom_boxplot(color="red", fill="orange", alpha=0.2) +
ggtitle("Process Evaluation for Programme by Gender") + ylim(0,7)
We will generate a boxplot to show the differentiated ratings for event logistics by event (field : Process_E_LS), using dplyr pipes.
C2Cagg%>%
ggplot(aes(x=EVENT, y=Process_E_LS)) +
geom_boxplot(color="red", fill="green", alpha=0.2) +
ggtitle("Process Evaluation for Logistics by Event") +
ylim(0,6) + coord_flip()
We will now generate a boxplot to show the differentiated ratings for instructors by event (field : Process_E_EX).
C2Cagg%>%
ggplot(aes(x=EVENT, y=Process_E_EX)) +
geom_boxplot(color="red", fill="blue", alpha=0.2) +
ggtitle("Process Evaluation for Instructors by Event") +
ylim(0,6) + coord_flip()
We will now generate a boxplot to show the differentiated ratings for programme by event (field : Process_E_CT).
C2Cagg%>%
ggplot(aes(x=EVENT, y=Process_E_CT)) +
geom_boxplot(color="red", fill="orange", alpha=0.2) +
ggtitle("Process Evaluation for Programme by Event") +
ylim(0,6) + coord_flip()
Comparison of Gender in MR1 Aggregated Factor Impacts
C2Cagg%>%
ggplot(aes(x=GENDER, y=X_MR1, fill=GENDER)) +
geom_boxplot() +
ggtitle("Gender x MR1 Gains") + ylim(-3,3)
## Warning: Removed 16 rows containing non-finite values (stat_boxplot).
Comparison of Gender in MR2 Aggregated Factor Impacts
C2Cagg%>%
ggplot(aes(x=GENDER, y=X_MR2, fill=GENDER)) +
geom_boxplot() +
ggtitle("Gender x MR2 Gains") + ylim(-3,3)
Comparison of Gender in MR3 Aggregated Factor Impacts
C2Cagg%>%
ggplot(aes(x=GENDER, y=X_MR3, fill=GENDER)) +
geom_boxplot() +
ggtitle("Gender x MR3 Gains") + ylim(-3,3)
## Warning: Removed 49 rows containing non-finite values (stat_boxplot).
Comparison of Gender in MR4 Aggregated Factor Impacts
C2Cagg%>%
ggplot(aes(x=GENDER, y=X_MR4, fill=GENDER)) +
geom_boxplot() +
ggtitle("Gender x MR4 Gains") + ylim(-3,3)
Comparison of Gender in MR5 Aggregated Factor Impacts
C2Cagg%>%
ggplot(aes(x=GENDER, y=X_MR5, fill=GENDER)) +
geom_boxplot() +
ggtitle("Gender x MR5 Gains") + ylim(-3,3)
## Warning: Removed 2 rows containing non-finite values (stat_boxplot).
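The warnings above arise because ylim(-3,3) sets the scale limits: values that are missing or fall outside the limits are dropped before the boxplot statistics are computed. A minimal sketch of how you might count such rows yourself is shown below, using made-up gain scores. `count_dropped` is a hypothetical helper name.

```r
# Made-up gain scores standing in for a column such as X_MR1
gains <- c(-3.5, -1.0, 0.0, 0.5, 2.0, 3.2, NA)

# Hypothetical helper: count values that are missing or outside the limits
count_dropped <- function(x, lower, upper) {
  sum(is.na(x) | x < lower | x > upper)
}

# Two values lie outside [-3, 3] and one is missing, so this prints 3
print(count_dropped(gains, -3, 3))
```

If the count is larger than you expect, consider widening the limits or using coord_cartesian(ylim = c(-3, 3)), which zooms the view without discarding data.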
The following density plots are constructed to ascertain the effect of knowing someone with a mental health condition (field : NCSS_B03A) on the impact of specific aggregated factors.
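The captions of the plots that follow report Kruskal-Wallis statistics, but the handout does not show how they were computed. A minimal sketch of running such a test with the base R function kruskal.test is given below. The data are simulated and the "Yes"/"No" group labels are assumptions made for illustration, so the output will not match the captions.

```r
# Simulated stand-in data: a gain score and a two-level grouping field
set.seed(1)
toy <- data.frame(
  X_MR1 = c(rnorm(20, mean = 0.2), rnorm(20, mean = 0.8)),
  NCSS_B03A = factor(rep(c("No", "Yes"), each = 20))
)

# Kruskal-Wallis rank sum test of the gain score across the groups
kw <- kruskal.test(X_MR1 ~ NCSS_B03A, data = toy)
print(kw)
```

The printed result has the same layout as the caption text: a chi-squared statistic, the degrees of freedom and a p-value.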
Density Plot for NCSS_B03A and MR1
C2Cagg%>%
ggplot(aes(x=X_MR1, color=NCSS_B03A, fill=NCSS_B03A)) +
geom_density(alpha=0.3,size=1)+
labs(x= "Change in MR1 Aggregated Scores Following Event",
subtitle="",
caption="Kruskal-Wallis chi-squared = 148.25, df = 70, p-value = 1.495e-07")
Density Plot for NCSS_B03A and MR2
C2Cagg%>%
ggplot(aes(x=X_MR2, color=NCSS_B03A, fill=NCSS_B03A)) +
geom_density(alpha=0.3,size=1)+
labs(x= "Change in MR2 Aggregated Scores Following Event",
subtitle="",
caption="Kruskal-Wallis chi-squared = 45.786, df = 15, p-value = 5.749e-05")
Density Plot for NCSS_B03A and MR3
C2Cagg%>%
ggplot(aes(x=X_MR3, color=NCSS_B03A, fill=NCSS_B03A)) +
geom_density(alpha=0.3,size=1)+
labs(x= "Change in MR3 Aggregated Scores Following Event",
subtitle="",
caption="Kruskal-Wallis chi-squared = 125.34, df = 50, p-value = 2.1e-08")
Density Plot for NCSS_B03A and MR4
C2Cagg%>%
ggplot(aes(x=X_MR4, color=NCSS_B03A, fill=NCSS_B03A)) +
geom_density(alpha=0.3,size=1)+
labs(x= "Change in MR4 Aggregated Scores Following Event",
subtitle="",
caption="Kruskal-Wallis chi-squared = 44.57, df = 16, p-value = 0.0001615")
Density Plot for NCSS_B03A and MR5
C2Cagg%>%
ggplot(aes(x=X_MR5, color=NCSS_B03A, fill=NCSS_B03A)) +
geom_density(alpha=0.3,size=1)+
labs(x= "Change in MR5 Aggregated Scores Following Event",
subtitle="",
caption="Kruskal-Wallis chi-squared = 42.013, df = 20, p-value = 0.002755")
The following ridgeline plots are constructed to ascertain the effect of participants’ education level (field : NCSS_B05) on the impact of specific aggregated factors.
Ridgeline Plot for NCSS_B05 and MR1
C2Cagg%>%
ggplot(aes(x = X_MR1, y = NCSS_B05, fill = NCSS_B05)) +
geom_density_ridges() +
theme_ridges() +
theme(legend.position = "none")
## Picking joint bandwidth of 0.355
Ridgeline Plot for NCSS_B05 and MR2
C2Cagg%>%
ggplot(aes(x = X_MR2, y = NCSS_B05, fill = NCSS_B05)) +
geom_density_ridges() +
theme_ridges() +
theme(legend.position = "none")
## Picking joint bandwidth of 0.211
Ridgeline Plot for NCSS_B05 and MR3
C2Cagg%>%
ggplot(aes(x = X_MR3, y = NCSS_B05, fill = NCSS_B05)) +
geom_density_ridges() +
theme_ridges() +
theme(legend.position = "none")
## Picking joint bandwidth of 0.393
Ridgeline Plot for NCSS_B05 and MR4
C2Cagg%>%
ggplot(aes(x = X_MR4, y = NCSS_B05, fill = NCSS_B05)) +
geom_density_ridges() +
theme_ridges() +
theme(legend.position = "none")
## Picking joint bandwidth of 0.233
Ridgeline Plot for NCSS_B05 and MR5
C2Cagg%>%
ggplot(aes(x = X_MR5, y = NCSS_B05, fill = NCSS_B05)) +
geom_density_ridges() +
theme_ridges() +
theme(legend.position = "none")
## Picking joint bandwidth of 0.259
The following violin plots are constructed to ascertain the effect of participants’ engagement level (field : PARTICIPATION) on the impact of specific aggregated factors.
First, we will need to ensure the PARTICIPATION field is a factor.
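The conversion code is not reproduced in this handout. One way to do it, sketched here on a made-up column (whether the real PARTICIPATION field is stored as numbers is an assumption), is the base R function as.factor.

```r
# Made-up stand-in for the PARTICIPATION column
toy <- data.frame(PARTICIPATION = c(1, 3, 2, 3, 1))

# Convert the field to a factor so each level is treated as a category
toy$PARTICIPATION <- as.factor(toy$PARTICIPATION)

# Confirm the conversion and list the resulting levels
print(is.factor(toy$PARTICIPATION))
print(levels(toy$PARTICIPATION))
```

With the real data the same assignment would be applied to `C2Cagg$PARTICIPATION`.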
Violin Plot for PARTICIPATION and MR1
Violin Plot for PARTICIPATION and MR2
Violin Plot for PARTICIPATION and MR3
Violin Plot for PARTICIPATION and MR4
Violin Plot for PARTICIPATION and MR5
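The code for the violin plots listed above is not reproduced in this handout. The sketch below shows what one of them might look like, following the pattern of the gender boxplots earlier in the lesson. It uses simulated data, and the participation levels, plot title and styling are assumptions rather than the original code.

```r
library(ggplot2)

# Simulated stand-in data: three participation levels and a gain score
set.seed(1)
toy <- data.frame(
  PARTICIPATION = factor(rep(c("1", "2", "3"), each = 30)),
  X_MR1 = rnorm(90)
)

# Violin plot of the gain score for each participation level
violin_mr1 <- ggplot(toy, aes(x = PARTICIPATION, y = X_MR1, fill = PARTICIPATION)) +
  geom_violin(alpha = 0.5) +
  ggtitle("Participation x MR1 Gains")

print(violin_mr1)
```

To produce the plots for MR2 to MR5, replace `X_MR1` with the corresponding field and adjust the title.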
We will build a simple heatmap to show the impact of various events on the aggregated factors.
Impact on MR1 Compared Across Various Events
C2Cagg%>%
ggplot(aes(x=X_MR1, y=EVENT, fill = X_MR1)) +
geom_tile() +
xlab(label = "Impact on MR1 : Addressing Stigma") +
scale_fill_gradient(name = "MR1 Shift",
low = "#FFFFFF",
high = "#012345")
Impact on MR2 Compared Across Various Events
C2Cagg%>%
ggplot(aes(x=X_MR2, y=EVENT, fill = X_MR2)) +
geom_tile() +
xlab(label = "Impact on MR2 : Mental Health Literacy : Wishful Thinking") +
scale_fill_gradient(name = "MR2 Shift",
low = "white",
high = "blue")
Impact on MR3 Compared Across Various Events
C2Cagg%>%
ggplot(aes(x=X_MR3, y=EVENT, fill = X_MR3)) +
geom_tile() +
xlab(label = "Impact on MR3 : Promoting Mental Health Advocacy") +
scale_fill_distiller(name = "MR3 Shift",
palette = "RdPu")
Impact on MR4 Compared Across Various Events
C2Cagg%>%
ggplot(aes(x=X_MR4, y=EVENT, fill = X_MR4)) +
geom_tile() +
xlab(label = "Impact on MR4 : Mental Health Literacy : Relationships") +
scale_fill_distiller(name = "MR4 Shift",
palette = "Spectral")
Impact on MR5 Compared Across Various Events
C2Cagg%>%
ggplot(aes(x=X_MR5, y=EVENT, fill = X_MR5)) +
geom_tile() +
xlab(label = "Impact on MR5 : Mental Health Literacy : Social Constructivism") +
scale_fill_distiller(name = "MR5 Shift",
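palette = "Pastel1")
Note that the color palette was varied in the heatmap examples above using two different methods: scale_fill_gradient, where the low and high colors are chosen manually, and scale_fill_distiller, where a named ColorBrewer palette is used.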
In order to create a word cloud with ggwordcloud we would need at least a data frame containing the words and optionally a numerical column which will be used to scale the texts. In this section we will use the thankyou_words_small data set from the package for illustration purposes. You will be taught how to create a corpus based on qualitative datasets which can be used to generate word clouds in a subsequent lesson.
ggwordcloud provides a ggplot2 geom named geom_text_wordcloud for creating word clouds. Use your data frame and pass the column containing the texts to the label argument of aes and use the geom_text_wordcloud function. Note that we are setting a seed to keep the example reproducible, as the algorithm used for placing the texts involves some randomness.
# load "ggwordcloud"
library(ggwordcloud)
# Data
df <- thankyou_words_small
set.seed(1)
ggplot(df, aes(label = word)) +
geom_text_wordcloud() +
theme_minimal()
So far all the words had the same size. If you want to set the size based on a numerical variable you can pass it to the size argument of aes.
# install.packages("ggwordcloud")
library(ggwordcloud)
# Data
df <- thankyou_words_small
set.seed(1)
ggplot(df, aes(label = word, size = speakers)) +
geom_text_wordcloud() +
theme_minimal()
Alternatively, you could use the ggwordcloud function and specify the words and frequency (which will determine the relative size of each text) to create the word cloud with a single function call. Note that this function provides more arguments which you can customize.
# install.packages("ggwordcloud")
library(ggwordcloud)
# Data
df <- thankyou_words_small
set.seed(1)
ggwordcloud(words = df$word, freq = df$speakers)
The default text scaling of ggplot2 (square root scaling) makes the word cloud look small with respect to the plot area. For this reason, you could use the scale_size_area function as follows to obtain better control of the font size.
# install.packages("ggwordcloud")
library(ggwordcloud)
# Data
df <- thankyou_words_small
set.seed(1)
ggplot(df, aes(label = word, size = speakers)) +
geom_text_wordcloud() +
scale_size_area(max_size = 20) +
theme_minimal()
If you have too many words and a big font size you can set the rm_outside argument of geom_text_wordcloud to TRUE or decrease the font size to remove the overflowing texts.
# install.packages("ggwordcloud")
library(ggwordcloud)
# install.packages("ggforce")
library(ggforce)
# Data
df <- thankyou_words_small
set.seed(1)
ggplot(df, aes(label = word, size = speakers)) +
geom_text_wordcloud(rm_outside = TRUE) +
scale_size_area(max_size = 60) +
theme_minimal()
## Some words could not fit on page. They have been removed.
Note that you can also rotate the texts with the angle argument of aes. In the following example we are creating a new column randomly to represent the desired angles to rotate each text.
# install.packages("ggwordcloud")
library(ggwordcloud)
set.seed(1)
# Data
df <- thankyou_words_small
df$angle <- sample(c(0, 45, 60, 90, 120, 180), nrow(df), replace = TRUE)
ggplot(df, aes(label = word, size = speakers, angle = angle)) +
geom_text_wordcloud() +
scale_size_area(max_size = 20) +
theme_minimal()
By default, the shape of the word cloud is circular. However, it is possible to change the shape of the cloud with the shape argument of the geom_text_wordcloud function. Possible shapes are named “circle” (default), “cardioid”, “diamond”, “pentagon”, “star”, “square”, “triangle-forward” and “triangle-upright”. In the following blocks of code you can check a couple of examples.
Diamond Shape
# install.packages("ggwordcloud")
library(ggwordcloud)
# Data
df <- thankyou_words_small
set.seed(1)
ggplot(df, aes(label = word, size = speakers)) +
geom_text_wordcloud(shape = "diamond") +
scale_size_area(max_size = 20) +
theme_minimal()
Star Shape
# install.packages("ggwordcloud")
library(ggwordcloud)
# Data
df <- thankyou_words_small
set.seed(1)
ggplot(df, aes(label = word, size = speakers)) +
geom_text_wordcloud(shape = "star") +
scale_size_area(max_size = 20) +
theme_minimal()
Alternatively you can use a PNG file to create a mask to place the words within it. Note that the non-transparent pixels of the image will be used as the mask. In the following example we are using a sample PNG file from the ggwordcloud package with the shape of a heart to create the mask.
# install.packages("ggwordcloud")
library(ggwordcloud)
# Data
df <- thankyou_words_small
# Mask
mask_png <- png::readPNG(system.file("extdata/hearth.png",
package = "ggwordcloud", mustWork = TRUE))
set.seed(1)
ggplot(df, aes(label = word, size = speakers)) +
geom_text_wordcloud(mask = mask_png) +
scale_size_area(max_size = 20) +
theme_minimal()
Unique Color
When creating a word cloud with ggwordcloud the color of the texts is black by default. Nevertheless, you can customize the color by passing a color to the color argument of geom_text_wordcloud.
# install.packages("ggwordcloud")
library(ggwordcloud)
# Data
df <- thankyou_words_small
set.seed(1)
ggplot(df, aes(label = word, size = speakers)) +
geom_text_wordcloud(color = "red") +
scale_size_area(max_size = 20) +
scale_color_discrete("red") +
theme_minimal()
Color based on a Variable
You can also set the color based on a categorical variable. This will allow you to color the text by groups or setting a different color for each text, as in the example below.
# install.packages("ggwordcloud")
library(ggwordcloud)
# Data
df <- thankyou_words_small
set.seed(1)
ggplot(df, aes(label = word, size = speakers, color = name)) +
geom_text_wordcloud() +
scale_size_area(max_size = 20) +
theme_minimal()
Color Gradient
Finally, if you want to create a gradient you can pass a numerical variable to the color argument of aes and use a color scale such as scale_color_gradient to create the gradient color scale.
# install.packages("ggwordcloud")
library(ggwordcloud)
# Data
df <- thankyou_words_small
set.seed(1)
ggplot(df, aes(label = word, size = speakers, color = speakers)) +
geom_text_wordcloud() +
scale_size_area(max_size = 20) +
theme_minimal() +
scale_color_gradient(low = "darkred", high = "red")
In your teams please attempt questions 1 and 2 below.
Possible Process Indicators
- Input : Sound and Video Quality, Talents
- Activity : Duration of the Video
- Output : No of Views
Possible Graph Types
- Boxplot differentiated by demographic factor
- Density Plot differentiated by demographic factor
We will generate a boxplot to show the differentiated ratings by gender for event logistics (field : Process_E_LS) using dplyr pipes.
Example of a boxplot showing differentiated ratings for event logistics by gender
C2Cagg%>%
ggplot(aes(x=GENDER, y=Process_E_LS)) +
geom_boxplot(color="red", fill="green", alpha=0.2) +
ggtitle("Process Evaluation for Logistics by Gender") + ylim(0,7)
Example of a grid plot showing three histograms with distribution of evaluation for logistics, instructors and programme
#load the gridExtra package
library(gridExtra)
#plot the first histogram in column 1 without dplyr pipes
Histogram01 <- ggplot(C2Cagg, aes(x=Process_E_LS)) +
geom_histogram(binwidth=1, fill="green", color="black", alpha=0.9) +
ggtitle("Logistics") + ylim(0,1000)
#plot the second histogram in column 2 without dplyr pipes
Histogram02 <- ggplot(C2Cagg, aes(x=Process_E_EX)) +
geom_histogram(binwidth=1, fill="blue", color="black", alpha=0.9) +
ggtitle("Instructors") + ylim(0,1000)
#plot the third histogram in column 3 without dplyr pipes
Histogram03 <- ggplot(C2Cagg, aes(x=Process_E_CT)) +
geom_histogram(binwidth=1, fill="orange", color="black", alpha=0.9) +
ggtitle("Programme") + ylim(0,1000)
#generate the multi-faceted plot using the grid.arrange function
grid.arrange(Histogram01, Histogram02, Histogram03, ncol = 3)
by Hadley Wickham
online version : https://ggplot2-book.org/
citation : Wickham, H. (2016). ggplot2: Elegant Graphics for Data Analysis. (3rd ed). Routledge
by Winston Chang
online version : https://r-graphics.org/
citation : Chang, W. (2022). R Graphics Cookbook. (2nd ed). O’Reilly