Data Analytics Seminar

02 Data Visualisation with ggplot2

Class on 28 August 2021

Data Visualisation with ggplot2

ggplot2 is an R package from the Tidyverse family dedicated to data visualisation. In this session, you will revise some of what we covered in Lesson 08 (e-Learning) and learn how to create the plots listed below, which can help you communicate your data and findings more clearly in preparation for your CWF report.

These are the types of plots which we will cover in this session.

Histograms
Boxplots
Density Plots
Ridgeline Plots
Violin Plots
Heatmaps

Enjoy !

Loading the Data and Packages

The code below installs pacman, which is used to load all the packages for this section. You used it in Section 01 too.

install.packages("pacman",repos = "http://cran.us.r-project.org")

## package 'pacman' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\aaron_chen_angus\AppData\Local\Temp\RtmpSaj2TB\downloaded_packages
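If pacman is already installed, there is no need to install it again every time the script runs. The following is a minimal sketch of a conditional install. The mirror URL https://cloud.r-project.org and the variable name pacman_available are assumptions chosen for illustration and are not part of the original code.

```r
# Install pacman only when it is not already available on this machine.
if (!requireNamespace("pacman", quietly = TRUE)) {
  install.packages("pacman", repos = "https://cloud.r-project.org")
}

# Confirm that pacman can now be found (TRUE or FALSE).
pacman_available <- requireNamespace("pacman", quietly = TRUE)
pacman_available
```

requireNamespace() only checks whether the package can be loaded; it does not attach it, so the pacman::p_load call that follows is still needed.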

Load the packages required for this section:

pacman: for loading/unloading packages
psych: for psychometric functions
rio: for importing data
tidyverse: for data wrangling and visualisation functions
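ggplot2: for building the plots (also loaded as part of tidyverse)
ggridges: for the generation of ridgeline plots
vioplot: for the generation of violin plots
devtools: for package development and installation tools
dplyr: for the %>% pipe and data wrangling (also loaded as part of tidyverse)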

pacman::p_load(pacman, psych, rio, tidyverse, ggplot2, ggridges, devtools, vioplot, dplyr)

## Error in get(genname, envir = envir) : object 'testthat_print' not found

You can now import the source data from the CSV file, which I have placed online on GitHub at the link below, using the read.csv function.

https://raw.githubusercontent.com/aaron-chen-angus/community2campus/main/C2CsurveyAggregated.csv

C2Cagg <- read.csv(file = "https://raw.githubusercontent.com/aaron-chen-angus/community2campus/main/C2CsurveyAggregated.csv", header = TRUE, sep = ",")

Check the import by listing the column names.

C2Cagg %>% colnames()

##   [1] "UNQ_ID"        "AGE.RANGE"     "GENDER"        "NCSS_B03A"    
##   [5] "NCSS_B03B"     "NCSS_B04"      "NCSS_B05"      "EVENT"        
##   [9] "PARTICIPATION" "PART_ROLE"     "AFFILIATION"   "Process_E_LS" 
##  [13] "Process_E_EX"  "Process_E_CT"  "Pre_MR1"       "Post_MR1"     
##  [17] "X_MR1"         "Pre_MR2"       "Post_MR2"      "X_MR2"        
##  [21] "Pre_MR3"       "Post_MR3"      "X_MR3"         "Pre_MR4"      
##  [25] "Post_MR4"      "X_MR4"         "Pre_MR5"       "Post_MR5"     
##  [29] "X_MR5"         "Pre_AT_PT1"    "Pre_AT_PT2"    "Pre_AT_PT3"   
##  [33] "Pre_AT_WT5"    "Pre_AT_CT1"    "Pre_AT_CT2"    "Pre_AT_CT3"   
##  [37] "Pre_AT_CT4"    "Pre_AT_LA1"    "Pre_AT_LA2"    "Pre_AT_LA3"   
##  [41] "Pre_AT_SC6"    "Pre_ST_SD1"    "Pre_ST_SD2"    "Pre_ST_SD3"   
##  [45] "Pre_ST_SD4"    "Pre_NCSS_P2"   "Pre_AT_WT2"    "Pre_AT_WT3"   
##  [49] "Pre_AT_WT4"    "Pre_AT_LA4"    "Pre_AT_WT1"    "Pre_AT_CT5"   
##  [53] "Pre_AT_LA5"    "Pre_AT_SC2"    "Pre_ST_SD5"    "Pre_ST_SR2"   
##  [57] "Pre_ST_SR1"    "Pre_ST_SR3"    "Pre_ST_SR4"    "Pre_ST_SR5"   
##  [61] "Pre_NCSS_P1"   "Pre_NCSS_P3"   "Pre_NCSS_P4"   "Pre_NCSS_P5"  
##  [65] "Pre_AT_SC1"    "Pre_AT_SC3"    "Pre_AT_SC4"    "Pre_AT_SC5"   
##  [69] "Post_AT_PT1"   "Post_AT_PT2"   "Post_AT_PT3"   "Post_AT_WT5"  
##  [73] "Post_AT_CT1"   "Post_AT_CT2"   "Post_AT_CT3"   "Post_AT_CT4"  
##  [77] "Post_AT_LA1"   "Post_AT_LA2"   "Post_AT_LA3"   "Post_ST_SD1"  
##  [81] "Post_ST_SD2"   "Post_ST_SD3"   "Post_ST_SD4"   "Post_AT_SC6"  
##  [85] "Post_NCSS_P2"  "Post_AT_WT2"   "Post_AT_WT3"   "Post_AT_WT4"  
##  [89] "Post_AT_LA4"   "Post_AT_WT1"   "Post_AT_CT5"   "Post_AT_LA5"  
##  [93] "Post_AT_SC2"   "Post_ST_SD5"   "Post_ST_SR2"   "Post_ST_SR1"  
##  [97] "Post_ST_SR3"   "Post_ST_SR4"   "Post_ST_SR5"   "Post_NCSS_P1" 
## [101] "Post_NCSS_P3"  "Post_NCSS_P4"  "Post_NCSS_P5"  "Post_AT_SC1"  
## [105] "Post_AT_SC3"   "Post_AT_SC4"   "Post_AT_SC5"
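With more than a hundred columns, it can help to list only the ones you need. The sketch below does this for the change-score columns whose names start with X_MR. The data frame toy is a hypothetical miniature that borrows a few column names from C2Cagg; its values are made up for illustration.

```r
library(dplyr)

# Hypothetical miniature of C2Cagg with a handful of its column names.
toy <- data.frame(
  UNQ_ID   = 1:2,
  GENDER   = c("F", "M"),
  Pre_MR1  = c(2, 3),
  Post_MR1 = c(3, 3),
  X_MR1    = c(1, 0),
  X_MR2    = c(0.5, -0.5)
)

# Keep only the columns whose names start with "X_MR" and list their names.
change_cols <- toy %>%
  select(starts_with("X_MR")) %>%
  colnames()
change_cols
```

The same pattern should work on C2Cagg itself once it has been imported.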

Recap from Lesson 08 - Histograms for Process Evaluation

We will generate a histogram to show the distribution of ratings for event logistics (field : Process_E_LS), using the new code structure which is based on dplyr pipes.

C2Cagg%>%
ggplot(aes(x=Process_E_LS)) + 
geom_histogram(binwidth=1, fill="green", color="black", alpha=0.9) +
ggtitle("Process Evaluation for Logistics") + ylim(0,1000)

Next, we will generate a histogram to show the distribution of ratings for instructors (field : Process_E_EX).

C2Cagg%>%
ggplot(aes(x=Process_E_EX)) + 
geom_histogram(binwidth=1, fill="blue", color="black", alpha=0.9) +
ggtitle("Process Evaluation for Instructors") + ylim(0,1000)

Next, we will generate a histogram to show the distribution of ratings for programme activities (field : Process_E_CT).

C2Cagg%>%
ggplot(aes(x=Process_E_CT)) + 
geom_histogram(binwidth=1, fill="orange", color="black", alpha=0.9) +
ggtitle("Process Evaluation for Programme") + ylim(0,1000)

Question for Review : What is the difference between the code used here to generate the histograms and what was used in Lesson 08 ?
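As a hint, the sketch below compares the piped form used in this section with a non-piped form. It assumes that the Lesson 08 code passed the data frame directly to ggplot(), which is not confirmed here. The data frame toy and its ratings are hypothetical.

```r
library(ggplot2)
library(dplyr)  # provides the %>% pipe

# Hypothetical ratings standing in for Process_E_LS.
toy <- data.frame(Process_E_LS = c(3, 4, 4, 5, 5, 5))

# Piped form, as used in this section: the pipe passes the data frame to ggplot().
p_piped <- toy %>%
  ggplot(aes(x = Process_E_LS)) +
  geom_histogram(binwidth = 1)

# Non-piped form: the data frame is written as the first argument of ggplot().
p_direct <- ggplot(toy, aes(x = Process_E_LS)) +
  geom_histogram(binwidth = 1)

# Compare the bar heights computed for the two plots.
same_bars <- identical(layer_data(p_piped)$count, layer_data(p_direct)$count)
same_bars
```

The pipe changes only how the data frame reaches ggplot(); the layers are still joined with +, not with %>%.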

Recap from Lesson 08 - Boxplots for Differentiated Process Evaluation by Gender

We will generate a boxplot to show the differentiated ratings by gender for event logistics (field : Process_E_LS). This time we will use dplyr pipes instead of the code used in Lesson 08.

C2Cagg%>%
ggplot(aes(x=GENDER, y=Process_E_LS)) + 
geom_boxplot(color="red", fill="green", alpha=0.2) + 
ggtitle("Process Evaluation for Logistics by Gender") + ylim(0,7)

We will now generate a boxplot to show the differentiated ratings by gender for instructors (field : Process_E_EX).

C2Cagg%>%
ggplot(aes(x=GENDER, y=Process_E_EX)) + 
geom_boxplot(color="red", fill="blue", alpha=0.2) + 
ggtitle("Process Evaluation for Instructors by Gender") + ylim(0,7)

We will now generate a boxplot to show the differentiated ratings by gender for programme activities (field : Process_E_CT).

C2Cagg%>%
ggplot(aes(x=GENDER, y=Process_E_CT)) + 
geom_boxplot(color="red", fill="orange", alpha=0.2) + 
ggtitle("Process Evaluation for Programme by Gender") + ylim(0,7)

Question for Review : What is the difference between the code used here to generate the boxplots and what was used in Lesson 08 ?

Recap from Lesson 08 - Boxplots for Differentiated Process Evaluation by Event

We will generate a boxplot to show the differentiated ratings for event logistics by event (field : Process_E_LS), but using dplyr pipes this time.

C2Cagg%>%
ggplot(aes(x=EVENT, y=Process_E_LS)) + 
geom_boxplot(color="red", fill="green", alpha=0.2) + 
ggtitle("Process Evaluation for Logistics by Event") + 
ylim(0,6) + coord_flip()

We will now generate a boxplot to show the differentiated ratings for instructors by event (field : Process_E_EX).

C2Cagg%>%
ggplot(aes(x=EVENT, y=Process_E_EX)) + 
geom_boxplot(color="red", fill="blue", alpha=0.2) + 
ggtitle("Process Evaluation for Instructors by Event") + 
ylim(0,6) + coord_flip()

We will now generate a boxplot to show the differentiated ratings for programme activities by event (field : Process_E_CT).

C2Cagg%>%
ggplot(aes(x=EVENT, y=Process_E_CT)) + 
geom_boxplot(color="red", fill="orange", alpha=0.2) + 
ggtitle("Process Evaluation for Programme by Event") + 
ylim(0,6) + coord_flip()

Question for Review : Once again, please ponder upon the difference between the code used here to generate the boxplots and what was used in Lesson 08.

Boxplots for Aggregated Factors Compared Across Gender

Comparison of Gender in MR1 Aggregated Factor Impacts

C2Cagg%>%
ggplot(aes(x=GENDER, y=X_MR1, fill=GENDER)) + 
geom_boxplot() + 
ggtitle("Gender x MR1 Gains") + ylim(-3,3)

## Warning: Removed 16 rows containing non-finite values (stat_boxplot).
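A warning like the one above can come from missing values in the data, from values that fall outside the range given to ylim(), or from both. The sketch below illustrates the difference between ylim(), which drops out-of-range values before the boxplot statistics are computed, and coord_cartesian(), which only zooms the view. The data frame toy and its values are hypothetical.

```r
library(ggplot2)

# Hypothetical gain scores: one missing value and one value outside -3..3.
toy <- data.frame(
  GENDER = c("F", "F", "F", "M", "M", "M"),
  X_MR1  = c(-1, 0.5, NA, 1, 2, 4)
)

# ylim() sets scale limits: out-of-range values are turned into NA and removed,
# together with genuinely missing values, before the statistics are computed.
p_ylim <- ggplot(toy, aes(x = GENDER, y = X_MR1, fill = GENDER)) +
  geom_boxplot() +
  ylim(-3, 3)

# coord_cartesian() only zooms, so the value 4 still contributes to the statistics.
p_zoom <- ggplot(toy, aes(x = GENDER, y = X_MR1, fill = GENDER)) +
  geom_boxplot() +
  coord_cartesian(ylim = c(-3, 3))

# Count the rows that the ylim() version drops.
n_dropped <- sum(is.na(toy$X_MR1) | toy$X_MR1 < -3 | toy$X_MR1 > 3)
n_dropped
```

If the medians and quartiles should reflect every observation, coord_cartesian() is usually the safer choice; ylim() is fine when dropping out-of-range values is intended.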

Comparison of Gender in MR2 Aggregated Factor Impacts

C2Cagg%>%
ggplot(aes(x=GENDER, y=X_MR2, fill=GENDER)) + 
geom_boxplot() + 
ggtitle("Gender x MR2 Gains") + ylim(-3,3)

Comparison of Gender in MR3 Aggregated Factor Impacts

C2Cagg%>%
ggplot(aes(x=GENDER, y=X_MR3, fill=GENDER)) + 
geom_boxplot() + 
ggtitle("Gender x MR3 Gains") + ylim(-3,3)

## Warning: Removed 49 rows containing non-finite values (stat_boxplot).

Comparison of Gender in MR4 Aggregated Factor Impacts

C2Cagg%>%
ggplot(aes(x=GENDER, y=X_MR4, fill=GENDER)) + 
geom_boxplot() + 
ggtitle("Gender x MR4 Gains") + ylim(-3,3)

Comparison of Gender in MR5 Aggregated Factor Impacts

C2Cagg%>%
ggplot(aes(x=GENDER, y=X_MR5, fill=GENDER)) + 
geom_boxplot() + 
ggtitle("Gender x MR5 Gains") + ylim(-3,3)

## Warning: Removed 2 rows containing non-finite values (stat_boxplot).

Question for Review : What kinds of data and comparisons are boxplots best suited for ?

Density Plots for Aggregated Factors Compared Across Experience

The following density plots are constructed to examine whether knowing someone with a mental health condition (field : NCSS_B03A) is associated with a different change in each aggregated factor.

Density Plot for NCSS_B03A and MR1

C2Cagg%>%
ggplot(aes(x=X_MR1, color=NCSS_B03A, fill=NCSS_B03A)) +
geom_density(alpha=0.3,size=1)+ 
labs(x= "Change in MR1 Aggregated Scores Following Event",
subtitle="",
caption="Kruskal-Wallis chi-squared = 148.25, df = 70, p-value = 1.495e-07")
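The caption above reports a Kruskal-Wallis test, but this section does not show how it was computed. The sketch below shows one way such a caption could be generated with kruskal.test() rather than typed by hand. It assumes the score change is the response and NCSS_B03A is the grouping variable, which may not match the original analysis. The data frame toy and its values are hypothetical.

```r
library(ggplot2)

# Hypothetical data standing in for X_MR1 and NCSS_B03A.
set.seed(1)
toy <- data.frame(
  NCSS_B03A = rep(c("Yes", "No"), each = 20),
  X_MR1     = c(rnorm(20, mean = 0.5), rnorm(20, mean = 0))
)

# Kruskal-Wallis rank sum test of the score change across the groups.
kw <- kruskal.test(X_MR1 ~ NCSS_B03A, data = toy)

# Build the caption text from the test result.
kw_caption <- sprintf(
  "Kruskal-Wallis chi-squared = %.3f, df = %d, p-value = %.4g",
  kw$statistic, as.integer(kw$parameter), kw$p.value
)

# Use the generated text as the plot caption.
p <- ggplot(toy, aes(x = X_MR1, color = NCSS_B03A, fill = NCSS_B03A)) +
  geom_density(alpha = 0.3) +
  labs(x = "Change in MR1 Aggregated Scores Following Event",
       caption = kw_caption)
```

Generating the caption from the test object keeps the reported figures in step with the data if the file is updated.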

Density Plot for NCSS_B03A and MR2

C2Cagg%>%
ggplot(aes(x=X_MR2, color=NCSS_B03A, fill=NCSS_B03A)) +
geom_density(alpha=0.3,size=1)+ 
labs(x= "Change in MR2 Aggregated Scores Following Event",
subtitle="",
caption="Kruskal-Wallis chi-squared = 45.786, df = 15, p-value = 5.749e-05")

Density Plot for NCSS_B03A and MR3

C2Cagg%>%
ggplot(aes(x=X_MR3, color=NCSS_B03A, fill=NCSS_B03A)) +
geom_density(alpha=0.3,size=1)+ 
labs(x= "Change in MR3 Aggregated Scores Following Event",
subtitle="",
caption="Kruskal-Wallis chi-squared = 125.34, df = 50, p-value = 2.1e-08")

Density Plot for NCSS_B03A and MR4

C2Cagg%>%
ggplot(aes(x=X_MR4, color=NCSS_B03A, fill=NCSS_B03A)) +
geom_density(alpha=0.3,size=1)+ 
labs(x= "Change in MR4 Aggregated Scores Following Event",
subtitle="",
caption="Kruskal-Wallis chi-squared = 44.57, df = 16, p-value = 0.0001615")

Density Plot for NCSS_B03A and MR5

C2Cagg%>%
ggplot(aes(x=X_MR5, color=NCSS_B03A, fill=NCSS_B03A)) +
geom_density(alpha=0.3,size=1)+ 
labs(x= "Change in MR5 Aggregated Scores Following Event",
subtitle="",
caption="Kruskal-Wallis chi-squared = 42.013, df = 20, p-value = 0.002755")

Question for Review : What kinds of data and comparisons are density plots best suited for ?

Ridgeline Plots for Aggregated Factors Compared Across Education Level

The following ridgeline plots are constructed to examine whether participants’ education level (field : NCSS_B05) is associated with a different change in each aggregated factor.

Ridgeline Plot for NCSS_B05 and MR1

C2Cagg%>%
ggplot(aes(x = X_MR1, y = NCSS_B05, fill = NCSS_B05)) +
geom_density_ridges() +
theme_ridges() + 
theme(legend.position = "none")

## Picking joint bandwidth of 0.355
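The message above is ggridges reporting the kernel bandwidth it chose for all the ridges. The sketch below sets the bandwidth explicitly through the bandwidth argument, so that the smoothing is fixed and reproducible. The data frame toy, its education labels and the value 0.355 reused from the message are assumptions for illustration.

```r
library(ggplot2)
library(ggridges)

# Hypothetical scores and education labels standing in for X_MR1 and NCSS_B05.
set.seed(1)
toy <- data.frame(
  NCSS_B05 = rep(c("Secondary", "Diploma", "Degree"), each = 30),
  X_MR1    = rnorm(90)
)

# Passing bandwidth fixes the kernel bandwidth used for every ridge,
# so ggridges does not have to estimate one from the data.
p <- ggplot(toy, aes(x = X_MR1, y = NCSS_B05, fill = NCSS_B05)) +
  geom_density_ridges(bandwidth = 0.355) +
  theme_ridges() +
  theme(legend.position = "none")
```

A smaller bandwidth gives bumpier ridges and a larger one gives smoother ridges, so it is worth trying a few values before settling on one.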

Ridgeline Plot for NCSS_B05 and MR2

C2Cagg%>%
ggplot(aes(x = X_MR2, y = NCSS_B05, fill = NCSS_B05)) +
geom_density_ridges() +
theme_ridges() + 
theme(legend.position = "none")

## Picking joint bandwidth of 0.211

Ridgeline Plot for NCSS_B05 and MR3

C2Cagg%>%
ggplot(aes(x = X_MR3, y = NCSS_B05, fill = NCSS_B05)) +
geom_density_ridges() +
theme_ridges() + 
theme(legend.position = "none")

## Picking joint bandwidth of 0.393

Ridgeline Plot for NCSS_B05 and MR4

C2Cagg%>%
ggplot(aes(x = X_MR4, y = NCSS_B05, fill = NCSS_B05)) +
geom_density_ridges() +
theme_ridges() + 
theme(legend.position = "none")

## Picking joint bandwidth of 0.233

Ridgeline Plot for NCSS_B05 and MR5

C2Cagg%>%
ggplot(aes(x = X_MR5, y = NCSS_B05, fill = NCSS_B05)) +
geom_density_ridges() +
theme_ridges() + 
theme(legend.position = "none")

## Picking joint bandwidth of 0.259

Question for Review : What kinds of data and comparisons are ridgeline plots best suited for ?

Violin Plots for Aggregated Factors Compared Across Engagement Level

The following violin plots are constructed to examine whether participants’ engagement level (field : PARTICIPATION) is associated with a different change in each aggregated factor.

First, we will need to ensure the PARTICIPATION field is a factor.

C2Cagg$PARTICIPATION <- as.factor(C2Cagg$PARTICIPATION)
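as.factor() orders the levels alphabetically, which is not always the order you want on the plot axis. The sketch below shows how factor() with explicit levels can set the order. The labels Low, Medium and High are hypothetical; the actual values in PARTICIPATION may differ.

```r
# Hypothetical participation labels; the real levels in PARTICIPATION may differ.
participation <- c("High", "Low", "Medium", "Low", "High")

# as.factor() sorts the levels alphabetically.
default_levels <- levels(as.factor(participation))
default_levels

# factor() with explicit levels fixes the order used on the axis and in the legend.
participation_ordered <- factor(participation,
                                levels = c("Low", "Medium", "High"))
ordered_levels <- levels(participation_ordered)
ordered_levels
```

Any value that is not listed in levels becomes NA, so check the unique values in the column before setting the order.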

Violin Plot for PARTICIPATION and MR1

C2Cagg%>%
ggplot(aes(x=PARTICIPATION, 
           y=X_MR1, 
           fill=PARTICIPATION)) + 
  geom_violin()

Violin Plot for PARTICIPATION and MR2

C2Cagg%>%
ggplot(aes(x=PARTICIPATION, 
           y=X_MR2, 
           fill=PARTICIPATION)) + 
  geom_violin()

Violin Plot for PARTICIPATION and MR3

C2Cagg%>%
ggplot(aes(x=PARTICIPATION, 
           y=X_MR3, 
           fill=PARTICIPATION)) + 
geom_violin()

Violin Plot for PARTICIPATION and MR4

C2Cagg%>%
ggplot(aes(x=PARTICIPATION, 
           y=X_MR4, 
           fill=PARTICIPATION)) + 
geom_violin()

Violin Plot for PARTICIPATION and MR5

C2Cagg%>%
ggplot(aes(x=PARTICIPATION, 
           y=X_MR5, 
           fill=PARTICIPATION)) + 
geom_violin()

Question for Review : What kinds of data and comparisons are violin plots best suited for ?

Building a Heatmap with a Data Subset

We will build a simple heatmap to show the impact of various events on the aggregated factors.

Impact on MR1 Compared Across Various Events

C2Cagg%>%
ggplot(aes(x=X_MR1, y=EVENT, fill = X_MR1)) + 
geom_tile() + 
xlab(label = "Impact on MR1 : Addressing Stigma") +
scale_fill_gradient(name = "MR1 Shift",
                      low = "#FFFFFF",
                      high = "#012345")

Impact on MR2 Compared Across Various Events

C2Cagg%>%
ggplot(aes(x=X_MR2, y=EVENT, fill = X_MR2)) + 
geom_tile() + 
xlab(label = "Impact on MR2 : Mental Health Literacy : Wishful Thinking") +
scale_fill_gradient(name = "MR2 Shift",
                      low = "#FFFFFF",
                      high = "#012345")

Impact on MR3 Compared Across Various Events

C2Cagg%>%
ggplot(aes(x=X_MR3, y=EVENT, fill = X_MR3)) + 
geom_tile() + 
xlab(label = "Impact on MR3 : Promoting Mental Health Advocacy") +
scale_fill_gradient(name = "MR3 Shift",
                      low = "#FFFFFF",
                      high = "#012345")

Impact on MR4 Compared Across Various Events

C2Cagg%>%
ggplot(aes(x=X_MR4, y=EVENT, fill = X_MR4)) + 
geom_tile() + 
xlab(label = "Impact on MR4 : Mental Health Literacy : Relationships") +
scale_fill_gradient(name = "MR4 Shift",
                      low = "#FFFFFF",
                      high = "#012345")

Impact on MR5 Compared Across Various Events

C2Cagg%>%
ggplot(aes(x=X_MR5, y=EVENT, fill = X_MR5)) + 
geom_tile() + 
xlab(label = "Impact on MR5 : Mental Health Literacy : Social Constructivism") +
scale_fill_gradient(name = "MR5 Shift",
                      low = "#FFFFFF",
                      high = "#012345")
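The heatmaps above draw one tile per observed score. As an alternative sketch, not the method used in this section, the scores can first be averaged so that each tile shows the mean shift for one event and one aggregated factor. The data frame toy, its event names and values, and the names mr_factor, shift and mean_shift are hypothetical.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Hypothetical rows using the same column names as C2Cagg.
toy <- data.frame(
  EVENT = c("Event A", "Event A", "Event B", "Event B"),
  X_MR1 = c(0.5, 1.5, -0.5, 0.5),
  X_MR2 = c(0.0, 1.0, 1.0, 2.0)
)

# Reshape to one row per event, factor and score, then average within each cell.
cell_means <- toy %>%
  pivot_longer(cols = starts_with("X_MR"),
               names_to = "mr_factor", values_to = "shift") %>%
  group_by(EVENT, mr_factor) %>%
  summarise(mean_shift = mean(shift, na.rm = TRUE), .groups = "drop")

# One tile per event and factor, coloured by the mean shift.
p <- ggplot(cell_means, aes(x = mr_factor, y = EVENT, fill = mean_shift)) +
  geom_tile() +
  scale_fill_gradient(name = "Mean shift", low = "#FFFFFF", high = "#012345")
```

This layout puts all the aggregated factors in a single figure, which can make events easier to compare side by side.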

Congratulations !

You have completed session 2 of the S3729C Data Analytics Seminar.