02 Data Visualisation with ggplot2
Class on 28 August 2021
ggplot2 is an R package from the Tidyverse family which is dedicated to data visualisation. In this session, you will revise some of what we covered in Lesson 08 (e-Learning) and learn how to create plots which can help you communicate your data and findings better, in preparation for your CWF report.
The types of plots which we will cover in this session are histograms, boxplots, density plots, ridgeline plots, violin plots and heatmaps.
Enjoy !
First, install pacman, which is used to load all packages for this section. You have used it in Section 01 too.
Load the packages required for this section
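The install and load steps described above might look like the following minimal sketch. The package list is an assumption inferred from the functions used later in this session (ggplot2 for the plots, dplyr for the %>% pipe, ggridges for the ridgeline plots), and the CRAN mirror URL is likewise an assumption.

```r
# Install pacman only when it is missing (the CRAN mirror URL is an assumption)
if (!requireNamespace("pacman", quietly = TRUE)) {
  install.packages("pacman", repos = "https://cloud.r-project.org")
}

# p_load() installs any missing package and then attaches it.
# The package list is inferred from the functions used later in this session.
pacman::p_load(ggplot2, dplyr, ggridges)
```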
You can now import the source data from the CSV file, which I have placed online on GitHub at the link below, using the read.csv function.
https://raw.githubusercontent.com/aaron-chen-angus/community2campus/main/C2CsurveyAggregated.csv
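A minimal sketch of the import step, assuming the data frame is named C2Cagg as in the plotting code later in this section. The tryCatch() wrapper is an addition for this sketch only, so that it returns NULL instead of stopping when the file cannot be reached.

```r
# URL quoted in the text above
csv_url <- "https://raw.githubusercontent.com/aaron-chen-angus/community2campus/main/C2CsurveyAggregated.csv"

# read.csv() accepts a URL directly.
# tryCatch() (an addition for this sketch) returns NULL when the download fails.
C2Cagg <- tryCatch(read.csv(csv_url), error = function(e) NULL)
```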
Check the imported data by reading the column names.
## [1] "UNQ_ID" "AGE.RANGE" "GENDER" "NCSS_B03A"
## [5] "NCSS_B03B" "NCSS_B04" "NCSS_B05" "EVENT"
## [9] "PARTICIPATION" "PART_ROLE" "AFFILIATION" "Process_E_LS"
## [13] "Process_E_EX" "Process_E_CT" "Pre_MR1" "Post_MR1"
## [17] "X_MR1" "Pre_MR2" "Post_MR2" "X_MR2"
## [21] "Pre_MR3" "Post_MR3" "X_MR3" "Pre_MR4"
## [25] "Post_MR4" "X_MR4" "Pre_MR5" "Post_MR5"
## [29] "X_MR5" "Pre_AT_PT1" "Pre_AT_PT2" "Pre_AT_PT3"
## [33] "Pre_AT_WT5" "Pre_AT_CT1" "Pre_AT_CT2" "Pre_AT_CT3"
## [37] "Pre_AT_CT4" "Pre_AT_LA1" "Pre_AT_LA2" "Pre_AT_LA3"
## [41] "Pre_AT_SC6" "Pre_ST_SD1" "Pre_ST_SD2" "Pre_ST_SD3"
## [45] "Pre_ST_SD4" "Pre_NCSS_P2" "Pre_AT_WT2" "Pre_AT_WT3"
## [49] "Pre_AT_WT4" "Pre_AT_LA4" "Pre_AT_WT1" "Pre_AT_CT5"
## [53] "Pre_AT_LA5" "Pre_AT_SC2" "Pre_ST_SD5" "Pre_ST_SR2"
## [57] "Pre_ST_SR1" "Pre_ST_SR3" "Pre_ST_SR4" "Pre_ST_SR5"
## [61] "Pre_NCSS_P1" "Pre_NCSS_P3" "Pre_NCSS_P4" "Pre_NCSS_P5"
## [65] "Pre_AT_SC1" "Pre_AT_SC3" "Pre_AT_SC4" "Pre_AT_SC5"
## [69] "Post_AT_PT1" "Post_AT_PT2" "Post_AT_PT3" "Post_AT_WT5"
## [73] "Post_AT_CT1" "Post_AT_CT2" "Post_AT_CT3" "Post_AT_CT4"
## [77] "Post_AT_LA1" "Post_AT_LA2" "Post_AT_LA3" "Post_ST_SD1"
## [81] "Post_ST_SD2" "Post_ST_SD3" "Post_ST_SD4" "Post_AT_SC6"
## [85] "Post_NCSS_P2" "Post_AT_WT2" "Post_AT_WT3" "Post_AT_WT4"
## [89] "Post_AT_LA4" "Post_AT_WT1" "Post_AT_CT5" "Post_AT_LA5"
## [93] "Post_AT_SC2" "Post_ST_SD5" "Post_ST_SR2" "Post_ST_SR1"
## [97] "Post_ST_SR3" "Post_ST_SR4" "Post_ST_SR5" "Post_NCSS_P1"
## [101] "Post_NCSS_P3" "Post_NCSS_P4" "Post_NCSS_P5" "Post_AT_SC1"
## [105] "Post_AT_SC3" "Post_AT_SC4" "Post_AT_SC5"
We will generate a histogram to show the distribution of ratings for event logistics (field : Process_E_LS), using the new code structure which is based on dplyr pipes.
C2Cagg%>%
ggplot(aes(x=Process_E_LS)) +
geom_histogram(binwidth=1, fill="green", color="black", alpha=0.9) +
ggtitle("Process Evaluation for Logistics") + ylim(0,1000)
Next, we will generate a histogram to show the distribution of ratings for instructors (field : Process_E_EX).
C2Cagg%>%
ggplot(aes(x=Process_E_EX)) +
geom_histogram(binwidth=1, fill="blue", color="black", alpha=0.9) +
ggtitle("Process Evaluation for Instructors") + ylim(0,1000)
Next, we will generate a histogram to show the distribution of ratings for programme activities (field : Process_E_CT).
C2Cagg%>%
ggplot(aes(x=Process_E_CT)) +
geom_histogram(binwidth=1, fill="orange", color="black", alpha=0.9) +
ggtitle("Process Evaluation for Programme") + ylim(0,1000)
Question for Review : What is the difference between the code used here to generate the histograms and what was used in Lesson 08 ?
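One way to see what the pipe does is to build the same histogram twice. The sketch below uses made-up ratings (the toy data frame is hypothetical, not the survey data) and compares the pipe form with passing the data frame directly to ggplot(), which may or may not be the exact form used in Lesson 08.

```r
library(ggplot2)
library(dplyr)  # provides the %>% pipe

# Hypothetical ratings standing in for Process_E_LS
toy <- data.frame(Process_E_LS = c(3, 4, 4, 5, 5, 5, 6))

# Pipe form: the data frame on the left becomes the first argument of ggplot()
p_pipe <- toy %>%
  ggplot(aes(x = Process_E_LS)) +
  geom_histogram(binwidth = 1, fill = "green", color = "black", alpha = 0.9)

# Direct form: the data frame is passed explicitly
p_direct <- ggplot(toy, aes(x = Process_E_LS)) +
  geom_histogram(binwidth = 1, fill = "green", color = "black", alpha = 0.9)

# Compare the bin counts computed for the two plots
same_counts <- identical(layer_data(p_pipe)$count, layer_data(p_direct)$count)
```

The pipe is convenient when the data needs filtering or reshaping first, because those steps can be chained in front of ggplot() without creating intermediate objects.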
We will generate a boxplot to show the differentiated ratings by gender for event logistics (field : Process_E_LS), this time using dplyr pipes instead of the code used in Lesson 08.
C2Cagg%>%
ggplot(aes(x=GENDER, y=Process_E_LS)) +
geom_boxplot(color="red", fill="green", alpha=0.2) +
ggtitle("Process Evaluation for Logistics by Gender") + ylim(0,7)
We will now generate a boxplot to show the differentiated ratings by gender for instructors (field : Process_E_EX).
C2Cagg%>%
ggplot(aes(x=GENDER, y=Process_E_EX)) +
geom_boxplot(color="red", fill="blue", alpha=0.2) +
ggtitle("Process Evaluation for Instructors by Gender") + ylim(0,7)
We will now generate a boxplot to show the differentiated ratings by gender for programmes (field : Process_E_CT).
C2Cagg%>%
ggplot(aes(x=GENDER, y=Process_E_CT)) +
geom_boxplot(color="red", fill="orange", alpha=0.2) +
ggtitle("Process Evaluation for Programme by Gender") + ylim(0,7)
Question for Review : What is the difference between the code used here to generate the boxplots and what was used in Lesson 08 ?
We will generate a boxplot to show the differentiated ratings for event logistics by event (field : Process_E_LS), but using dplyr pipes this time.
C2Cagg%>%
ggplot(aes(x=EVENT, y=Process_E_LS)) +
geom_boxplot(color="red", fill="green", alpha=0.2) +
ggtitle("Process Evaluation for Logistics by Event") +
ylim(0,6) + coord_flip()
We will now generate a boxplot to show the differentiated ratings for instructors by event (field : Process_E_EX).
C2Cagg%>%
ggplot(aes(x=EVENT, y=Process_E_EX)) +
geom_boxplot(color="red", fill="blue", alpha=0.2) +
ggtitle("Process Evaluation for Instructors by Event") +
ylim(0,6) + coord_flip()
We will now generate a boxplot to show the differentiated ratings for programme by event (field : Process_E_CT).
C2Cagg%>%
ggplot(aes(x=EVENT, y=Process_E_CT)) +
geom_boxplot(color="red", fill="orange", alpha=0.2) +
ggtitle("Process Evaluation for Programme by Event") +
ylim(0,6) + coord_flip()
Question for Review : Once again, please ponder upon the difference between the code used here to generate the boxplots and what was used in Lesson 08.
Comparison of Gender in MR1 Aggregated Factor Impacts
C2Cagg%>%
ggplot(aes(x=GENDER, y=X_MR1, fill=GENDER)) +
geom_boxplot() +
ggtitle("Gender x MR1 Gains") + ylim(-3,3)
## Warning: Removed 16 rows containing non-finite values (stat_boxplot).
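The warning above is expected: ggplot2 drops rows whose value is missing, and ylim() also turns values outside the limits into missing values before the boxplot statistics are computed. The sketch below uses a hypothetical toy data frame (not the survey data) to count the rows that would be dropped, and shows coord_cartesian() as one alternative that zooms the axis without discarding data.

```r
library(ggplot2)

# Hypothetical toy data: one missing gain and one gain outside the -3..3 limits
toy <- data.frame(
  GENDER = c("F", "F", "F", "M", "M", "M"),
  X_MR1  = c(0.5, NA, 1.0, -0.5, 4.0, 0.0)
)

# Rows that ylim(-3, 3) would drop: missing values plus out-of-range values
dropped <- sum(is.na(toy$X_MR1) | toy$X_MR1 < -3 | toy$X_MR1 > 3)

# coord_cartesian() zooms the axis without discarding out-of-range rows;
# na.rm = TRUE removes the genuinely missing value without a warning
p_zoom <- ggplot(toy, aes(x = GENDER, y = X_MR1, fill = GENDER)) +
  geom_boxplot(na.rm = TRUE) +
  coord_cartesian(ylim = c(-3, 3))
```

Which option is appropriate depends on whether out-of-range values should influence the quartiles; with ylim() they do not, with coord_cartesian() they do.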
Comparison of Gender in MR2 Aggregated Factor Impacts
C2Cagg%>%
ggplot(aes(x=GENDER, y=X_MR2, fill=GENDER)) +
geom_boxplot() +
ggtitle("Gender x MR2 Gains") + ylim(-3,3)
Comparison of Gender in MR3 Aggregated Factor Impacts
C2Cagg%>%
ggplot(aes(x=GENDER, y=X_MR3, fill=GENDER)) +
geom_boxplot() +
ggtitle("Gender x MR3 Gains") + ylim(-3,3)
## Warning: Removed 49 rows containing non-finite values (stat_boxplot).
Comparison of Gender in MR4 Aggregated Factor Impacts
C2Cagg%>%
ggplot(aes(x=GENDER, y=X_MR4, fill=GENDER)) +
geom_boxplot() +
ggtitle("Gender x MR4 Gains") + ylim(-3,3)
Comparison of Gender in MR5 Aggregated Factor Impacts
C2Cagg%>%
ggplot(aes(x=GENDER, y=X_MR5, fill=GENDER)) +
geom_boxplot() +
ggtitle("Gender x MR5 Gains") + ylim(-3,3)
## Warning: Removed 2 rows containing non-finite values (stat_boxplot).
Question for Review : What kind of data visualisations or representations would boxplots be most suited for ?
The following density plots are constructed to examine whether knowing someone with a mental health condition (field : NCSS_B03A) is associated with a different change in each aggregated factor.
Density Plot for NCSS_B03A and MR1
C2Cagg%>%
ggplot(aes(x=X_MR1, color=NCSS_B03A, fill=NCSS_B03A)) +
geom_density(alpha=0.3,size=1)+
labs(x= "Change in MR1 Aggregated Scores Following Event",
subtitle="",
caption="Kruskal-Wallis chi-squared = 148.25, df = 70, p-value = 1.495e-07")
Density Plot for NCSS_B03A and MR2
C2Cagg%>%
ggplot(aes(x=X_MR2, color=NCSS_B03A, fill=NCSS_B03A)) +
geom_density(alpha=0.3,size=1)+
labs(x= "Change in MR2 Aggregated Scores Following Event",
subtitle="",
caption="Kruskal-Wallis chi-squared = 45.786, df = 15, p-value = 5.749e-05")
Density Plot for NCSS_B03A and MR3
C2Cagg%>%
ggplot(aes(x=X_MR3, color=NCSS_B03A, fill=NCSS_B03A)) +
geom_density(alpha=0.3,size=1)+
labs(x= "Change in MR3 Aggregated Scores Following Event",
subtitle="",
caption="Kruskal-Wallis chi-squared = 125.34, df = 50, p-value = 2.1e-08")
Density Plot for NCSS_B03A and MR4
C2Cagg%>%
ggplot(aes(x=X_MR4, color=NCSS_B03A, fill=NCSS_B03A)) +
geom_density(alpha=0.3,size=1)+
labs(x= "Change in MR4 Aggregated Scores Following Event",
subtitle="",
caption="Kruskal-Wallis chi-squared = 44.57, df = 16, p-value = 0.0001615")
Density Plot for NCSS_B03A and MR5
C2Cagg%>%
ggplot(aes(x=X_MR5, color=NCSS_B03A, fill=NCSS_B03A)) +
geom_density(alpha=0.3,size=1)+
labs(x= "Change in MR5 Aggregated Scores Following Event",
subtitle="",
caption="Kruskal-Wallis chi-squared = 42.013, df = 20, p-value = 0.002755")
Question for Review : What kind of data visualisations or representations would density plots be most suited for ?
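The captions on the density plots above report Kruskal-Wallis test results. As a hedged illustration only, the sketch below shows how a statistic of that kind can be computed with base R's kruskal.test() on a hypothetical toy data frame. It is not the call that produced the captions, and the column names change and group are made up for this example.

```r
# Hypothetical toy data: score changes for three made-up response groups
toy <- data.frame(
  change = c(0.2, 0.5, 0.1, 0.9, 1.1, 0.8, -0.3, -0.1, -0.4),
  group  = factor(rep(c("Yes", "No", "Unsure"), each = 3))
)

# Rank-based test of whether the groups share the same distribution
kw <- kruskal.test(change ~ group, data = toy)

# Degrees of freedom equal the number of groups minus one
kw_df <- unname(kw$parameter)
```

The statistic, degrees of freedom and p-value stored in kw can then be pasted into a caption string for labs().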
The following ridgeline plots are constructed to examine how the change in each aggregated factor varies with participants’ education level (field : NCSS_B05).
Ridgeline Plot for NCSS_B05 and MR1
C2Cagg%>%
ggplot(aes(x = X_MR1, y = NCSS_B05, fill = NCSS_B05)) +
geom_density_ridges() +
theme_ridges() +
theme(legend.position = "none")
## Picking joint bandwidth of 0.355
Ridgeline Plot for NCSS_B05 and MR2
C2Cagg%>%
ggplot(aes(x = X_MR2, y = NCSS_B05, fill = NCSS_B05)) +
geom_density_ridges() +
theme_ridges() +
theme(legend.position = "none")
## Picking joint bandwidth of 0.211
Ridgeline Plot for NCSS_B05 and MR3
C2Cagg%>%
ggplot(aes(x = X_MR3, y = NCSS_B05, fill = NCSS_B05)) +
geom_density_ridges() +
theme_ridges() +
theme(legend.position = "none")
## Picking joint bandwidth of 0.393
Ridgeline Plot for NCSS_B05 and MR4
C2Cagg%>%
ggplot(aes(x = X_MR4, y = NCSS_B05, fill = NCSS_B05)) +
geom_density_ridges() +
theme_ridges() +
theme(legend.position = "none")
## Picking joint bandwidth of 0.233
Ridgeline Plot for NCSS_B05 and MR5
C2Cagg%>%
ggplot(aes(x = X_MR5, y = NCSS_B05, fill = NCSS_B05)) +
geom_density_ridges() +
theme_ridges() +
theme(legend.position = "none")
## Picking joint bandwidth of 0.259
Question for Review : What kind of data visualisations or representations would ridgeline plots be most suited for ?
The following violin plots are constructed to examine how the change in each aggregated factor varies with participants’ engagement level (field : PARTICIPATION).
First, we will need to ensure the PARTICIPATION field is a factor.
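A minimal sketch of the conversion, using a hypothetical toy data frame because the actual values stored in PARTICIPATION are not shown in this section.

```r
# Hypothetical toy stand-in for C2Cagg; the real PARTICIPATION values may differ
toy <- data.frame(PARTICIPATION = c(1, 2, 2, 3, 1))

# Convert the column to a factor so each level is treated as its own category
toy$PARTICIPATION <- as.factor(toy$PARTICIPATION)

# Inspect the resulting categories
participation_levels <- levels(toy$PARTICIPATION)
```

On the survey data the same idea would be applied to the C2Cagg data frame, for example with C2Cagg$PARTICIPATION <- as.factor(C2Cagg$PARTICIPATION).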
Violin Plot for PARTICIPATION and MR1
Violin Plot for PARTICIPATION and MR2
Violin Plot for PARTICIPATION and MR3
Violin Plot for PARTICIPATION and MR4
Violin Plot for PARTICIPATION and MR5
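The code for the violin plots listed above is not shown in this section. As a hedged sketch only, a plot of this kind can be built with geom_violin(), following the same pattern as the earlier boxplots. The toy data frame, its level names and the title are hypothetical.

```r
library(ggplot2)

# Hypothetical toy data: three made-up participation levels, four gains each
toy <- data.frame(
  PARTICIPATION = factor(rep(c("Low", "Medium", "High"), each = 4)),
  X_MR1 = c(-0.5, 0.0, 0.3, 0.6, 0.1, 0.4, 0.8, 1.0, 0.5, 0.9, 1.2, 1.6)
)

# One violin per participation level, showing the distribution of gains
p_violin <- ggplot(toy, aes(x = PARTICIPATION, y = X_MR1, fill = PARTICIPATION)) +
  geom_violin() +
  ggtitle("PARTICIPATION x MR1 Gains") +
  ylim(-3, 3)
```

Replacing X_MR1 with X_MR2 to X_MR5 would give the remaining plots in the same way.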
Question for Review : What kind of data visualisations or representations would violin plots be most suited for ?
We will build a simple heatmap to show the impact of various events on the aggregated factors.
Impact on MR1 Compared Across Various Events
C2Cagg%>%
ggplot(aes(x=X_MR1, y=EVENT, fill = X_MR1)) +
geom_tile() +
xlab(label = "Impact on MR1 : Addressing Stigma") +
scale_fill_gradient(name = "MR1 Shift",
low = "#FFFFFF",
high = "#012345")
Impact on MR2 Compared Across Various Events
C2Cagg%>%
ggplot(aes(x=X_MR2, y=EVENT, fill = X_MR2)) +
geom_tile() +
xlab(label = "Impact on MR2 : Mental Health Literacy : Wishful Thinking") +
scale_fill_gradient(name = "MR2 Shift",
low = "#FFFFFF",
high = "#012345")
Impact on MR3 Compared Across Various Events
C2Cagg%>%
ggplot(aes(x=X_MR3, y=EVENT, fill = X_MR3)) +
geom_tile() +
xlab(label = "Impact on MR3 : Promoting Mental Health Advocacy") +
scale_fill_gradient(name = "MR3 Shift",
low = "#FFFFFF",
high = "#012345")
Impact on MR4 Compared Across Various Events
C2Cagg%>%
ggplot(aes(x=X_MR4, y=EVENT, fill = X_MR4)) +
geom_tile() +
xlab(label = "Impact on MR4 : Mental Health Literacy : Relationships") +
scale_fill_gradient(name = "MR4 Shift",
low = "#FFFFFF",
high = "#012345")
Impact on MR5 Compared Across Various Events
C2Cagg%>%
ggplot(aes(x=X_MR5, y=EVENT, fill = X_MR5)) +
geom_tile() +
xlab(label = "Impact on MR5 : Mental Health Literacy : Social Constructivism") +
scale_fill_gradient(name = "MR5 Shift",
low = "#FFFFFF",
high = "#012345")
Congratulations !
You have completed session 2 of the S3729C Data Analytics Seminar.