SHL Journal Club 06 July 2022

Principal Component Analysis and Predictive Analytics on a Youth-Focused Community Mental Health Project

Aaron Chen Angus

CommunityᄅCampus [ C2C ]

An Analysis of the Effectiveness of Youth Engagement Programmes to Combat Mental Health Stigma

This study assesses the efficacy of “blended” programmes conducted during restrictions to “live” programmes following infection control measures implemented amidst the COVID-19 pandemic.

Engagement touch points (events) were established for youth in areas of Education, Performing Arts, Sports & Fitness, and key indicators pertaining to effects on mental health literacy, stigma and advocacy were measured.

This included events such as :

Kindred Spirit Series 2020
- n(p) = 750 | n(r) = 252
NBFA Sports Model 2020
- n(p) = 150 | n(r) = 51
Bagan Int’l Race 2020
- n(p) = 800 | n(r) = 283
The Next Hitmaker 2020
- n(p) = 50 | n(r) = 6
Off Centre | WAWW 2020
- n(p) = 1700 | n(r) = 370
Social Media Invasion
- n(p) = 1000 | n(r) = 640
n(p) = total number of participants n(r) = number of survey respondents

All participants had to complete surveys conducted upon enrolment and following programme conclusion.

The C2C survey was constructed based on sociodemographic correlates and components of the Attitudes Towards Serious Mental Illness [ATSMI-AV] and Social Tolerance [ST] scales. Principal component analysis [PCA], cluster analysis and multiple regression models were used to elucidate its factor structure.

Responses on pre- and post- surveys were compared via Mann-Whitney-Wilcoxon tests to ascertain the magnitude and direction of impact, and differences between programme effects investigated via Kruskal–Wallis H test. MANCOVA was used to investigate the effect of objective engagement scores among the different programmes on the factors identified by PCA.

Principal Component Analysis

This first section will bring you through the process of Principal Component Analysis (PCA), carried out on the CommunityᄅCampus base dataset.

For more information about PCA, I recommend the following resources for a good read

R Blog on PCA https://www.r-bloggers.com/2021/05/principal-component-analysis-pca-in-r/
DataCamp Short Course on PCA https://www.datacamp.com/community/tutorials/pca-analysis-r

This section on PCA is meant to provide to the SHL Journal Club an appreciation of the process that goes into conducting exploratory data analysis, and better visualise the variation present dataset with many variables, or “wide” variables as we refer to it in the field.

This is especially pertinent when you have “wide” datasets such as the CommunityᄅCampus dataset with multiple variables and we would need to look at dimensionality reduction to provide more efficient data analysis.

Loading the Data and Packages

This is the code for installation of Pacman which is used to load all packages later.

install.packages("pacman",repos = "http://cran.us.r-project.org")

## package 'pacman' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\aaron_chen_angus\AppData\Local\Temp\RtmpYNEbBU\downloaded_packages

The following are the R packages that are provided for this section on Principal Component Analysis

GPArotation: Gradient Projection Algorithm Rotation for FA
magrittr: for pipes
pacman: for loading/unloading packages
psych: for psychometric functions
rio: for importing data
tidyverse: for a wide range of functions used in PCA and so on
lavaan: for confirmatory factor analysis

pacman::p_load(GPArotation, magrittr, pacman, psych, rio, tidyverse, lavaan)

The PCA data is placed in a csv file on Github via the link https://raw.githubusercontent.com/aaron-chen-angus/community2campus/main/PCArevdata.csv

It can be accessed using the read.csv command.

df <- read.csv(file = "https://raw.githubusercontent.com/aaron-chen-angus/community2campus/main/PCArevdata.csv", header = TRUE, sep = ",")

Check on the output by reading the column names

df %>% colnames()

##  [1] "NCSS_P1" "NCSS_P2" "NCSS_P3" "NCSS_P4" "NCSS_P5" "AT_PT1"  "AT_PT2" 
##  [8] "AT_PT3"  "AT_WT1"  "AT_WT2"  "AT_WT3"  "AT_WT4"  "AT_WT5"  "AT_CT1" 
## [15] "AT_CT2"  "AT_CT3"  "AT_CT4"  "AT_CT5"  "AT_LA1"  "AT_LA2"  "AT_LA3" 
## [22] "AT_LA4"  "AT_LA5"  "AT_SC1"  "AT_SC2"  "AT_SC3"  "AT_SC4"  "AT_SC5" 
## [29] "AT_SC6"  "ST_SD1"  "ST_SD2"  "ST_SD3"  "ST_SD4"  "ST_SD5"  "ST_SR2" 
## [36] "ST_SR1"  "ST_SR3"  "ST_SR4"  "ST_SR5"

Principal Component Analysis Using the Default Method

There are three methods of PCA available in R

prcomp : most common method
princomp : slightly different algorithm
principal : method from psych package

We will use the default method prcomp first

pc <- df %>%
prcomp(center = TRUE, scale = TRUE)

Get summary stats

summary(pc)

## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     3.9552 2.2443 1.28210 1.24473 1.17405 0.97936 0.93336
## Proportion of Variance 0.4011 0.1291 0.04215 0.03973 0.03534 0.02459 0.02234
## Cumulative Proportion  0.4011 0.5303 0.57241 0.61214 0.64748 0.67208 0.69442
##                            PC8     PC9    PC10    PC11    PC12    PC13    PC14
## Standard deviation     0.91529 0.91023 0.84528 0.80615 0.77563 0.76162 0.74171
## Proportion of Variance 0.02148 0.02124 0.01832 0.01666 0.01543 0.01487 0.01411
## Cumulative Proportion  0.71590 0.73714 0.75546 0.77212 0.78755 0.80242 0.81653
##                           PC15    PC16    PC17   PC18    PC19    PC20   PC21
## Standard deviation     0.72811 0.70806 0.69206 0.6756 0.66110 0.65390 0.6244
## Proportion of Variance 0.01359 0.01286 0.01228 0.0117 0.01121 0.01096 0.0100
## Cumulative Proportion  0.83012 0.84298 0.85526 0.8670 0.87817 0.88913 0.8991
##                           PC22    PC23    PC24    PC25    PC26    PC27   PC28
## Standard deviation     0.60497 0.59783 0.56482 0.55525 0.52770 0.51575 0.4997
## Proportion of Variance 0.00938 0.00916 0.00818 0.00791 0.00714 0.00682 0.0064
## Cumulative Proportion  0.90851 0.91768 0.92586 0.93376 0.94090 0.94772 0.9541
##                           PC29    PC30    PC31    PC32    PC33    PC34    PC35
## Standard deviation     0.48621 0.46285 0.45290 0.44246 0.42277 0.39387 0.36975
## Proportion of Variance 0.00606 0.00549 0.00526 0.00502 0.00458 0.00398 0.00351
## Cumulative Proportion  0.96019 0.96568 0.97094 0.97596 0.98054 0.98452 0.98803
##                           PC36    PC37    PC38    PC39
## Standard deviation     0.36186 0.35104 0.34942 0.30114
## Proportion of Variance 0.00336 0.00316 0.00313 0.00233
## Cumulative Proportion  0.99138 0.99454 0.99767 1.00000

Get a screeplot of the eigenvalues

plot(pc)

In the output, you can see that there is a significant dip after 5 factors.
Hence, there is some impetus to use 5 factors for subsequent analysis.

Very Simple Structure (VSS) Algorithm

Use “Very Simple Structure” to suggest the number of factors
Note: MAP = Minimum Absolute Partial correlation
n is the proposed maximum number of factors

df %>% 
select(1:39) %>%
vss(n = 10)

## 
## Very Simple Structure
## Call: vss(x = ., n = 10)
## VSS complexity 1 achieves a maximimum of 0.86  with  1  factors
## VSS complexity 2 achieves a maximimum of 0.95  with  2  factors
## 
## The Velicer MAP achieves a minimum of 0.01  with  5  factors 
## BIC achieves a minimum of  -137.51  with  10  factors
## Sample Size adjusted BIC achieves a minimum of  1120.59  with  10  factors
## 
## Statistics by number of factors 
##    vss1 vss2   map dof chisq prob sqresid  fit RMSEA   BIC SABIC complex eChisq
## 1  0.86 0.00 0.048 702 23939    0    40.1 0.86 0.131 18630 20861     1.0  45097
## 2  0.67 0.95 0.014 664 11597    0    15.0 0.95 0.093  6575  8685     1.3   7471
## 3  0.54 0.85 0.012 627  8999    0    12.6 0.96 0.083  4258  6250     1.7   5105
## 4  0.60 0.91 0.011 591  7079    0    10.4 0.96 0.076  2610  4488     1.7   3176
## 5  0.47 0.84 0.011 556  5753    0     8.7 0.97 0.070  1548  3315     2.1   1959
## 6  0.52 0.88 0.011 522  4978    0     7.9 0.97 0.067  1031  2689     2.0   1522
## 7  0.48 0.85 0.012 489  4403    0     7.3 0.97 0.064   705  2259     2.1   1199
## 8  0.48 0.86 0.013 457  3735    0     6.7 0.98 0.061   279  1731     2.2    923
## 9  0.46 0.86 0.014 426  3320    0     6.3 0.98 0.059    98  1452     2.2    744
## 10 0.47 0.86 0.016 396  2857    0     6.0 0.98 0.057  -138  1121     2.4    606
##     SRMR eCRMS  eBIC
## 1  0.126 0.129 39788
## 2  0.051 0.054  2449
## 3  0.042 0.046   364
## 4  0.033 0.037 -1293
## 5  0.026 0.030 -2246
## 6  0.023 0.028 -2425
## 7  0.021 0.025 -2498
## 8  0.018 0.023 -2533
## 9  0.016 0.021 -2478
## 10 0.015 0.020 -2388

nFactors Algorithm

We can also use “nfactors” to do the same analysis, and the output will be VSS as well

df %>%
select(1:39) %>%
nfactors(n = 10)

## 
## Number of factors
## Call: vss(x = x, n = n, rotate = rotate, diagonal = diagonal, fm = fm, 
##     n.obs = n.obs, plot = FALSE, title = title, use = use, cor = cor)
## VSS complexity 1 achieves a maximimum of 0.86  with  1  factors
## VSS complexity 2 achieves a maximimum of 0.95  with  2  factors
## The Velicer MAP achieves a minimum of 0.01  with  5  factors 
## Empirical BIC achieves a minimum of  -2532.75  with  8  factors
## Sample Size adjusted BIC achieves a minimum of  1120.59  with  10  factors
## 
## Statistics by number of factors 
##    vss1 vss2   map dof chisq prob sqresid  fit RMSEA   BIC SABIC complex eChisq
## 1  0.86 0.00 0.048 702 23939    0    40.1 0.86 0.131 18630 20861     1.0  45097
## 2  0.67 0.95 0.014 664 11597    0    15.0 0.95 0.093  6575  8685     1.3   7471
## 3  0.54 0.85 0.012 627  8999    0    12.6 0.96 0.083  4258  6250     1.7   5105
## 4  0.60 0.91 0.011 591  7079    0    10.4 0.96 0.076  2610  4488     1.7   3176
## 5  0.47 0.84 0.011 556  5753    0     8.7 0.97 0.070  1548  3315     2.1   1959
## 6  0.52 0.88 0.011 522  4978    0     7.9 0.97 0.067  1031  2689     2.0   1522
## 7  0.48 0.85 0.012 489  4403    0     7.3 0.97 0.064   705  2259     2.1   1199
## 8  0.48 0.86 0.013 457  3735    0     6.7 0.98 0.061   279  1731     2.2    923
## 9  0.46 0.86 0.014 426  3320    0     6.3 0.98 0.059    98  1452     2.2    744
## 10 0.47 0.86 0.016 396  2857    0     6.0 0.98 0.057  -138  1121     2.4    606
##     SRMR eCRMS  eBIC
## 1  0.126 0.129 39788
## 2  0.051 0.054  2449
## 3  0.042 0.046   364
## 4  0.033 0.037 -1293
## 5  0.026 0.030 -2246
## 6  0.023 0.028 -2425
## 7  0.021 0.025 -2498
## 8  0.018 0.023 -2533
## 9  0.016 0.021 -2478
## 10 0.015 0.020 -2388

Factor Analysis

Factor analysis using minimum residual (minres) method and oblimin rotation, which is useful for simple structure.
Need to enter desired number of factors (from VSS or nfactors above).

Calculate and plot factors with fa()

pa5.out <- fa(df[1:39],
nfactors = 5,
fm = "pa",
max.iter = 100,
rotate = "oblimin")

Plot the fa.diagram to visualise the rotated factor solution, and interpret whether it is an interpretable solution

fa.diagram(pa5.out)

Hierarchical Clustering

Hierarchical clustering of items with iclust()

df %>% 
  select(1:39) %>% 
  iclust()

## ICLUST (Item Cluster Analysis)
## Call: iclust(r.mat = .)
## 
## Purified Alpha:
##  C37  C26 
## 0.95 0.88 
## 
## G6* reliability:
## C37 C26 
## 1.0 0.9 
## 
## Original Beta:
##  C37  C26 
## 0.75 0.70 
## 
## Cluster size:
## C37 C26 
##  30   9 
## 
## Item by Cluster Structure matrix:
##           O   P   C37   C26
## NCSS_P1 C37 C26  0.43  0.63
## NCSS_P2 C37 C37  0.32 -0.10
## NCSS_P3 C37 C37  0.56  0.24
## NCSS_P4 C37 C26  0.49  0.53
## NCSS_P5 C37 C26  0.55  0.65
## AT_PT1  C37 C37  0.78  0.38
## AT_PT2  C37 C37  0.79  0.30
## AT_PT3  C37 C37  0.87  0.35
## AT_WT1  C37 C26  0.65  0.82
## AT_WT2  C26 C26  0.06 -0.54
## AT_WT3  C26 C26  0.18  0.67
## AT_WT4  C26 C26 -0.23 -0.71
## AT_WT5  C37 C37  0.58  0.06
## AT_CT1  C37 C37  0.66  0.09
## AT_CT2  C37 C37  0.69  0.06
## AT_CT3  C37 C37  0.84  0.43
## AT_CT4  C37 C37  0.30 -0.17
## AT_CT5  C37 C37  0.55  0.50
## AT_LA1  C37 C37  0.65  0.18
## AT_LA2  C37 C37  0.37 -0.03
## AT_LA3  C37 C37  0.53  0.05
## AT_LA4  C26 C26  0.31  0.65
## AT_LA5  C37 C26  0.58  0.78
## AT_SC1  C37 C37  0.40 -0.02
## AT_SC2  C37 C37  0.80  0.46
## AT_SC3  C37 C37  0.68  0.19
## AT_SC4  C37 C37  0.42  0.10
## AT_SC5  C37 C37  0.29  0.06
## AT_SC6  C37 C37  0.85  0.36
## ST_SD1  C37 C37  0.87  0.41
## ST_SD2  C37 C37  0.90  0.41
## ST_SD3  C37 C37  0.88  0.38
## ST_SD4  C37 C37  0.59  0.19
## ST_SD5  C37 C37  0.51  0.45
## ST_SR2  C37 C37  0.75  0.74
## ST_SR1  C37 C37  0.74  0.65
## ST_SR3  C37 C37  0.76  0.70
## ST_SR4  C37 C37  0.73  0.69
## ST_SR5  C37 C37  0.61  0.57
## 
## With eigenvalues of:
##  C37  C26 
## 13.9  5.5 
## 
## Purified scale intercorrelations
##  reliabilities on diagonal
##  correlations corrected for attenuation above diagonal: 
##      C37  C26
## C37 0.95 0.52
## C26 0.48 0.88
## 
## Cluster fit =  0.86   Pattern fit =  0.98  RMSR =  0.07

Principal Component Analysis with K Factors

First PCA with no rotation, specify 5 factors

df %>% principal(nfactors = 5)

## Principal Components Analysis
## Call: principal(r = ., nfactors = 5)
## Standardized loadings (pattern matrix) based upon correlation matrix
##           RC2   RC1   RC4   RC3   RC5   h2   u2 com
## NCSS_P1  0.44  0.11 -0.20  0.59  0.09 0.60 0.40 2.3
## NCSS_P2  0.02  0.09  0.64  0.38  0.23 0.61 0.39 2.0
## NCSS_P3  0.17  0.35  0.30  0.53  0.25 0.59 0.41 3.2
## NCSS_P4  0.43  0.19  0.01  0.69 -0.19 0.74 0.26 2.0
## NCSS_P5  0.54  0.15 -0.01  0.68 -0.03 0.78 0.22 2.0
## AT_PT1   0.33  0.70  0.03  0.23  0.13 0.68 0.32 1.8
## AT_PT2   0.34  0.69  0.19  0.16  0.11 0.67 0.33 1.8
## AT_PT3   0.41  0.73  0.18  0.10  0.21 0.79 0.21 2.0
## AT_WT1   0.78  0.25 -0.20  0.18  0.12 0.76 0.24 1.5
## AT_WT2  -0.32  0.15  0.64  0.00 -0.03 0.54 0.46 1.6
## AT_WT3   0.51 -0.12 -0.43  0.18  0.17 0.52 0.48 2.6
## AT_WT4  -0.42 -0.04  0.55 -0.33 -0.11 0.61 0.39 2.7
## AT_WT5   0.26  0.58  0.35 -0.10 -0.16 0.56 0.44 2.3
## AT_CT1   0.27  0.58  0.40 -0.05  0.07 0.58 0.42 2.3
## AT_CT2   0.19  0.66  0.36  0.00  0.13 0.62 0.38 1.8
## AT_CT3   0.49  0.67  0.11  0.07  0.20 0.74 0.26 2.1
## AT_CT4  -0.01  0.34  0.35 -0.28  0.21 0.37 0.63 3.6
## AT_CT5   0.76  0.13  0.16  0.00 -0.02 0.61 0.39 1.2
## AT_LA1   0.22  0.68  0.12 -0.02  0.12 0.54 0.46 1.3
## AT_LA2  -0.22  0.61 -0.07  0.08  0.28 0.51 0.49 1.8
## AT_LA3  -0.10  0.78 -0.07  0.10  0.07 0.64 0.36 1.1
## AT_LA4   0.53  0.02 -0.38  0.15  0.22 0.50 0.50 2.4
## AT_LA5   0.77  0.13 -0.15  0.23  0.19 0.72 0.28 1.4
## AT_SC1   0.16  0.21  0.49  0.03  0.33 0.42 0.58 2.4
## AT_SC2   0.61  0.48  0.17 -0.04  0.35 0.76 0.24 2.8
## AT_SC3   0.34  0.45  0.37  0.05  0.33 0.57 0.43 3.8
## AT_SC4   0.19  0.26  0.17 -0.16  0.62 0.54 0.46 1.9
## AT_SC5  -0.06  0.23  0.01  0.12  0.69 0.55 0.45 1.3
## AT_SC6   0.44  0.71  0.14  0.01  0.24 0.78 0.22 2.0
## ST_SD1   0.46  0.72  0.11  0.10  0.18 0.79 0.21 2.0
## ST_SD2   0.45  0.76  0.11  0.14  0.14 0.83 0.17 1.8
## ST_SD3   0.43  0.75  0.13  0.16  0.10 0.80 0.20 1.8
## ST_SD4   0.07  0.75 -0.07  0.20 -0.09 0.61 0.39 1.2
## ST_SD5   0.58  0.25  0.00  0.17 -0.19 0.47 0.53 1.8
## ST_SR2   0.87  0.27  0.01  0.12  0.14 0.86 0.14 1.3
## ST_SR1   0.84  0.30  0.05  0.10  0.03 0.81 0.19 1.3
## ST_SR3   0.82  0.32  0.00  0.18  0.06 0.82 0.18 1.4
## ST_SR4   0.80  0.33 -0.04  0.21 -0.04 0.79 0.21 1.5
## ST_SR5   0.63  0.33 -0.04  0.25 -0.14 0.59 0.41 2.0
## 
##                        RC2  RC1  RC4  RC3  RC5
## SS loadings           9.08 8.83 2.82 2.49 2.03
## Proportion Var        0.23 0.23 0.07 0.06 0.05
## Cumulative Var        0.23 0.46 0.53 0.60 0.65
## Proportion Explained  0.36 0.35 0.11 0.10 0.08
## Cumulative Proportion 0.36 0.71 0.82 0.92 1.00
## 
## Mean item complexity =  2
## Test of the hypothesis that 5 components are sufficient.
## 
## The root mean square of the residuals (RMSR) is  0.04 
##  with the empirical chi square  3907.38  with prob <  0 
## 
## Fit based upon off diagonal values = 0.99

Second PCA with oblimin (oblique) rotation

df %>% 
  principal(
    nfactors = 5, 
    rotate = "oblimin"
  ) %>% 
  plot()  # Plot position of variables on components

Item Analysis

Forming a key for scale scores and a scoring key with indices defined by PCA for the variables

key <- list(
MR1 = c(6, 7 ,8, 13, 14, 15, 16, 17, 19, 20, 21, 29, 30, 31, 32, 33),
MR2 = c(2, 10, 11, 12, 22),
MR3 = c(9,18, 23, 25, 34, 35, 36, 37, 38, 39),
MR4 = c(1, 3, 4, 5),
MR5 = c(24, 26, 27, 28)
)

Matrix of alphas, correlations, and correlations corrected for attentuation using scoreItems() from psych

scoreItems(key, df)

## Call: scoreItems(keys = key, items = df)
## 
## (Unstandardized) Alpha:
##        MR1  MR2  MR3  MR4  MR5
## alpha 0.88 0.77 0.76 0.83 0.89
## 
## Standard errors of unstandardized Alpha:
##          MR1   MR2   MR3   MR4   MR5
## ASE   0.0061 0.015 0.011 0.015 0.013
## 
## Average item correlation:
##           MR1 MR2  MR3  MR4  MR5
## average.r 0.3 0.4 0.24 0.55 0.66
## 
## Median item correlation:
##  MR1  MR2  MR3  MR4  MR5 
## 0.32 0.46 0.25 0.54 0.64 
## 
##  Guttman 6* reliability: 
##           MR1  MR2  MR3  MR4  MR5
## Lambda.6 0.94 0.87 0.86 0.88 0.91
## 
## Signal/Noise based upon av.r : 
##              MR1 MR2 MR3 MR4 MR5
## Signal/Noise   7 3.4 3.2   5 7.8
## 
## Scale intercorrelations corrected for attenuation 
##  raw correlations below the diagonal, alpha on the diagonal 
##  corrected correlations above the diagonal:
##      MR1  MR2  MR3  MR4  MR5
## MR1 0.88 1.04 1.03 0.98 0.89
## MR2 0.86 0.77 0.97 1.04 0.76
## MR3 0.85 0.75 0.76 0.95 0.81
## MR4 0.83 0.83 0.76 0.83 0.65
## MR5 0.78 0.63 0.67 0.55 0.89
## 
##  In order to see the item by scale loadings and frequency counts of the data
##  print with the short option = FALSE

Add Scale Scores to Data

df %<>% 
mutate(
MR1 = rowMeans(df[c(6, 7 ,8, 13, 14, 15, 16, 17, 19, 20, 21, 29, 30, 31, 32, 33)], na.rm = T),
MR2 = rowMeans(df[c(2, 10, 11, 12, 22)], na.rm = T),
MR3 = rowMeans(df[c(9,18, 23, 25, 34, 35, 36, 37, 38, 39)], na.rm = T),
MR4 = rowMeans(df[c(1, 3, 4, 5)], na.rm = T),
MR5 = rowMeans(df[c(24, 26, 27, 28)], na.rm = T)
)

Descriptives for scale scores

df %>% 
select(MR1:MR5) %>%
describe()

##     vars    n mean   sd median trimmed  mad min max range  skew kurtosis   se
## MR1    1 1924 2.94 0.98    3.0    2.95 1.11 1.0 5.0   4.0 -0.06    -0.77 0.02
## MR2    2 1924 2.85 0.46    2.8    2.84 0.30 1.2 4.6   3.4  0.12     0.87 0.01
## MR3    3 1924 3.31 1.17    3.7    3.37 1.33 1.0 5.0   4.0 -0.39    -1.25 0.03
## MR4    4 1924 2.96 1.02    3.0    2.95 1.11 1.0 5.0   4.0  0.06    -0.81 0.02
## MR5    5 1924 2.83 0.86    3.0    2.83 0.74 1.0 5.0   4.0  0.06    -0.02 0.02

Plot of Means and SDs for Aggregated Variables

df %>% 
select(MR1:MR5) %>%
error.bars(
ylim = c(1, 5),
sd = T,
arrow.col = "gray", 
eyes = F,
pch = 19,
main = "M & SD of C2C Aggregated Factors",
ylab = "Means",
xlab = "C2C Aggregated Factors"
)

Pairs Panels

This is a whole bunch of histograms, scatterplots, correlation plots, etc, to summarise the outcome of the factor and item analysis done in previous sections.

df %>% 
select(MR1:MR5) %>%
pairs.panels(
hist.col = "gray", 
jiggle = T,
main = "C2C Aggregated Factors"
)

Cronbach’s Alpha

Cronbach’s alpha is a measure of internal consistency, that is, how closely related a set of items are as a group. Cronbach’s alpha needs to be computed for each of the aggregated factors and their components. This is an example with MR1.

df %>%
select (c(6, 7 ,8, 13, 14, 15, 16, 17, 19, 20, 21, 29, 30, 31, 32, 33)) %>%
psych::alpha()

## 
## Reliability analysis   
## Call: psych::alpha(x = .)
## 
##   raw_alpha std.alpha G6(smc) average_r S/N    ase mean   sd median_r
##       0.95      0.95    0.96      0.53  18 0.0016  2.9 0.98     0.53
## 
##  lower alpha upper     95% confidence boundaries
## 0.95 0.95 0.95 
## 
##  Reliability if an item is dropped:
##        raw_alpha std.alpha G6(smc) average_r S/N alpha se var.r med.r
## AT_PT1      0.94      0.94    0.95      0.52  16   0.0017 0.029  0.53
## AT_PT2      0.94      0.94    0.95      0.52  16   0.0017 0.029  0.52
## AT_PT3      0.94      0.94    0.95      0.51  16   0.0018 0.027  0.52
## AT_WT5      0.95      0.95    0.95      0.54  17   0.0016 0.030  0.53
## AT_CT1      0.95      0.94    0.95      0.53  17   0.0017 0.030  0.52
## AT_CT2      0.95      0.94    0.95      0.53  17   0.0017 0.031  0.52
## AT_CT3      0.94      0.94    0.95      0.52  16   0.0018 0.029  0.52
## AT_CT4      0.95      0.95    0.96      0.56  19   0.0015 0.021  0.54
## AT_LA1      0.95      0.94    0.95      0.53  17   0.0017 0.031  0.53
## AT_LA2      0.95      0.95    0.96      0.55  19   0.0015 0.026  0.54
## AT_LA3      0.95      0.95    0.95      0.54  17   0.0016 0.031  0.54
## AT_SC6      0.94      0.94    0.95      0.51  16   0.0018 0.028  0.52
## ST_SD1      0.94      0.94    0.95      0.51  16   0.0018 0.027  0.52
## ST_SD2      0.94      0.94    0.95      0.51  16   0.0018 0.026  0.52
## ST_SD3      0.94      0.94    0.95      0.51  16   0.0018 0.026  0.52
## ST_SD4      0.95      0.95    0.95      0.54  17   0.0016 0.030  0.53
## 
##  Item statistics 
##           n raw.r std.r r.cor r.drop mean  sd
## AT_PT1 1924  0.79  0.79  0.78   0.76  3.0 1.3
## AT_PT2 1924  0.82  0.82  0.81   0.79  2.9 1.3
## AT_PT3 1924  0.88  0.88  0.88   0.86  2.9 1.4
## AT_WT5 1924  0.65  0.66  0.62   0.60  2.7 1.2
## AT_CT1 1924  0.71  0.72  0.70   0.67  2.7 1.2
## AT_CT2 1924  0.76  0.76  0.75   0.72  2.8 1.2
## AT_CT3 1924  0.84  0.83  0.83   0.81  3.1 1.5
## AT_CT4 1924  0.42  0.42  0.37   0.35  2.7 1.2
## AT_LA1 1924  0.73  0.73  0.71   0.69  3.1 1.3
## AT_LA2 1924  0.49  0.51  0.46   0.44  3.0 1.1
## AT_LA3 1924  0.65  0.66  0.63   0.61  3.0 1.2
## AT_SC6 1924  0.87  0.86  0.86   0.84  3.1 1.4
## ST_SD1 1924  0.87  0.87  0.87   0.85  3.0 1.4
## ST_SD2 1924  0.89  0.89  0.89   0.87  3.1 1.4
## ST_SD3 1924  0.88  0.87  0.88   0.86  3.0 1.4
## ST_SD4 1924  0.67  0.67  0.64   0.62  2.9 1.2
## 
## Non missing response frequency for each item
##           1    2    3    4    5 miss
## AT_PT1 0.15 0.22 0.28 0.16 0.19    0
## AT_PT2 0.17 0.23 0.28 0.16 0.17    0
## AT_PT3 0.20 0.25 0.20 0.15 0.21    0
## AT_WT5 0.20 0.21 0.36 0.13 0.10    0
## AT_CT1 0.17 0.27 0.30 0.16 0.09    0
## AT_CT2 0.17 0.25 0.32 0.16 0.10    0
## AT_CT3 0.20 0.18 0.19 0.17 0.26    0
## AT_CT4 0.20 0.24 0.31 0.16 0.09    0
## AT_LA1 0.14 0.19 0.31 0.20 0.17    0
## AT_LA2 0.10 0.23 0.38 0.20 0.10    0
## AT_LA3 0.09 0.25 0.34 0.19 0.14    0
## AT_SC6 0.17 0.23 0.23 0.12 0.25    0
## ST_SD1 0.19 0.20 0.24 0.12 0.25    0
## ST_SD2 0.18 0.19 0.24 0.16 0.23    0
## ST_SD3 0.19 0.21 0.24 0.16 0.21    0
## ST_SD4 0.15 0.19 0.39 0.15 0.12    0

df %>%
pull(AT_CT4) %>%
hist()

Item Response Theory

Item Response Theory Output for MR1

df %>%
select (c(6, 7 ,8, 13, 14, 15, 16, 17, 19, 20, 21, 29, 30, 31, 32, 33)) %>%
irt.fa %T>%  # T-pipe
plot(main = "C2C Aggregated Factor : MR1") %>% 
print()

## Item Response Analysis using Factor Analysis  
## 
## Call: irt.fa(x = .)
## Item Response Analysis using Factor Analysis  
## 
##  Summary information by factor and item
##  Factor =  1 
##               -3    -2    -1     0     1     2    3
## AT_PT1      0.35  0.79  1.06  1.17  1.16  0.78 0.29
## AT_PT2      0.35  0.86  1.12  1.22  1.29  0.98 0.38
## AT_PT3      0.69  1.49  1.17  1.76  2.00  1.87 0.64
## AT_WT5      0.20  0.36  0.54  0.63  0.62  0.49 0.30
## AT_CT1      0.22  0.44  0.66  0.78  0.77  0.61 0.35
## AT_CT2      0.25  0.55  0.81  0.91  0.91  0.74 0.40
## AT_CT3      0.31  1.02  1.54  1.70  1.50  0.74 0.18
## AT_CT4      0.12  0.15  0.17  0.18  0.18  0.16 0.13
## AT_LA1      0.30  0.57  0.80  0.86  0.77  0.52 0.25
## AT_LA2      0.17  0.22  0.26  0.27  0.26  0.22 0.17
## AT_LA3      0.26  0.41  0.54  0.60  0.55  0.41 0.24
## AT_SC6      0.57  1.23  1.32  1.61  1.95  1.11 0.22
## ST_SD1      0.57  1.48  1.55  1.47  2.05  1.44 0.28
## ST_SD2      1.26  1.40  1.85  1.26  2.02  1.96 0.49
## ST_SD3      0.72  1.48  1.53  1.42  1.83  1.68 0.49
## ST_SD4      0.24  0.43  0.60  0.67  0.63  0.48 0.29
## Test Info   6.57 12.88 15.51 16.51 18.49 14.18 5.12
## SEM         0.39  0.28  0.25  0.25  0.23  0.27 0.44
## Reliability 0.85  0.92  0.94  0.94  0.95  0.93 0.80
## 
## Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, 
##     fm = fm)
## 
## Test of the hypothesis that 1 factor is sufficient.
## The degrees of freedom for the model is 104  and the objective function was  1.82 
## The number of observations was  1924  with Chi Square =  3495.98  with prob <  0 
## 
## The root mean square of the residuals (RMSA) is  0.05 
## The df corrected root mean square of the residuals is  0.06 
## 
## Tucker Lewis Index of factoring reliability =  0.873
## RMSEA index =  0.13  and the 10 % confidence intervals are  0.127 0.134
## BIC =  2709.51

Item Response Theory Output for MR2

df %>%
select (c(2, 10, 11, 12, 22)) %>%
irt.fa %T>%  # T-pipe
plot(main = "C2C Aggregated Factor : MR2") %>% 
print()

## Item Response Analysis using Factor Analysis  
## 
## Call: irt.fa(x = .)
## Item Response Analysis using Factor Analysis  
## 
##  Summary information by factor and item
##  Factor =  1 
##                -3   -2   -1    0    1    2    3
## AT_WT2       0.19 0.35 0.52 0.62 0.61 0.49 0.32
## AT_WT3       0.25 0.49 0.68 0.74 0.72 0.61 0.40
## AT_WT4       0.22 0.52 0.79 0.84 0.83 0.78 0.56
## AT_LA4       0.18 0.30 0.42 0.49 0.47 0.38 0.26
## Test Info    0.85 1.66 2.41 2.69 2.63 2.26 1.53
## SEM          1.08 0.78 0.64 0.61 0.62 0.66 0.81
## Reliability -0.17 0.40 0.58 0.63 0.62 0.56 0.35
## 
## Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, 
##     fm = fm)
## 
## Test of the hypothesis that 1 factor is sufficient.
## The degrees of freedom for the model is 5  and the objective function was  0.1 
## The number of observations was  1924  with Chi Square =  193.02  with prob <  8.9e-40 
## 
## The root mean square of the residuals (RMSA) is  0.07 
## The df corrected root mean square of the residuals is  0.1 
## 
## Tucker Lewis Index of factoring reliability =  0.846
## RMSEA index =  0.14  and the 10 % confidence intervals are  0.123 0.157
## BIC =  155.2

Item Response Theory Output for MR3

df %>%
select (c(9,18, 23, 25, 34, 35, 36, 37, 38, 39)) %>%
irt.fa %T>%  # T-pipe
plot(main = "C2C Aggregated Factor : MR3") %>% 
print()

## Item Response Analysis using Factor Analysis  
## 
## Call: irt.fa(x = .)
## Item Response Analysis using Factor Analysis  
## 
##  Summary information by factor and item
##  Factor =  1 
##               -3    -2    -1     0     1    2    3
## AT_WT1      0.52  1.14  1.49  1.43  1.00 0.39 0.10
## AT_CT5      0.30  0.57  0.79  0.85  0.70 0.43 0.21
## AT_LA5      0.40  0.99  1.43  1.48  1.03 0.39 0.10
## AT_SC2      0.21  0.53  0.96  1.17  0.92 0.47 0.18
## ST_SD5      0.29  0.49  0.66  0.68  0.57 0.37 0.20
## ST_SR2      1.42  2.49  1.44  2.24  4.47 0.50 0.00
## ST_SR1      1.07  2.17  2.16  2.19  1.82 1.16 0.11
## ST_SR3      2.13  1.68  1.59  2.39  1.13 2.19 0.26
## ST_SR4      1.00  1.64  1.88  1.80  1.51 0.88 0.13
## ST_SR5      0.46  0.75  0.85  0.83  0.72 0.45 0.20
## Test Info   7.81 12.45 13.25 15.06 13.85 7.23 1.49
## SEM         0.36  0.28  0.27  0.26  0.27 0.37 0.82
## Reliability 0.87  0.92  0.92  0.93  0.93 0.86 0.33
## 
## Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, 
##     fm = fm)
## 
## Test of the hypothesis that 1 factor is sufficient.
## The degrees of freedom for the model is 35  and the objective function was  0.76 
## The number of observations was  1924  with Chi Square =  1452.62  with prob <  7.1e-283 
## 
## The root mean square of the residuals (RMSA) is  0.04 
## The df corrected root mean square of the residuals is  0.04 
## 
## Tucker Lewis Index of factoring reliability =  0.917
## RMSEA index =  0.145  and the 10 % confidence intervals are  0.139 0.152
## BIC =  1187.94

Item Response Theory Output for MR4

df %>%
select (c(1, 3, 4, 5)) %>%
irt.fa %T>%  # T-pipe
plot(main = "C2C Aggregated Factor : MR4") %>% 
print()

## Item Response Analysis using Factor Analysis  
## 
## Call: irt.fa(x = .)
## Item Response Analysis using Factor Analysis  
## 
##  Summary information by factor and item
##  Factor =  1 
##                -3   -2   -1    0    1    2    3
## NCSS_P1      0.35 0.58 0.75 0.77 0.63 0.38 0.18
## NCSS_P3      0.15 0.21 0.26 0.29 0.29 0.25 0.19
## NCSS_P4      0.37 0.84 1.04 1.09 1.16 1.02 0.51
## NCSS_P5      0.02 0.17 4.93 0.33 2.44 1.22 0.47
## Test Info    0.89 1.80 6.97 2.48 4.51 2.87 1.35
## SEM          1.06 0.75 0.38 0.63 0.47 0.59 0.86
## Reliability -0.12 0.44 0.86 0.60 0.78 0.65 0.26
## 
## Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, 
##     fm = fm)
## 
## Test of the hypothesis that 1 factor is sufficient.
## The degrees of freedom for the model is 2  and the objective function was  0.01 
## The number of observations was  1924  with Chi Square =  11.09  with prob <  0.0039 
## 
## The root mean square of the residuals (RMSA) is  0.01 
## The df corrected root mean square of the residuals is  0.02 
## 
## Tucker Lewis Index of factoring reliability =  0.993
## RMSEA index =  0.049  and the 10 % confidence intervals are  0.023 0.078
## BIC =  -4.04

Item Response Theory Output for MR5

df %>%
select (c(24, 26, 27, 28)) %>%
irt.fa %T>%  # T-pipe
plot(main = "C2C Aggregated Factor : MR5") %>% 
print()

## Item Response Analysis using Factor Analysis  
## 
## Call: irt.fa(x = .)
## Item Response Analysis using Factor Analysis  
## 
##  Summary information by factor and item
##  Factor =  1 
##                -3   -2   -1    0    1    2    3
## AT_SC1       0.18 0.28 0.38 0.43 0.42 0.34 0.24
## AT_SC3       0.25 0.65 1.00 1.10 1.16 0.90 0.38
## AT_SC4       0.23 0.36 0.49 0.55 0.51 0.39 0.25
## AT_SC5       0.14 0.17 0.20 0.21 0.21 0.18 0.15
## Test Info    0.79 1.47 2.07 2.29 2.30 1.82 1.01
## SEM          1.12 0.82 0.70 0.66 0.66 0.74 0.99
## Reliability -0.26 0.32 0.52 0.56 0.57 0.45 0.01
## 
## Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, 
##     fm = fm)
## 
## Test of the hypothesis that 1 factor is sufficient.
## The degrees of freedom for the model is 2  and the objective function was  0.06 
## The number of observations was  1924  with Chi Square =  106.46  with prob <  7.6e-24 
## 
## The root mean square of the residuals (RMSA) is  0.06 
## The df corrected root mean square of the residuals is  0.1 
## 
## Tucker Lewis Index of factoring reliability =  0.794
## RMSEA index =  0.165  and the 10 % confidence intervals are  0.139 0.192
## BIC =  91.33

Confirmatory Factor Analysis

Define the factors for the CFA model

cfa.model <- 'MR1L =~ AT_PT1 + AT_PT2 + AT_PT3 + AT_WT5 + AT_CT1 + AT_CT2 + AT_CT3 + AT_CT4 + AT_LA1 + AT_LA2 + AT_LA3 + AT_SC6 + ST_SD1 + ST_SD2 + ST_SD3 + ST_SD4
MR2L =~ NCSS_P2 + AT_WT2 + AT_WT3 + AT_WT4 + AT_LA4
MR3L =~ AT_WT1 + AT_CT5 + AT_LA5 + AT_SC2 + ST_SD5 + ST_SR2 + ST_SR1 + ST_SR3 + ST_SR4 + ST_SR5
MR4L =~ NCSS_P1 + NCSS_P3 + NCSS_P4 + NCSS_P5
MR5L =~ AT_SC1 + AT_SC3 + AT_SC4 + AT_SC5'

Fit the CFA Model

fit <- cfa(cfa.model, data = df)

Check the results, quick check with “User Model versus Baseline Model,” which should have values above 0.9 or possibly 0.95

fit %>% summary(fit.measures = TRUE)

## lavaan 0.6-9 ended normally after 91 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                        88
##                                                       
##   Number of observations                          1924
##                                                       
## Model Test User Model:
##                                                        
##   Test statistic                              11636.334
##   Degrees of freedom                                692
##   P-value (Chi-square)                            0.000
## 
## Model Test Baseline Model:
## 
##   Test statistic                             60703.928
##   Degrees of freedom                               741
##   P-value                                        0.000
## 
## User Model versus Baseline Model:
## 
##   Comparative Fit Index (CFI)                    0.817
##   Tucker-Lewis Index (TLI)                       0.805
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)            -101202.716
##   Loglikelihood unrestricted model (H1)     -95384.549
##                                                       
##   Akaike (AIC)                              202581.432
##   Bayesian (BIC)                            203070.902
##   Sample-size adjusted Bayesian (BIC)       202791.325
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.091
##   90 Percent confidence interval - lower         0.089
##   90 Percent confidence interval - upper         0.092
##   P-value RMSEA <= 0.05                          0.000
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.103
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   MR1L =~                                             
##     AT_PT1            1.000                           
##     AT_PT2            1.018    0.025   40.916    0.000
##     AT_PT3            1.220    0.026   47.354    0.000
##     AT_WT5            0.693    0.025   28.136    0.000
##     AT_CT1            0.756    0.024   31.508    0.000
##     AT_CT2            0.814    0.024   34.155    0.000
##     AT_CT3            1.188    0.028   42.885    0.000
##     AT_CT4            0.385    0.026   14.690    0.000
##     AT_LA1            0.823    0.025   32.675    0.000
##     AT_LA2            0.424    0.024   17.989    0.000
##     AT_LA3            0.640    0.024   26.744    0.000
##     AT_SC6            1.194    0.026   45.286    0.000
##     ST_SD1            1.238    0.026   47.311    0.000
##     ST_SD2            1.239    0.025   48.806    0.000
##     ST_SD3            1.206    0.025   47.491    0.000
##     ST_SD4            0.714    0.024   29.903    0.000
##   MR2L =~                                             
##     NCSS_P2           1.000                           
##     AT_WT2            3.396    0.506    6.707    0.000
##     AT_WT3           -3.831    0.564   -6.793    0.000
##     AT_WT4            4.009    0.589    6.806    0.000
##     AT_LA4           -3.900    0.578   -6.750    0.000
##   MR3L =~                                             
##     AT_WT1            1.000                           
##     AT_CT5            0.780    0.023   34.374    0.000
##     AT_LA5            1.005    0.024   42.029    0.000
##     AT_SC2            0.979    0.027   36.257    0.000
##     ST_SD5            0.686    0.023   29.992    0.000
##     ST_SR2            1.258    0.023   54.258    0.000
##     ST_SR1            1.178    0.023   51.893    0.000
##     ST_SR3            1.129    0.021   53.474    0.000
##     ST_SR4            1.089    0.022   49.657    0.000
##     ST_SR5            0.711    0.020   35.829    0.000
##   MR4L =~                                             
##     NCSS_P1           1.000                           
##     NCSS_P3           0.668    0.035   19.186    0.000
##     NCSS_P4           1.155    0.035   33.049    0.000
##     NCSS_P5           1.457    0.041   35.784    0.000
##   MR5L =~                                             
##     AT_SC1            1.000                           
##     AT_SC3            1.651    0.075   21.968    0.000
##     AT_SC4            0.980    0.055   17.901    0.000
##     AT_SC5            0.621    0.048   12.811    0.000
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   MR1L ~~                                             
##     MR2L             -0.037    0.008   -4.778    0.000
##     MR3L              0.870    0.040   21.630    0.000
##     MR4L              0.451    0.028   16.184    0.000
##     MR5L              0.528    0.031   16.948    0.000
##   MR2L ~~                                             
##     MR3L             -0.141    0.022   -6.529    0.000
##     MR4L             -0.096    0.015   -6.401    0.000
##     MR5L              0.003    0.004    0.866    0.386
##   MR3L ~~                                             
##     MR4L              0.724    0.036   19.955    0.000
##     MR5L              0.407    0.028   14.540    0.000
##   MR4L ~~                                             
##     MR5L              0.189    0.018   10.273    0.000
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .AT_PT1            0.640    0.022   29.436    0.000
##    .AT_PT2            0.592    0.020   29.243    0.000
##    .AT_PT3            0.391    0.014   27.147    0.000
##    .AT_WT5            0.926    0.030   30.492    0.000
##    .AT_CT1            0.802    0.026   30.297    0.000
##    .AT_CT2            0.728    0.024   30.097    0.000
##    .AT_CT3            0.652    0.023   28.823    0.000
##    .AT_CT4            1.303    0.042   30.901    0.000
##    .AT_LA1            0.854    0.028   30.215    0.000
##    .AT_LA2            1.012    0.033   30.837    0.000
##    .AT_LA3            0.901    0.029   30.557    0.000
##    .AT_SC6            0.495    0.018   28.097    0.000
##    .ST_SD1            0.405    0.015   27.171    0.000
##    .ST_SD2            0.321    0.012   26.155    0.000
##    .ST_SD3            0.374    0.014   27.068    0.000
##    .ST_SD4            0.833    0.027   30.397    0.000
##    .NCSS_P2           1.340    0.044   30.803    0.000
##    .AT_WT2            0.903    0.033   27.301    0.000
##    .AT_WT3            0.604    0.025   23.834    0.000
##    .AT_WT4            0.564    0.025   22.587    0.000
##    .AT_LA4            0.913    0.035   26.145    0.000
##    .AT_WT1            0.641    0.022   28.872    0.000
##    .AT_CT5            0.909    0.030   30.097    0.000
##    .AT_LA5            0.792    0.027   29.262    0.000
##    .AT_SC2            1.223    0.041   29.941    0.000
##    .ST_SD5            1.017    0.033   30.383    0.000
##    .ST_SR2            0.331    0.014   24.389    0.000
##    .ST_SR1            0.403    0.015   26.242    0.000
##    .ST_SR3            0.300    0.012   25.114    0.000
##    .ST_SR4            0.447    0.016   27.351    0.000
##    .ST_SR5            0.668    0.022   29.979    0.000
##    .NCSS_P1           0.802    0.029   27.892    0.000
##    .NCSS_P3           1.249    0.041   30.187    0.000
##    .NCSS_P4           0.513    0.022   23.611    0.000
##    .NCSS_P5           0.259    0.023   11.505    0.000
##    .AT_SC1            1.044    0.037   28.204    0.000
##    .AT_SC3            0.519    0.036   14.575    0.000
##    .AT_SC4            0.988    0.035   28.159    0.000
##    .AT_SC5            1.188    0.039   30.088    0.000
##     MR1L              1.089    0.052   20.845    0.000
##     MR2L              0.041    0.012    3.435    0.001
##     MR3L              1.357    0.062   22.042    0.000
##     MR4L              0.761    0.045   16.887    0.000
##     MR5L              0.433    0.037   11.692    0.000

And that completes the section on PCA as we now move to exploratory data analysis with the aggregated factors

Data Visualisation with ggplot2

ggplot2 is a R package from the Tidyverse family which is dedicated to data visualization. In this section, we show how data visualisation can be carried out using ggplot2 for the purpose of Exploratory Data Analysis.

These are the types of plots which we will cover in this session.

Histograms
Boxplots
Density Plots
Ridgeline Plots
Violin Plots
Heatmaps

Enjoy !

Loading the Data and Packages

This is the code for installation of Pacman which is used to load all packages for this section. We have also used this command in the earlier section on Principal Component Analysis.

install.packages("pacman",repos = "http://cran.us.r-project.org")

## Warning: package 'pacman' is in use and will not be installed

Load the packages required for this section

pacman: for loading/unloading packages
psych: for psychometric functions
rio: for importing data
tidyverse: for data wrangling and visualisation functions
ggplot2: for the use of ggplot2
ggridges: for the generation of ridgeline plots
vioplot: for the generation of violin plots

pacman::p_load(pacman, psych, rio, tidyverse, ggplot2, ggridges, devtools, vioplot, dplyr)

## Error in get(genname, envir = envir) : object 'testthat_print' not found

You can now import the source data from the csv file which I have placed online at Github via the link below using the read.csv command.

https://raw.githubusercontent.com/aaron-chen-angus/community2campus/main/C2CsurveyAggregated.csv

C2Cagg <- read.csv(file = "https://raw.githubusercontent.com/aaron-chen-angus/community2campus/main/C2CsurveyAggregated.csv", header = TRUE, sep = ",")

Check on the output by reading the column names

C2Cagg %>% colnames()

##   [1] "UNQ_ID"        "AGE.RANGE"     "GENDER"        "NCSS_B03A"    
##   [5] "NCSS_B03B"     "NCSS_B04"      "NCSS_B05"      "EVENT"        
##   [9] "PARTICIPATION" "PART_ROLE"     "AFFILIATION"   "Process_E_LS" 
##  [13] "Process_E_EX"  "Process_E_CT"  "Pre_MR1"       "Post_MR1"     
##  [17] "X_MR1"         "Pre_MR2"       "Post_MR2"      "X_MR2"        
##  [21] "Pre_MR3"       "Post_MR3"      "X_MR3"         "Pre_MR4"      
##  [25] "Post_MR4"      "X_MR4"         "Pre_MR5"       "Post_MR5"     
##  [29] "X_MR5"         "Pre_AT_PT1"    "Pre_AT_PT2"    "Pre_AT_PT3"   
##  [33] "Pre_AT_WT5"    "Pre_AT_CT1"    "Pre_AT_CT2"    "Pre_AT_CT3"   
##  [37] "Pre_AT_CT4"    "Pre_AT_LA1"    "Pre_AT_LA2"    "Pre_AT_LA3"   
##  [41] "Pre_AT_SC6"    "Pre_ST_SD1"    "Pre_ST_SD2"    "Pre_ST_SD3"   
##  [45] "Pre_ST_SD4"    "Pre_NCSS_P2"   "Pre_AT_WT2"    "Pre_AT_WT3"   
##  [49] "Pre_AT_WT4"    "Pre_AT_LA4"    "Pre_AT_WT1"    "Pre_AT_CT5"   
##  [53] "Pre_AT_LA5"    "Pre_AT_SC2"    "Pre_ST_SD5"    "Pre_ST_SR2"   
##  [57] "Pre_ST_SR1"    "Pre_ST_SR3"    "Pre_ST_SR4"    "Pre_ST_SR5"   
##  [61] "Pre_NCSS_P1"   "Pre_NCSS_P3"   "Pre_NCSS_P4"   "Pre_NCSS_P5"  
##  [65] "Pre_AT_SC1"    "Pre_AT_SC3"    "Pre_AT_SC4"    "Pre_AT_SC5"   
##  [69] "Post_AT_PT1"   "Post_AT_PT2"   "Post_AT_PT3"   "Post_AT_WT5"  
##  [73] "Post_AT_CT1"   "Post_AT_CT2"   "Post_AT_CT3"   "Post_AT_CT4"  
##  [77] "Post_AT_LA1"   "Post_AT_LA2"   "Post_AT_LA3"   "Post_ST_SD1"  
##  [81] "Post_ST_SD2"   "Post_ST_SD3"   "Post_ST_SD4"   "Post_AT_SC6"  
##  [85] "Post_NCSS_P2"  "Post_AT_WT2"   "Post_AT_WT3"   "Post_AT_WT4"  
##  [89] "Post_AT_LA4"   "Post_AT_WT1"   "Post_AT_CT5"   "Post_AT_LA5"  
##  [93] "Post_AT_SC2"   "Post_ST_SD5"   "Post_ST_SR2"   "Post_ST_SR1"  
##  [97] "Post_ST_SR3"   "Post_ST_SR4"   "Post_ST_SR5"   "Post_NCSS_P1" 
## [101] "Post_NCSS_P3"  "Post_NCSS_P4"  "Post_NCSS_P5"  "Post_AT_SC1"  
## [105] "Post_AT_SC3"   "Post_AT_SC4"   "Post_AT_SC5"

Histograms for Process Evaluation

We will generate a histogram to show the distribution of ratings for event logistics (field : Process_E_LS), using the new code structure which is based on dplyr pipes.

C2Cagg%>%
ggplot(aes(x=Process_E_LS)) + 
geom_histogram(binwidth=1, fill="green", color="black", alpha=0.9) +
ggtitle("Process Evaluation for Logistics") + ylim(0,1000)

Next, we will generate a histogram to show the distribution of ratings for instructors (field : Process_E_EX).

C2Cagg%>%
ggplot(aes(x=Process_E_EX)) + 
geom_histogram(binwidth=1, fill="blue", color="black", alpha=0.9) +
ggtitle("Process Evaluation for Instructors") + ylim(0,1000)

Next, we will generate a histogram to show the distribution of ratings for programme activities (field : Process_E_CT).

C2Cagg%>%
ggplot(aes(x=Process_E_CT)) + 
geom_histogram(binwidth=1, fill="orange", color="black", alpha=0.9) +
ggtitle("Process Evaluation for Programme") + ylim(0,1000)

Boxplots for Differentiated Process Evaluation

We will generate a boxplot to show the differentiated ratings by gender for event logistics (field : Process_E_LS), this time using dplyr pipes instead of the codes used in Lesson 08.

C2Cagg%>%
ggplot(aes(x=GENDER, y=Process_E_LS)) + 
geom_boxplot(color="red", fill="green", alpha=0.2) + 
ggtitle("Process Evaluation for Logistics by Gender") + ylim(0,7)

We will now generate a boxplot to show the differentiated ratings by gender for instructors (field : Process_E_EX).

C2Cagg%>%
ggplot(aes(x=GENDER, y=Process_E_EX)) + 
geom_boxplot(color="red", fill="blue", alpha=0.2) + 
ggtitle("Process Evaluation for Instructors by Gender") + ylim(0,7)

We will now generate a boxplot to show the differentiated ratings by gender for programmes (field : Process_E_CT).

C2Cagg%>%
ggplot(aes(x=GENDER, y=Process_E_CT)) + 
geom_boxplot(color="red", fill="orange", alpha=0.2) + 
ggtitle("Process Evaluation for Programme by Gender") + ylim(0,7)

Boxplots for Differentiated Process Evaluation

We will generate a boxplot to show the differentiated ratings for event logistics by event (field : Process_E_LS), but using dplyr pipes this time.

C2Cagg%>%
ggplot(aes(x=EVENT, y=Process_E_LS)) + 
geom_boxplot(color="red", fill="green", alpha=0.2) + 
ggtitle("Process Evaluation for Logistics by Event") + 
ylim(0,6) + coord_flip()

We will now generate a boxplot to show the differentiated ratings for event staff by event (field : Process_E_EX)

C2Cagg%>%
ggplot(aes(x=EVENT, y=Process_E_EX)) + 
geom_boxplot(color="red", fill="blue", alpha=0.2) + 
ggtitle("Process Evaluation for Instructors by Event") + 
ylim(0,6) + coord_flip()

We will now generate a boxplot to show the differentiated ratings for programme by event (field : Process_E_CT)

C2Cagg%>%
ggplot(aes(x=EVENT, y=Process_E_CT)) + 
geom_boxplot(color="red", fill="orange", alpha=0.2) + 
ggtitle("Process Evaluation for Instructors by Event") + 
ylim(0,6) + coord_flip()

Boxplots for Aggregated Factors Compared Across Gender

Comparison of Gender in MR1 Aggregated Factor Impacts

C2Cagg%>%
ggplot(aes(x=GENDER, y=X_MR1, fill=GENDER)) + 
geom_boxplot() + 
ggtitle("Gender x MR1 Gains") + ylim(-3,3)

## Warning: Removed 16 rows containing non-finite values (stat_boxplot).

Comparison of Gender in MR2 Aggregated Factor Impacts

C2Cagg%>%
ggplot(aes(x=GENDER, y=X_MR2, fill=GENDER)) + 
geom_boxplot() + 
ggtitle("Gender x MR2 Gains") + ylim(-3,3)

Comparison of Gender in MR3 Aggregated Factor Impacts

C2Cagg%>%
ggplot(aes(x=GENDER, y=X_MR3, fill=GENDER)) + 
geom_boxplot() + 
ggtitle("Gender x MR3 Gains") + ylim(-3,3)

## Warning: Removed 49 rows containing non-finite values (stat_boxplot).

Comparison of Gender in MR4 Aggregated Factor Impacts

C2Cagg%>%
ggplot(aes(x=GENDER, y=X_MR4, fill=GENDER)) + 
geom_boxplot() + 
ggtitle("Gender x MR4 Gains") + ylim(-3,3)

Comparison of Gender in MR5 Aggregated Factor Impacts

C2Cagg%>%
ggplot(aes(x=GENDER, y=X_MR5, fill=GENDER)) + 
geom_boxplot() + 
ggtitle("Gender x MR5 Gains") + ylim(-3,3)

## Warning: Removed 2 rows containing non-finite values (stat_boxplot).

Density Plots for Aggregated Factors Compared Across Experience

The following density plots are constructed to ascertain the impact of knowing someone with a mental health condition (field : NCSS_B03A), on the impact of specific aggregated factors.

Density Plot for NCSS_B03A and MR1

C2Cagg%>%
ggplot(aes(x=X_MR1, color=NCSS_B03A, fill=NCSS_B03A)) +
geom_density(alpha=0.3,size=1)+ 
labs(x= "Change in MR1 Aggregated Scores Following Event",
subtitle="",
caption="Kruskal-Wallis chi-squared = 148.25, df = 70, p-value = 1.495e-07")

Density Plot for NCSS_B03A and MR2

C2Cagg%>%
ggplot(aes(x=X_MR2, color=NCSS_B03A, fill=NCSS_B03A)) +
geom_density(alpha=0.3,size=1)+ 
labs(x= "Change in MR2 Aggregated Scores Following Event",
subtitle="",
caption="Kruskal-Wallis chi-squared = 45.786, df = 15, p-value = 5.749e-05")

Density Plot for NCSS_B03A and MR3

C2Cagg%>%
ggplot(aes(x=X_MR3, color=NCSS_B03A, fill=NCSS_B03A)) +
geom_density(alpha=0.3,size=1)+ 
labs(x= "Change in MR3 Aggregated Scores Following Event",
subtitle="",
caption="Kruskal-Wallis chi-squared = 125.34, df = 50, p-value = 2.1e-08")

Density Plot for NCSS_B03A and MR4

C2Cagg%>%
ggplot(aes(x=X_MR4, color=NCSS_B03A, fill=NCSS_B03A)) +
geom_density(alpha=0.3,size=1)+ 
labs(x= "Change in MR4 Aggregated Scores Following Event",
subtitle="",
caption="Kruskal-Wallis chi-squared = 44.57, df = 16, p-value = 0.0001615")

Density Plot for NCSS_B03A and MR5

C2Cagg%>%
ggplot(aes(x=X_MR5, color=NCSS_B03A, fill=NCSS_B03A)) +
geom_density(alpha=0.3,size=1)+ 
labs(x= "Change in MR5 Aggregated Scores Following Event",
subtitle="",
caption="Kruskal-Wallis chi-squared = 42.013, df = 20, p-value = 0.002755")

Ridgeline Plots for Aggregated Factors Compared Across Education Level

The following ridgeline plots are constructed to ascertain the impact of participants’ education level (field : NCSS_B05A), on the impact of specific aggregated factors.

Density Plot for NCSS_B05A and MR1

C2Cagg%>%
ggplot(aes(x = X_MR1, y = NCSS_B05, fill = NCSS_B05)) +
geom_density_ridges() +
theme_ridges() + 
theme(legend.position = "none")

## Picking joint bandwidth of 0.355

Density Plot for NCSS_B05A and MR2

C2Cagg%>%
ggplot(aes(x = X_MR2, y = NCSS_B05, fill = NCSS_B05)) +
geom_density_ridges() +
theme_ridges() + 
theme(legend.position = "none")

## Picking joint bandwidth of 0.211

Density Plot for NCSS_B05A and MR3

C2Cagg%>%
ggplot(aes(x = X_MR3, y = NCSS_B05, fill = NCSS_B05)) +
geom_density_ridges() +
theme_ridges() + 
theme(legend.position = "none")

## Picking joint bandwidth of 0.393

Density Plot for NCSS_B05A and MR4

C2Cagg%>%
ggplot(aes(x = X_MR4, y = NCSS_B05, fill = NCSS_B05)) +
geom_density_ridges() +
theme_ridges() + 
theme(legend.position = "none")

## Picking joint bandwidth of 0.233

Density Plot for NCSS_B05A and MR5

C2Cagg%>%
ggplot(aes(x = X_MR5, y = NCSS_B05, fill = NCSS_B05)) +
geom_density_ridges() +
theme_ridges() + 
theme(legend.position = "none")

## Picking joint bandwidth of 0.259

Violin Plots for Aggregated Factors Compared Across Engagement Level

The following violin plots are constructed to ascertain the impact of participants’ engagement level (field : PARTICIPATION), on the impact of specific aggregated factors.

First, we will need to ensure the PARTICIPATION field is a factor.

C2Cagg$PARTICIPATION <- as.factor(C2Cagg$PARTICIPATION)

Violin Plot for PARTICIPATION and MR1

C2Cagg%>%
ggplot(aes(x=PARTICIPATION, 
           y=X_MR1, 
           fill=PARTICIPATION)) + 
  geom_violin()

Violin Plot for PARTICIPATION and MR2

C2Cagg%>%
ggplot(aes(x=PARTICIPATION, 
           y=X_MR2, 
           fill=PARTICIPATION)) + 
  geom_violin()

Violin Plot for PARTICIPATION and MR3

C2Cagg%>%
ggplot(aes(x=PARTICIPATION, 
           y=X_MR3, 
           fill=PARTICIPATION)) + 
geom_violin()

Violin Plot for PARTICIPATION and MR4

C2Cagg%>%
ggplot(aes(x=PARTICIPATION, 
           y=X_MR4, 
           fill=PARTICIPATION)) + 
geom_violin()

Violin Plot for PARTICIPATION and MR5

C2Cagg%>%
ggplot(aes(x=PARTICIPATION, 
           y=X_MR5, 
           fill=PARTICIPATION)) + 
geom_violin()

Question for Review : What kind of data visualisations or representations would violin plots be most suited for ?

Building a Heatmap with a Data Subset

We will build a simple heatmap to show the impact of various events on the aggregated factors.

Impact on MR1 Compared Across Various Events

C2Cagg%>%
ggplot(aes(x=X_MR1, y=EVENT, fill = X_MR1)) + 
geom_tile() + 
xlab(label = "Impact on MR1 : Addressing Stigma") +
scale_fill_gradient(name = "MR1 Shift",
                      low = "#FFFFFF",
                      high = "#012345")

Impact on MR2 Compared Across Various Events

C2Cagg%>%
ggplot(aes(x=X_MR2, y=EVENT, fill = X_MR2)) + 
geom_tile() + 
xlab(label = "Impact on MR2 : Mental Health Literacy : Wishful Thinking") +
scale_fill_gradient(name = "MR2 Shift",
                      low = "#FFFFFF",
                      high = "#012345")

Impact on MR3 Compared Across Various Events

C2Cagg%>%
ggplot(aes(x=X_MR3, y=EVENT, fill = X_MR3)) + 
geom_tile() + 
xlab(label = "Impact on MR3 : Promoting Mental Health Advocacy") +
scale_fill_gradient(name = "MR3 Shift",
                      low = "#FFFFFF",
                      high = "#012345")

Impact on MR4 Compared Across Various Events

C2Cagg%>%
ggplot(aes(x=X_MR4, y=EVENT, fill = X_MR4)) + 
geom_tile() + 
xlab(label = "Impact on MR4 : Mental Health Literacy : Relationships") +
scale_fill_gradient(name = "MR4 Shift",
                      low = "#FFFFFF",
                      high = "#012345")

Impact on MR5 Compared Across Various Events

C2Cagg%>%
ggplot(aes(x=X_MR5, y=EVENT, fill = X_MR5)) + 
geom_tile() + 
xlab(label = "Impact on MR5 : Mental Health Literacy : Social Constructivism") +
scale_fill_gradient(name = "MR5 Shift",
                      low = "#FFFFFF",
                      high = "#012345")

And that completes the section on Exploratory Data Analysis. We will now proceed to the next section on Data Modelling

Data Modelling and Analysis

In this section, we will continue to use the CommunityᄅCampus dataset which was used in the previous section to carry out the following analyses.

Paired T-Test
Wilcoxon Rank-Signed Test
Two Sample T-Test
Mann-Whitney U-Test
One-Way and Two-Way ANOVA
Kruskal-Wallis H Test
Pearson’s Correlation Test
Spearman’s Rank Correlation Test

Loading the Data and Packages

This is the code for installation of Pacman which is used to load all packages for this section.

install.packages("pacman",repos = "http://cran.us.r-project.org")

## Warning: package 'pacman' is in use and will not be installed

Load the packages required for this section

pacman: for loading/unloading packages
psych: for psychometric functions
rio: for importing data
tidyverse: for data wrangling and visualisation functions
dplyr: specifically for data wrangling
ggplot2: specifically for ggplot functions
ggpubr: for enhancement of ggplot outputs for publication
AICcmodavg : for finding the best-fit ANOVA model

pacman::p_load(pacman, psych, rio, tidyverse, dplyr, magrittr, ggplot2, devtools, ggpubr, AICcmodavg, car)

You can now import the source data from the csv file which I have placed online at Github via the link below using the read.csv command.

https://raw.githubusercontent.com/aaron-chen-angus/community2campus/main/C2CsurveyAggregated.csv

C2Cagg <- read.csv(file = "https://raw.githubusercontent.com/aaron-chen-angus/community2campus/main/C2CsurveyAggregated.csv", header = TRUE, sep = ",")

Check on the output by reading the column names

C2Cagg %>% colnames()

##   [1] "UNQ_ID"        "AGE.RANGE"     "GENDER"        "NCSS_B03A"    
##   [5] "NCSS_B03B"     "NCSS_B04"      "NCSS_B05"      "EVENT"        
##   [9] "PARTICIPATION" "PART_ROLE"     "AFFILIATION"   "Process_E_LS" 
##  [13] "Process_E_EX"  "Process_E_CT"  "Pre_MR1"       "Post_MR1"     
##  [17] "X_MR1"         "Pre_MR2"       "Post_MR2"      "X_MR2"        
##  [21] "Pre_MR3"       "Post_MR3"      "X_MR3"         "Pre_MR4"      
##  [25] "Post_MR4"      "X_MR4"         "Pre_MR5"       "Post_MR5"     
##  [29] "X_MR5"         "Pre_AT_PT1"    "Pre_AT_PT2"    "Pre_AT_PT3"   
##  [33] "Pre_AT_WT5"    "Pre_AT_CT1"    "Pre_AT_CT2"    "Pre_AT_CT3"   
##  [37] "Pre_AT_CT4"    "Pre_AT_LA1"    "Pre_AT_LA2"    "Pre_AT_LA3"   
##  [41] "Pre_AT_SC6"    "Pre_ST_SD1"    "Pre_ST_SD2"    "Pre_ST_SD3"   
##  [45] "Pre_ST_SD4"    "Pre_NCSS_P2"   "Pre_AT_WT2"    "Pre_AT_WT3"   
##  [49] "Pre_AT_WT4"    "Pre_AT_LA4"    "Pre_AT_WT1"    "Pre_AT_CT5"   
##  [53] "Pre_AT_LA5"    "Pre_AT_SC2"    "Pre_ST_SD5"    "Pre_ST_SR2"   
##  [57] "Pre_ST_SR1"    "Pre_ST_SR3"    "Pre_ST_SR4"    "Pre_ST_SR5"   
##  [61] "Pre_NCSS_P1"   "Pre_NCSS_P3"   "Pre_NCSS_P4"   "Pre_NCSS_P5"  
##  [65] "Pre_AT_SC1"    "Pre_AT_SC3"    "Pre_AT_SC4"    "Pre_AT_SC5"   
##  [69] "Post_AT_PT1"   "Post_AT_PT2"   "Post_AT_PT3"   "Post_AT_WT5"  
##  [73] "Post_AT_CT1"   "Post_AT_CT2"   "Post_AT_CT3"   "Post_AT_CT4"  
##  [77] "Post_AT_LA1"   "Post_AT_LA2"   "Post_AT_LA3"   "Post_ST_SD1"  
##  [81] "Post_ST_SD2"   "Post_ST_SD3"   "Post_ST_SD4"   "Post_AT_SC6"  
##  [85] "Post_NCSS_P2"  "Post_AT_WT2"   "Post_AT_WT3"   "Post_AT_WT4"  
##  [89] "Post_AT_LA4"   "Post_AT_WT1"   "Post_AT_CT5"   "Post_AT_LA5"  
##  [93] "Post_AT_SC2"   "Post_ST_SD5"   "Post_ST_SR2"   "Post_ST_SR1"  
##  [97] "Post_ST_SR3"   "Post_ST_SR4"   "Post_ST_SR5"   "Post_NCSS_P1" 
## [101] "Post_NCSS_P3"  "Post_NCSS_P4"  "Post_NCSS_P5"  "Post_AT_SC1"  
## [105] "Post_AT_SC3"   "Post_AT_SC4"   "Post_AT_SC5"

Paired T-Test

The paired T-Test is a very useful statistical technique to compare pre- and post- data in a population. This is applicable when we have parametric data.

We will now carry out a paired T-Test (with the assumtion of normality) on the Pre- vs Post- values for the Aggregated Factors in the C2C survey.

Paired T-Test for MR1 Pre- and Post- Event

mean(C2Cagg$Pre_MR1)

## [1] 2.663851

mean(C2Cagg$Post_MR1)

## [1] 3.213682

t.test(x = C2Cagg$Pre_MR1, y = C2Cagg$Post_MR1, 
paired = T)

## 
##  Paired t-test
## 
## data:  C2Cagg$Pre_MR1 and C2Cagg$Post_MR1
## t = -18.308, df = 961, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.6087675 -0.4908946
## sample estimates:
## mean of the differences 
##              -0.5498311

Paired T-Test for MR2 Pre- and Post- Event

mean(C2Cagg$Pre_MR2)

## [1] 2.834511

mean(C2Cagg$Post_MR2)

## [1] 2.859459

t.test(x = C2Cagg$Pre_MR2, y = C2Cagg$Post_MR2, 
paired = T)

## 
##  Paired t-test
## 
## data:  C2Cagg$Pre_MR2 and C2Cagg$Post_MR2
## t = -1.7998, df = 961, p-value = 0.07221
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.052151106  0.002255056
## sample estimates:
## mean of the differences 
##             -0.02494802

Paired T-Test for MR3 Pre- and Post- Event

mean(C2Cagg$Pre_MR3)

## [1] 2.926819

mean(C2Cagg$Post_MR3)

## [1] 3.68711

t.test(x = C2Cagg$Pre_MR3, y = C2Cagg$Post_MR3, 
paired = T)

## 
##  Paired t-test
## 
## data:  C2Cagg$Pre_MR3 and C2Cagg$Post_MR3
## t = -20.62, df = 961, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.8326508 -0.6879313
## sample estimates:
## mean of the differences 
##              -0.7602911

Paired T-Test for MR4 Pre- and Post- Event

mean(C2Cagg$Pre_MR4)

## [1] 2.898909

mean(C2Cagg$Post_MR4)

## [1] 3.015073

t.test(x = C2Cagg$Pre_MR4, y = C2Cagg$Post_MR4, 
paired = T)

## 
##  Paired t-test
## 
## data:  C2Cagg$Pre_MR4 and C2Cagg$Post_MR4
## t = -6.9026, df = 961, p-value = 9.265e-12
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.14919031 -0.08313818
## sample estimates:
## mean of the differences 
##              -0.1161642

Paired T-Test for MR5 Pre- and Post- Event

mean(C2Cagg$Pre_MR5)

## [1] 2.683472

mean(C2Cagg$Post_MR5)

## [1] 2.97921

t.test(x = C2Cagg$Pre_MR5, y = C2Cagg$Post_MR5, 
paired = T)

## 
##  Paired t-test
## 
## data:  C2Cagg$Pre_MR5 and C2Cagg$Post_MR5
## t = -12.255, df = 961, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.3430950 -0.2483811
## sample estimates:
## mean of the differences 
##               -0.295738

Wilcoxon Rank-Signed Test

The Wilcoxon Rank-Signed Test is a used to compare pre- and post- data when we have non-parametric data.

We will now carry out a Wilcoxon Rank-Signed Test (with the assumtion of non-normality) on the Pre- vs Post- values for the Aggregated Factors in the C2C survey.

Wilcoxon Rank-Signed Test for MR1 Pre- and Post- Event

mean(C2Cagg$Pre_MR1)

## [1] 2.663851

mean(C2Cagg$Post_MR1)

## [1] 3.213682

wilcox.test(C2Cagg$Pre_MR1, C2Cagg$Post_MR1, paired=TRUE)

## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  C2Cagg$Pre_MR1 and C2Cagg$Post_MR1
## V = 19053, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0

Wilcoxon Rank-Signed Test for MR2 Pre- and Post- Event

mean(C2Cagg$Pre_MR2)

## [1] 2.834511

mean(C2Cagg$Post_MR2)

## [1] 2.859459

wilcox.test(C2Cagg$Pre_MR2, C2Cagg$Post_MR2, paired=TRUE)

## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  C2Cagg$Pre_MR2 and C2Cagg$Post_MR2
## V = 52749, p-value = 0.02623
## alternative hypothesis: true location shift is not equal to 0

Wilcoxon Rank-Signed Test for MR3 Pre- and Post- Event

mean(C2Cagg$Pre_MR3)

## [1] 2.926819

mean(C2Cagg$Post_MR3)

## [1] 3.68711

wilcox.test(C2Cagg$Pre_MR3, C2Cagg$Post_MR3, paired=TRUE)

## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  C2Cagg$Pre_MR3 and C2Cagg$Post_MR3
## V = 11818, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0

Wilcoxon Rank-Signed Test for MR4 Pre- and Post- Event

mean(C2Cagg$Pre_MR4)

## [1] 2.898909

mean(C2Cagg$Post_MR4)

## [1] 3.015073

wilcox.test(C2Cagg$Pre_MR4, C2Cagg$Post_MR4, paired=TRUE)

## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  C2Cagg$Pre_MR4 and C2Cagg$Post_MR4
## V = 25499, p-value = 2.155e-12
## alternative hypothesis: true location shift is not equal to 0

Wilcoxon Rank-Signed Test for MR5 Pre- and Post- Event

mean(C2Cagg$Pre_MR5)

## [1] 2.683472

mean(C2Cagg$Post_MR5)

## [1] 2.97921

wilcox.test(C2Cagg$Pre_MR5, C2Cagg$Post_MR5, paired=TRUE)

## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  C2Cagg$Pre_MR5 and C2Cagg$Post_MR5
## V = 27777, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0

Two Sample T-Test

The two sample T-Test is used to compare the means of two samples for parametric data.

Comparison of the Means of the Impact on MR1 for 2 Events

We will compare the impact on two events with reference to the aggregated factor MR1 (Stigma).

Compare the means of the MR1 impact for both events

mean(C2Cagg$X_MR1[C2Cagg$EVENT=="12. What A Wonderful World"])

## [1] 1.599223

mean(C2Cagg$X_MR1[C2Cagg$EVENT=="2. Kindred Spirit Series Virtual Challenge"])

## [1] 0.9166667

Perform an F test to compare two variances

MR1.WAWW <- C2Cagg$X_MR1[C2Cagg$EVENT=="12. What A Wonderful World"]
MR1.Kindred <- C2Cagg$X_MR1[C2Cagg$EVENT=="2. Kindred Spirit Series Virtual Challenge"]
var.test(x = MR1.WAWW, y = MR1.Kindred)

## 
##  F test to compare two variances
## 
## data:  MR1.WAWW and MR1.Kindred
## F = 0.65682, num df = 176, denom df = 251, p-value = 0.003031
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.5014253 0.8661124
## sample estimates:
## ratio of variances 
##          0.6568221

Perform a two-sample t-Test for Unequal Variance (Welch)

t.test(x = MR1.WAWW, y = MR1.Kindred, var.equal = F)

## 
##  Welch Two Sample t-test
## 
## data:  MR1.WAWW and MR1.Kindred
## t = 7.7723, df = 418.28, p-value = 6.035e-14
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.5099358 0.8551772
## sample estimates:
## mean of x mean of y 
## 1.5992232 0.9166667

If the Variance was Equal, Perform this Test Instead (Default)

t.test(x = MR1.WAWW, y = MR1.Kindred, var.equal = T)

## 
##  Two Sample t-test
## 
## data:  MR1.WAWW and MR1.Kindred
## t = 7.4952, df = 427, p-value = 3.853e-13
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.5035635 0.8615495
## sample estimates:
## mean of x mean of y 
## 1.5992232 0.9166667

Mann-Whitney U-Test

The Mann-Whitney U-Test is used to compare the means of two samples for non-parametric data.

Comparison of the Means of the Impact on MR3 for 2 Events

We will compare the impact on two events with reference to the aggregated factor MR3 (Health Advocacy).

Compare the means of the MR3 impact for both events

mean(C2Cagg$X_MR3[C2Cagg$EVENT=="12. What A Wonderful World"])

## [1] 2.155932

mean(C2Cagg$X_MR3[C2Cagg$EVENT=="2. Kindred Spirit Series Virtual Challenge"])

## [1] 1.268254

Declare the Variables in Preparation for Mann-Whitney U Test

MR3.WAWW <- C2Cagg$X_MR3[C2Cagg$EVENT=="12. What A Wonderful World"]
MR3.Kindred <- C2Cagg$X_MR3[C2Cagg$EVENT=="2. Kindred Spirit Series Virtual Challenge"]

Perform a Mann-Whitney U Test (basically an Unpaired Wilcox Test)

wilcox.test(MR3.WAWW, MR3.Kindred, paired=FALSE)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  MR3.WAWW and MR3.Kindred
## W = 31454, p-value = 3.636e-13
## alternative hypothesis: true location shift is not equal to 0

One-Way and Two-Way ANOVA

One-Way ANOVA

The one-way analysis of variance (ANOVA) is used to determine whether there are any statistically significant differences between the means of three or more independent (unrelated) groups. This is applicable for parametric data.

In this one-way ANOVA example, we will model the impact on MR1 (stigma) as a function of the type of event participated in.

one.way <- aov(C2Cagg$X_MR1 ~ C2Cagg$EVENT, data = C2Cagg)
summary(one.way)

##               Df Sum Sq Mean Sq F value Pr(>F)    
## C2Cagg$EVENT   6  397.2   66.21   144.8 <2e-16 ***
## Residuals    955  436.6    0.46                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The model summary first lists the independent variables being tested in the model (in this case we have only one, ‘C2Cagg$EVENT’ or EVENT) and the model residuals (‘Residual’). All of the variation that is not explained by the independent variables is called residual variance.

The rest of the values in the output table describe the independent variable and the residuals:

The Df column displays the degrees of freedom for the independent variable (the number of levels in the variable minus 1), and the degrees of freedom for the residuals (the total number of observations minus one and minus the number of levels in the independent variables).
The Sum Sq column displays the sum of squares (a.k.a. the total variation between the group means and the overall mean).
The Mean Sq column is the mean of the sum of squares, calculated by dividing the sum of squares by the degrees of freedom for each parameter.
The F-value column is the test statistic from the F test. This is the mean square of each independent variable divided by the mean square of the residuals. The larger the F value, the more likely it is that the variation caused by the independent variable is real and not due to chance.
The Pr(>F) column is the p-value of the F-statistic. This shows how likely it is that the F-value calculated from the test would have occurred if the null hypothesis of no difference among group means were true.

The p-value of the EVENT variable is low (p < 0.001), so it appears that the type of EVENT participated in has a real impact on the mental health stigma (MR1) levels.

Two-Way ANOVA

In this two-way ANOVA example, we will model the impact on MR1 (stigma) as a function of the type of event participated in, as well as the gender of the partipants.

two.way <- aov(C2Cagg$X_MR1 ~ C2Cagg$EVENT + C2Cagg$GENDER, data = C2Cagg)
summary(two.way)

##                Df Sum Sq Mean Sq F value Pr(>F)    
## C2Cagg$EVENT    6  397.2   66.21 145.522 <2e-16 ***
## C2Cagg$GENDER   1    2.6    2.56   5.625 0.0179 *  
## Residuals     954  434.0    0.45                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Adding interactions between variables

Sometimes you have reason to think that two of your independent variables have an interaction effect rather than an additive effect.

interaction <- aov(C2Cagg$X_MR1 ~ C2Cagg$EVENT*C2Cagg$GENDER, data = C2Cagg)
summary(interaction)

##                             Df Sum Sq Mean Sq F value Pr(>F)    
## C2Cagg$EVENT                 6  397.2   66.21 145.642 <2e-16 ***
## C2Cagg$GENDER                1    2.6    2.56   5.630 0.0179 *  
## C2Cagg$EVENT:C2Cagg$GENDER   6    3.1    0.51   1.131 0.3420    
## Residuals                  948  430.9    0.45                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Finding the Best-Fit Model

There are now three different ANOVA models to explain the data. How do you decide which one to use? Usually you’ll want to use the ‘best-fit’ model – the model that best explains the variation in the dependent variable.

The Akaike information criterion (AIC) is a good test for model fit. AIC calculates the information value of each model by balancing the variation explained against the number of parameters used.

In AIC model selection, we compare the information value of each model and choose the one with the lowest AIC value (a lower number means more information explained!)

library(AICcmodavg)
model.set <- list(one.way, two.way, interaction)
model.names <- c("one.way", "two.way", "interaction")
aictab(model.set, modnames = model.names)

## 
## Model selection based on AICc:
## 
##              K    AICc Delta_AICc AICcWt Cum.Wt      LL
## two.way      9 1982.57       0.00   0.81   0.81 -982.19
## one.way      8 1986.19       3.62   0.13   0.95 -985.02
## interaction 15 1988.02       5.46   0.05   1.00 -978.76

The model with the lowest AIC score (listed first in the table) is the best fit for the data, which in this case would be the two.way model.

Check for homoscedasticity

To check whether the model fits the assumption of homoscedasticity, look at the model diagnostic plots in R using the plot() function:

par(mfrow=c(2,2))
plot(two.way)

par(mfrow=c(1,1))

The diagnostic plots show the unexplained variance (residuals) across the range of the observed data.

Each plot gives a specific piece of information about the model fit, but it’s enough to know that the red line representing the mean of the residuals should be horizontal and centered on zero (or on one, in the scale-location plot), meaning that there are no large outliers that would cause bias in the model.

The normal Q-Q plot plots a regression between the theoretical residuals of a perfectly-homoscedastic model and the actual residuals of your model, so the closer to a slope of 1 this is the better. This Q-Q plot is very close, with only a bit of deviation.

From these diagnostic plots we can say that the model fits the assumption of homoscedasticity.

If your model doesn’t fit the assumption of homoscedasticity, you can try the Kruskall-Wallis test instead. This is discussed in the next section.

Do a post-hoc test

ANOVA tells us if there are differences among group means, but not what the differences are. To find out which groups are statistically different from one another, you can perform a Tukey’s Honestly Significant Difference (Tukey’s HSD) post-hoc test for pairwise comparisons:

tukey.two.way<-TukeyHSD(two.way)
tukey.two.way

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = C2Cagg$X_MR1 ~ C2Cagg$EVENT + C2Cagg$GENDER, data = C2Cagg)
## 
## $`C2Cagg$EVENT`
##                                                                                                        diff
## 11. Off Centre Watch Party & Aftershow-10. C2C Social Media Feature                             0.483262885
## 12. What A Wonderful World-10. C2C Social Media Feature                                         1.918302111
## 2. Kindred Spirit Series Virtual Challenge-10. C2C Social Media Feature                         1.235745614
## 3. Bagan 2020 International Friendship Virtual Race-10. C2C Social Media Feature                0.285951739
## 3. Next Hitmaker Dance Competition-10. C2C Social Media Feature                                 1.787828947
## 5. NBFA Sports Model Championships 2020-10. C2C Social Media Feature                            0.001771255
## 12. What A Wonderful World-11. Off Centre Watch Party & Aftershow                               1.435039226
## 2. Kindred Spirit Series Virtual Challenge-11. Off Centre Watch Party & Aftershow               0.752482729
## 3. Bagan 2020 International Friendship Virtual Race-11. Off Centre Watch Party & Aftershow     -0.197311146
## 3. Next Hitmaker Dance Competition-11. Off Centre Watch Party & Aftershow                       1.304566062
## 5. NBFA Sports Model Championships 2020-11. Off Centre Watch Party & Aftershow                 -0.481491630
## 2. Kindred Spirit Series Virtual Challenge-12. What A Wonderful World                          -0.682556497
## 3. Bagan 2020 International Friendship Virtual Race-12. What A Wonderful World                 -1.632350372
## 3. Next Hitmaker Dance Competition-12. What A Wonderful World                                  -0.130473164
## 5. NBFA Sports Model Championships 2020-12. What A Wonderful World                             -1.916530856
## 3. Bagan 2020 International Friendship Virtual Race-2. Kindred Spirit Series Virtual Challenge -0.949793875
## 3. Next Hitmaker Dance Competition-2. Kindred Spirit Series Virtual Challenge                   0.552083333
## 5. NBFA Sports Model Championships 2020-2. Kindred Spirit Series Virtual Challenge             -1.233974359
## 3. Next Hitmaker Dance Competition-3. Bagan 2020 International Friendship Virtual Race          1.501877208
## 5. NBFA Sports Model Championships 2020-3. Bagan 2020 International Friendship Virtual Race    -0.284180484
## 5. NBFA Sports Model Championships 2020-3. Next Hitmaker Dance Competition                     -1.786057692
##                                                                                                        lwr
## 11. Off Centre Watch Party & Aftershow-10. C2C Social Media Feature                             0.12957100
## 12. What A Wonderful World-10. C2C Social Media Feature                                         1.56199019
## 2. Kindred Spirit Series Virtual Challenge-10. C2C Social Media Feature                         0.88893172
## 3. Bagan 2020 International Friendship Virtual Race-10. C2C Social Media Feature               -0.05836414
## 3. Next Hitmaker Dance Competition-10. C2C Social Media Feature                                 0.91234443
## 5. NBFA Sports Model Championships 2020-10. C2C Social Media Feature                           -0.63856954
## 12. What A Wonderful World-11. Off Centre Watch Party & Aftershow                               1.22763141
## 2. Kindred Spirit Series Virtual Challenge-11. Off Centre Watch Party & Aftershow               0.56185300
## 3. Bagan 2020 International Friendship Virtual Race-11. Off Centre Watch Party & Aftershow     -0.38335750
## 3. Next Hitmaker Dance Competition-11. Off Centre Watch Party & Aftershow                       0.47841054
## 5. NBFA Sports Model Championships 2020-11. Off Centre Watch Party & Aftershow                 -1.05253992
## 2. Kindred Spirit Series Virtual Challenge-12. What A Wonderful World                          -0.87800453
## 3. Bagan 2020 International Friendship Virtual Race-12. What A Wonderful World                 -1.82333069
## 3. Next Hitmaker Dance Competition-12. What A Wonderful World                                  -0.95775376
## 5. NBFA Sports Model Championships 2020-12. What A Wonderful World                             -2.48920562
## 3. Bagan 2020 International Friendship Virtual Race-2. Kindred Spirit Series Virtual Challenge -1.12240666
## 3. Next Hitmaker Dance Competition-2. Kindred Spirit Series Virtual Challenge                  -0.27115107
## 5. NBFA Sports Model Championships 2020-2. Kindred Spirit Series Virtual Challenge             -1.80078833
## 3. Next Hitmaker Dance Competition-3. Bagan 2020 International Friendship Virtual Race          0.67969205
## 5. NBFA Sports Model Championships 2020-3. Bagan 2020 International Friendship Virtual Race    -0.84946946
## 5. NBFA Sports Model Championships 2020-3. Next Hitmaker Dance Competition                     -2.76965890
##                                                                                                        upr
## 11. Off Centre Watch Party & Aftershow-10. C2C Social Media Feature                             0.83695477
## 12. What A Wonderful World-10. C2C Social Media Feature                                         2.27461403
## 2. Kindred Spirit Series Virtual Challenge-10. C2C Social Media Feature                         1.58255951
## 3. Bagan 2020 International Friendship Virtual Race-10. C2C Social Media Feature                0.63026762
## 3. Next Hitmaker Dance Competition-10. C2C Social Media Feature                                 2.66331346
## 5. NBFA Sports Model Championships 2020-10. C2C Social Media Feature                            0.64211205
## 12. What A Wonderful World-11. Off Centre Watch Party & Aftershow                               1.64244705
## 2. Kindred Spirit Series Virtual Challenge-11. Off Centre Watch Party & Aftershow               0.94311246
## 3. Bagan 2020 International Friendship Virtual Race-11. Off Centre Watch Party & Aftershow     -0.01126479
## 3. Next Hitmaker Dance Competition-11. Off Centre Watch Party & Aftershow                       2.13072159
## 5. NBFA Sports Model Championships 2020-11. Off Centre Watch Party & Aftershow                  0.08955666
## 2. Kindred Spirit Series Virtual Challenge-12. What A Wonderful World                          -0.48710847
## 3. Bagan 2020 International Friendship Virtual Race-12. What A Wonderful World                 -1.44137005
## 3. Next Hitmaker Dance Competition-12. What A Wonderful World                                   0.69680744
## 5. NBFA Sports Model Championships 2020-12. What A Wonderful World                             -1.34385609
## 3. Bagan 2020 International Friendship Virtual Race-2. Kindred Spirit Series Virtual Challenge -0.77718109
## 3. Next Hitmaker Dance Competition-2. Kindred Spirit Series Virtual Challenge                   1.37531774
## 5. NBFA Sports Model Championships 2020-2. Kindred Spirit Series Virtual Challenge             -0.66716039
## 3. Next Hitmaker Dance Competition-3. Bagan 2020 International Friendship Virtual Race          2.32406237
## 5. NBFA Sports Model Championships 2020-3. Bagan 2020 International Friendship Virtual Race     0.28110849
## 5. NBFA Sports Model Championships 2020-3. Next Hitmaker Dance Competition                     -0.80245648
##                                                                                                    p adj
## 11. Off Centre Watch Party & Aftershow-10. C2C Social Media Feature                            0.0011425
## 12. What A Wonderful World-10. C2C Social Media Feature                                        0.0000000
## 2. Kindred Spirit Series Virtual Challenge-10. C2C Social Media Feature                        0.0000000
## 3. Bagan 2020 International Friendship Virtual Race-10. C2C Social Media Feature               0.1776993
## 3. Next Hitmaker Dance Competition-10. C2C Social Media Feature                                0.0000000
## 5. NBFA Sports Model Championships 2020-10. C2C Social Media Feature                           1.0000000
## 12. What A Wonderful World-11. Off Centre Watch Party & Aftershow                              0.0000000
## 2. Kindred Spirit Series Virtual Challenge-11. Off Centre Watch Party & Aftershow              0.0000000
## 3. Bagan 2020 International Friendship Virtual Race-11. Off Centre Watch Party & Aftershow     0.0293993
## 3. Next Hitmaker Dance Competition-11. Off Centre Watch Party & Aftershow                      0.0000718
## 5. NBFA Sports Model Championships 2020-11. Off Centre Watch Party & Aftershow                 0.1635139
## 2. Kindred Spirit Series Virtual Challenge-12. What A Wonderful World                          0.0000000
## 3. Bagan 2020 International Friendship Virtual Race-12. What A Wonderful World                 0.0000000
## 3. Next Hitmaker Dance Competition-12. What A Wonderful World                                  0.9992353
## 5. NBFA Sports Model Championships 2020-12. What A Wonderful World                             0.0000000
## 3. Bagan 2020 International Friendship Virtual Race-2. Kindred Spirit Series Virtual Challenge 0.0000000
## 3. Next Hitmaker Dance Competition-2. Kindred Spirit Series Virtual Challenge                  0.4271694
## 5. NBFA Sports Model Championships 2020-2. Kindred Spirit Series Virtual Challenge             0.0000000
## 3. Next Hitmaker Dance Competition-3. Bagan 2020 International Friendship Virtual Race         0.0000018
## 5. NBFA Sports Model Championships 2020-3. Bagan 2020 International Friendship Virtual Race    0.7536613
## 5. NBFA Sports Model Championships 2020-3. Next Hitmaker Dance Competition                     0.0000021
## 
## $`C2Cagg$GENDER`
##                   diff        lwr         upr     p adj
## Male-Female -0.1019972 -0.1881175 -0.01587687 0.0203216

Plot the results in a graph

tukey.plot.aov<-aov(C2Cagg$X_MR1 ~ C2Cagg$EVENT:C2Cagg$GENDER, data = C2Cagg)

Instead of printing the TukeyHSD results in a table, we’ll do it in a graph.

tukey.plot.test<-TukeyHSD(tukey.plot.aov)
plot(tukey.plot.test, las = 1)

Kruskal-Wallis H Test

The Kruskal Wallis test in R is a non-parametric method to test whether multiple groups are identically distributed or not.

We will use this example to model the impact on MR3 (advocacy) as a function of the type of event participated in.

kruskal.test(C2Cagg$X_MR3 ~ C2Cagg$EVENT, data = C2Cagg)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  C2Cagg$X_MR3 by C2Cagg$EVENT
## Kruskal-Wallis chi-squared = 489.26, df = 6, p-value < 2.2e-16

We can now carry out a post-hoc test by applying a pairwise Wilcoxon rank test.

pairwise.wilcox.test(C2Cagg$X_MR3, C2Cagg$EVENT, exact=F)

## 
##  Pairwise comparisons using Wilcoxon rank sum test 
## 
## data:  C2Cagg$X_MR3 and C2Cagg$EVENT 
## 
##                                                     10. C2C Social Media Feature
## 11. Off Centre Watch Party & Aftershow              0.01154                     
## 12. What A Wonderful World                          < 2e-16                     
## 2. Kindred Spirit Series Virtual Challenge          1.8e-09                     
## 3. Bagan 2020 International Friendship Virtual Race 0.51179                     
## 3. Next Hitmaker Dance Competition                  0.00128                     
## 5. NBFA Sports Model Championships 2020             1.00000                     
##                                                     11. Off Centre Watch Party & Aftershow
## 11. Off Centre Watch Party & Aftershow              -                                     
## 12. What A Wonderful World                          < 2e-16                               
## 2. Kindred Spirit Series Virtual Challenge          < 2e-16                               
## 3. Bagan 2020 International Friendship Virtual Race 9.1e-07                               
## 3. Next Hitmaker Dance Competition                  0.00058                               
## 5. NBFA Sports Model Championships 2020             0.24189                               
##                                                     12. What A Wonderful World
## 11. Off Centre Watch Party & Aftershow              -                         
## 12. What A Wonderful World                          -                         
## 2. Kindred Spirit Series Virtual Challenge          5.8e-12                   
## 3. Bagan 2020 International Friendship Virtual Race < 2e-16                   
## 3. Next Hitmaker Dance Competition                  0.34067                   
## 5. NBFA Sports Model Championships 2020             4.2e-08                   
##                                                     2. Kindred Spirit Series Virtual Challenge
## 11. Off Centre Watch Party & Aftershow              -                                         
## 12. What A Wonderful World                          -                                         
## 2. Kindred Spirit Series Virtual Challenge          -                                         
## 3. Bagan 2020 International Friendship Virtual Race < 2e-16                                   
## 3. Next Hitmaker Dance Competition                  1.00000                                   
## 5. NBFA Sports Model Championships 2020             0.00064                                   
##                                                     3. Bagan 2020 International Friendship Virtual Race
## 11. Off Centre Watch Party & Aftershow              -                                                  
## 12. What A Wonderful World                          -                                                  
## 2. Kindred Spirit Series Virtual Challenge          -                                                  
## 3. Bagan 2020 International Friendship Virtual Race -                                                  
## 3. Next Hitmaker Dance Competition                  1.4e-07                                            
## 5. NBFA Sports Model Championships 2020             0.75094                                            
##                                                     3. Next Hitmaker Dance Competition
## 11. Off Centre Watch Party & Aftershow              -                                 
## 12. What A Wonderful World                          -                                 
## 2. Kindred Spirit Series Virtual Challenge          -                                 
## 3. Bagan 2020 International Friendship Virtual Race -                                 
## 3. Next Hitmaker Dance Competition                  -                                 
## 5. NBFA Sports Model Championships 2020             0.00784                           
## 
## P value adjustment method: holm

Pearson’s Correlation Test

Pearson correlation (r), which measures a linear dependence between two variables (x and y). It’s also known as a parametric correlation test because it depends to the distribution of the data. It can be used only when x and y are from a normal distribution.

We will look at whether there is a correlation between MR1 and MR3 impact.

Q-Q Plots Visualisation with ggpubr

library(ggpubr)
x = C2Cagg$X_MR1
y = C2Cagg$X_MR3
qqplot(x, y, xlab = "MR1", ylab = "MR3", main = "Q-Q Plot")

Shapiro-Wilk test for Normality

shapiro.test(C2Cagg$X_MR1)

## 
##  Shapiro-Wilk normality test
## 
## data:  C2Cagg$X_MR1
## W = 0.85049, p-value < 2.2e-16

shapiro.test(C2Cagg$X_MR3)

## 
##  Shapiro-Wilk normality test
## 
## data:  C2Cagg$X_MR3
## W = 0.81902, p-value < 2.2e-16

Pearson Correlation Test for Normally Distributed Data

cor.test(C2Cagg$X_MR1, C2Cagg$X_MR3, method = "pearson")

## 
##  Pearson's product-moment correlation
## 
## data:  C2Cagg$X_MR1 and C2Cagg$X_MR3
## t = 49.635, df = 960, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8295650 0.8651133
## sample estimates:
##       cor 
## 0.8482922

Spearman Rank Correlation Test

Spearman Rank Correlation is used to measure the correlation between two ranked variables, and can be applied when the normality test in the previous section fails.

Let’s try this out on the same dataset from the previous section, namely MR1 and MR3.

cor.test(C2Cagg$X_MR1, C2Cagg$X_MR3, method = "spearman")

## Warning in cor.test.default(C2Cagg$X_MR1, C2Cagg$X_MR3, method = "spearman"):
## Cannot compute exact p-value with ties

## 
##  Spearman's rank correlation rho
## 
## data:  C2Cagg$X_MR1 and C2Cagg$X_MR3
## S = 32783963, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.7790531

Decision Trees and Random Forests

In this section, we will continue to use the CommunityᄅCampus dataset to develop predictive models on stigma with decision trees and random forests. These are supervised learning algorithms that can be used for classification or regression.

Brief Introduction to Decision Trees

A decision tree is a supervised machine learning algorithm that can be used for both classification and regression problems. A decision tree is simply a series of sequential decisions made to reach a specific result. We will use the example below to briefly explain what this is about. Let’s understand how this tree works.

First, it checks if the customer has a good credit history. Based on that, it classifies the customer into two groups, i.e., customers with good credit history and customers with bad credit history. Then, it checks the income of the customer and again classifies him/her into two groups. Finally, it checks the loan amount requested by the customer. Based on the outcomes from checking these three features, the decision tree decides if the customer’s loan should be approved or not.

The features/attributes and conditions can change based on the data and complexity of the problem but the overall idea remains the same. So, a decision tree makes a series of decisions based on a set of features/attributes present in the data, which in this case were credit history, income, and loan amount.

Now, you might be wondering:

Why did the decision tree check the credit score first and not the income?

This is known as feature importance and the sequence of attributes to be checked is decided on the basis of criteria like Gini Impurity Index or Information Gain.

An Overview of Random Forest

The decision tree algorithm is quite easy to understand and interpret. But often, a single tree is not sufficient for producing effective results. This is where the Random Forest algorithm comes into the picture.

Random Forest is a tree-based machine learning algorithm that leverages the power of multiple decision trees for making decisions. As the name suggests, it is a “forest” of trees!

But why do we call it a “random” forest? That’s because it is a forest of randomly created decision trees. Each node in the decision tree works on a random subset of features to calculate the output. The random forest then combines the output of individual decision trees to generate the final output.

In simple words:

The Random Forest Algorithm combines the output of multiple (randomly created) Decision Trees to generate the final output.

This process of combining the output of multiple individual models (also known as weak learners) is called Ensemble Learning.

Loading the Data and Packages

This is the code for installation of Pacman which is used to load all packages for this section. You have used it in Section 01, 02 and 03 too.

install.packages("pacman",repos = "http://cran.us.r-project.org")

## Warning: package 'pacman' is in use and will not be installed

Load the packages required for this section

pacman
psych
rio
magrittr
ggplot2
tidyverse
tidymodels
vip
rpart
rpart.plot
ranger

pacman::p_load(pacman, psych, rio, magrittr, ggplot2, tidyverse, tidymodels, vip, rpart, rpart.plot, ranger)

You can now import the source data from the csv file which I have placed online at Github via the link below using the read.csv command.

https://raw.githubusercontent.com/aaron-chen-angus/community2campus/main/C2Cdecisiontree.csv

C2CdTx <- read.csv(file = "https://raw.githubusercontent.com/aaron-chen-angus/community2campus/main/C2Cdecisiontree.csv", header = TRUE, sep = ",")

Check on the output by reading the column names

C2CdTx %>% colnames()

##  [1] "UNQ_ID"        "AGE.RANGE"     "GENDER"        "NCSS_B03A"    
##  [5] "NCSS_B04"      "NCSS_B05"      "PARTICIPATION" "Process_E_LS" 
##  [9] "Process_E_EX"  "Process_E_CT"  "X_MR1"         "X_MR2"        
## [13] "X_MR3"         "X_MR4"         "X_MR5"         "I_MR1"        
## [17] "I_MR2"         "I_MR3"         "I_MR4"         "I_MR5"

Lets now prepare the data that we want to use for the decision tree model.

C2CdT <- C2CdTx%>%
select(AGE.RANGE, GENDER, NCSS_B03A, NCSS_B04, NCSS_B05, PARTICIPATION, Process_E_LS, Process_E_EX, Process_E_CT, I_MR1)

We will now convert some variables to factors so that the process of decision tree construction can carry on as required.

C2CdT$AGE.RANGE <- factor(C2CdT$AGE.RANGE)
C2CdT$GENDER <- factor(C2CdT$GENDER)
C2CdT$NCSS_B03A <- factor(C2CdT$NCSS_B03A)
C2CdT$NCSS_B04 <- factor(C2CdT$NCSS_B04)
C2CdT$NCSS_B05 <- factor(C2CdT$NCSS_B05)
C2CdT$I_MR1 <- factor(C2CdT$I_MR1)

We will be working with the C2CdT data frame in this session. This code shows the first 6 rows of the dataset.

head(C2CdT)

##         AGE.RANGE GENDER NCSS_B03A NCSS_B04            NCSS_B05 PARTICIPATION
## 1 18-24 years old Female        No       No       7 Polytechnic             7
## 2 18-24 years old   Male       Yes       No       7 Polytechnic             6
## 3 18-24 years old Female       Yes       No       7 Polytechnic             4
## 4 18-24 years old Female       Yes       No       7 Polytechnic             3
## 5 18-24 years old Female       Yes       No       7 Polytechnic             9
## 6 18-24 years old Female       Yes       No 8 University Degree            10
##   Process_E_LS Process_E_EX Process_E_CT I_MR1
## 1            5            5            5     N
## 2            5            5            5     N
## 3            5            5            5     Y
## 4            5            5            5     N
## 5            5            5            5     N
## 6            5            5            5     N

A row in this data frame represents a participant in any of the Community2Campus events. Each participant has completed both pre- and post- event surveys which contain the specific data fields.

The response variable in this data is I_MR1 which indicates whether there is a significant positive impact on the aggregated variable MR1 which is a measure of the impact on self- and social stigma associated with mental health.

Decision Trees

Decision Trees

To demonstrate fitting decision trees, we will use the C2CdT data set and predict I_MR1 using all available predictor variables.

A decision tree is specified with the decision_tree() function from tidymodels and has three hyperparameters, cost_complexity, tree_depth, and min_n. Since we will need to perform hyperparameter tuning, we will create cross validation folds from our training data.

Data Splitting

We will split the data into a training and test set. The training data will be further divided into 5 folds for hyperparameter tuning.

set.seed(314) # Remember to always set your seed. Any integer will work

C2CdT_split <- initial_split(C2CdT, prop = 0.75, 
                             strata = I_MR1)

C2CdT_training <- C2CdT_split %>% training()

C2CdT_test <- C2CdT_split %>% testing()

# Create folds for cross validation on the training data set
## These will be used to tune model hyperparameters
set.seed(314)

C2CdT_folds <- vfold_cv(C2CdT_training, v = 5)

Feature Engineering

Now we can create a feature engineering recipe for this data. We will train the following transformations on our training data.

Remove skewness from numeric predictors
Normalize all numeric predictors
Create dummy variables for all nominal predictors

C2CdT_recipe <- recipe(I_MR1 ~ ., data = C2CdT_training) %>% 
                       step_YeoJohnson(all_numeric(), -all_outcomes()) %>% 
                       step_normalize(all_numeric(), -all_outcomes()) %>% 
                       step_dummy(all_nominal(), -all_outcomes())

Let’s check to see if the feature engineering steps have been carried out correctly.

C2CdT_recipe %>% 
  prep() %>% 
  bake(new_data = C2CdT_training)

## # A tibble: 721 x 22
##    PARTICIPATION Process_E_LS Process_E_EX Process_E_CT I_MR1 AGE.RANGE_X18.24.~
##            <dbl>        <dbl>        <dbl>        <dbl> <fct>              <dbl>
##  1       -0.0525        0.735        0.684        0.667 N                      1
##  2       -1.66          0.735        0.684        0.667 N                      1
##  3        0.876         0.735        0.684        0.667 N                      1
##  4        1.36          0.735        0.684        0.667 N                      1
##  5        0.403         0.735        0.684        0.667 N                      1
##  6       -0.0525        0.735        0.684        0.667 N                      1
##  7        0.403         0.735        0.684        0.667 N                      1
##  8        1.36          0.735        0.684        0.667 N                      1
##  9       -1.30          0.735        0.684        0.667 N                      1
## 10        0.876         0.735        0.684        0.667 N                      1
## # ... with 711 more rows, and 16 more variables:
## #   AGE.RANGE_X25.34.years.old <dbl>, AGE.RANGE_X35.44.years.old <dbl>,
## #   AGE.RANGE_X45.54.years.old <dbl>, AGE.RANGE_X55.64.years.old <dbl>,
## #   AGE.RANGE_X65.74.years.old <dbl>, AGE.RANGE_Below.12.years.old <dbl>,
## #   GENDER_Male <dbl>, NCSS_B03A_Yes <dbl>, NCSS_B04_Yes <dbl>,
## #   NCSS_B05_X2.Primary.School <dbl>, NCSS_B05_X3.Lower.Secondary.School <dbl>,
## #   NCSS_B05_X4.Upper.Secondary.School <dbl>, ...

Model Specification

Next, we specify a decision tree classifier with the following hyperparameters:

cost_complexity: The cost complexity parameter (a.k.a. Cp or λ)
tree_depth: The maximum depth of a tree
min_n: The minimum number of data points in a node that are required for the node to be split further.

To specify a decision tree model with tidymodels, we need the decision_tree() function. The hyperparameters of the model are arguments within the decision_tree() function and may be set to specify values.

However, if tuning is required, then each of these parameters must be set to tune().

We will be using the rpart engine since it will allow us to easily make plots of our decision tree models with the rpart.plot() function.

tree_model <- decision_tree(cost_complexity = tune(),
                            tree_depth = tune(),
                            min_n = tune()) %>% 
              set_engine('rpart') %>% 
              set_mode('classification')

Workflow

Next, we combine our model and recipe into a workflow to easily manage the model-building process.

## Create a grid of hyperparameter values to test
tree_workflow <- workflow() %>% 
                 add_model(tree_model) %>% 
                 add_recipe(C2CdT_recipe)

Hyperparameter Tuning

We will perform a grid search on the decision tree hyperparameters and select the best performing model based on the area under the ROC curve during cross validation.

For models with multiple hyperparameters, tidymodels has the grid_regular() function for automatically creating a tuning grid with combinations of suggested hyperparameter values.

The grid_regular() function takes the hyperparameters as arguments and a levels option. The levels option is used to determine the number of values to create for each unique hyperparameter.

The number of rows in the grid will be levels of hyperparameters

For example, with levels = 2 and 3 model hyperparameters, the grid_regular() function will produce a tuning grid with 23 = 8 hyperparameter combinations.

See the example below, where we create the tree_grid object.

## Create a grid of hyperparameter values to test
tree_grid <- grid_regular(cost_complexity(),
                          tree_depth(),
                          min_n(), 
                          levels = 2)

# View grid
tree_grid

## # A tibble: 8 x 3
##   cost_complexity tree_depth min_n
##             <dbl>      <int> <int>
## 1    0.0000000001          1     2
## 2    0.1                   1     2
## 3    0.0000000001         15     2
## 4    0.1                  15     2
## 5    0.0000000001          1    40
## 6    0.1                   1    40
## 7    0.0000000001         15    40
## 8    0.1                  15    40

Tuning Hyperparameters with tune_grid()

To find the optimal combination of hyperparameters from our tuning grid, we will use the tune_grid() function.

This function takes a model object or workflow as the first argument, cross valdidation folds for the second, and a tuning grid data frame as the third argument.

It is always sugguested to use set.seed() before using tune_grid() in order to be able to reproduce the results at a later time.

## Tune decision tree workflow
set.seed(314)

tree_tuning <- tree_workflow %>% 
               tune_grid(resamples = C2CdT_folds,
                         grid = tree_grid)

## Warning: package 'rlang' was built under R version 3.6.3

## Warning: package 'vctrs' was built under R version 3.6.3

To view the results of our hyperparameter tuning, we can use the show_best() function. We must pass the type of performance metric we would like to see into the show_best() function.

From the results below, we see that for each combination of hyperparameters in our grid, tune_grid() fit a decision tree model with that combination 5 times (since we have 5 folds in our cross validation object).

The mean column in the results below indicates the average value of the performance metric that was obtained.

## Tune decision tree workflow
## Show the top 5 best models based on roc_auc metric
tree_tuning %>% show_best('roc_auc')

## # A tibble: 5 x 9
##   cost_complexity tree_depth min_n .metric .estimator  mean     n std_err
##             <dbl>      <int> <int> <chr>   <chr>      <dbl> <int>   <dbl>
## 1    0.0000000001         15    40 roc_auc binary     0.828     5  0.0178
## 2    0.0000000001         15     2 roc_auc binary     0.758     5  0.0198
## 3    0.1                  15     2 roc_auc binary     0.653     5  0.0266
## 4    0.1                  15    40 roc_auc binary     0.653     5  0.0266
## 5    0.0000000001          1     2 roc_auc binary     0.627     5  0.0169
## # ... with 1 more variable: .config <fct>

We can use the select_best() model to select the model from our tuning results that had the best overall performance. In the code below, we specify to select the best performing model based on the roc_auc metric.

We can see which model is best by looking at the cost compexity and tree depth which provides the best roc_auc metric. In this case it is the model with cost complexity of 0.0000000001 and tree depth of 15.

## Select best model based on roc_auc
best_tree <- tree_tuning %>% 
             select_best(metric = 'roc_auc')

# View the best tree parameters
best_tree

## # A tibble: 1 x 4
##   cost_complexity tree_depth min_n .config             
##             <dbl>      <int> <int> <fct>               
## 1    0.0000000001         15    40 Preprocessor1_Model7

Finalize Workflow

The last step in hyperparameter tuning is to use finalize_workflow() to add our optimal model to our workflow object.

final_tree_workflow <- tree_workflow %>% 
                       finalize_workflow(best_tree)

Visualize Results

In order to visualize our decision tree model, we will need to manually train our workflow object with the fit() function.

This step is optional, since visualizing the model is not necessary in all applications. However, if the goal is to understand why the model is predicting certain values, then it is recommended.

The next section shows how to fit the model with last_fit() for automatically obtaining test set performance.

Fit the Model

Next we fit our workflow to the training data. This is done by passing our workflow object to the fit() function.

tree_wf_fit <- final_tree_workflow %>% 
               fit(data = C2CdT_training)

Exploring our Trained Model

Once we have trained our model on our training data, we can study variable importance with the vip() function.

The first step is to extract the trained model from our workflow fit, tree_wf fit. This can be done by passing tree_wf fit to the pull_workflow_fit() function.

tree_fit <- tree_wf_fit %>% 
            pull_workflow_fit()

## Warning: `pull_workflow_fit()` was deprecated in workflows 0.2.3.
## Please use `extract_fit_parsnip()` instead.

Variable Importance

Next we pass tree_fit to the vip() function. This will return a ggplot object with the variable importance scores from our model. The importance scores are based on the splitting criteria of the trained decision tree.

From the results below, we can identify which variables are the most important predictors of I_MR1. From the plot, this would include PARTICIPATION, AGE RANGE, EDUCATION LEVEL and Process Indicators for LOGISTICS & EQUIPMENT as well as EVENT PERSONNEL.

vip(tree_fit)

Decision Tree Plot

We can visualize the trained decision tree by using the rpart.plot() function from the rpart.plot package. We must pass the fit object stored within tree_fit into the rpart.plot() function to make a decision tree plot.

When passing a tidymodels object into this function we also need to pass the addition parameter, roundint = FALSE.

This visualization is a tool to communicate the prediction rules of the trained decision tree (the splits in the predictor space).

Many times, the decision tree plot will be quite large and difficult to read. This is the case for our example below. There are specialized packages in R for zooming into regions of a decision tree plot. However, we do not have time to get into the details this semester.

rpart.plot(tree_fit$fit, roundint = FALSE)

Train and Evaluate With last_fit()

Next we fit our final model workflow to the training data and evaluate performance on the test data.

The last_fit() function will fit our workflow to the training data and generate predictions on the test data as defined by our C2CdT_split object.

tree_last_fit <- final_tree_workflow %>% 
                 last_fit(C2CdT_split)

We can view our performance metrics on the test data

tree_last_fit %>% collect_metrics()

## # A tibble: 2 x 4
##   .metric  .estimator .estimate .config             
##   <chr>    <chr>          <dbl> <fct>               
## 1 accuracy binary         0.743 Preprocessor1_Model1
## 2 roc_auc  binary         0.810 Preprocessor1_Model1

ROC Curve

We can plot the ROC curve to visualize test set performance of our tuned decision tree

tree_last_fit %>% collect_predictions() %>% 
                  roc_curve(truth = I_MR1, estimate = .pred_N) %>% 
                  autoplot()

Confusion Matrix

We also see the number of false negatives and false positives in our test data set by constructing a confusion matrix.

tree_predictions <- tree_last_fit %>% collect_predictions()

conf_mat(tree_predictions, truth = I_MR1, estimate = .pred_class)

##           Truth
## Prediction  N  Y
##          N 97 30
##          Y 32 82

Random Forests

In this section, we will fit a random forest model to the C2CdT data set which we have used earlier on to generate the decision tree. Random forests take decision trees and construct more powerful models in terms of prediction accuracy. The main mechanism that powers this algorithm is repeated sampling (with replacement) of the training data to produce a sequence of decision tree models. These models are then averaged to obtain a single prediction for a given value in the predictor space.

The random forest model selects a random subset of predictor variables for splitting the predictor space in the tree building process. This is done for every iteration of the algorithm, typically 100 to 2,000 times.

Data Spitting and Feature Engineering

We have already split our data into training, test, and cross validation sets as well as trained our feature engineering recipe, C2CdT_recipe. These can be reused in our random forest workflow.

Model Specification

Next, we specify a random forest classifier with the following hyperparameters:

mtry: The number of predictors that will be randomly sampled at each split when creating the tree models
trees: The number of decision trees to fit and ultimately average
min_n: The minimum number of data points in a node that are required for the node to be split further

To specify a random forest model with tidymodels, we need the rand_forest() function. The hyperparameters of the model are arguments within the rand_forest() function and may be set to specific values. However, if tuning is required, then each of these parameters must be set to tune().

We will be using the ranger engine. This engine has an optional importance argument which can be used to track variable importance measures based on the Gini index. In order to make a variable importance plot with vip(), we must add importance = ‘impurity’ inside our set_engine() function.

rf_model <- rand_forest(mtry = tune(),
                        trees = tune(),
                        min_n = tune()) %>% 
            set_engine('ranger', importance = "impurity") %>% 
            set_mode('classification')

Workflow

Next, we combine our model and recipe into a workflow to easily manage the model-building process.

rf_workflow <- workflow() %>% 
               add_model(rf_model) %>% 
               add_recipe(C2CdT_recipe)

Hyperparameter Tuning

Random Grid Search

We will perform a grid search on the random forest hyperparameters and select the best performing model based on the area under the ROC curve during cross validation.

In the previous section, we used grid_regular() to create a grid of hyperparameter values. This created a regualr grid of recommended default values.

Another way to do hyperparameter tuning is by creating a random grid of values. Many studies have shown that this method does better than the regular grid method.

Random grid search is implemented with the grid_random() function in tidymodels. Like grid_regular() it takes a sequence of hyperparameter names to create the grid. It also has a size parameter that specifies the number of random combinations to create.

The mtry() hyperparameter requires a pre-set range of values to test since it cannot exceed the number of columns in our data. When we add this to grid_random() we can pass mtry() into the range_set() function and set a range for the hyperparameter with a numeric vector.

In the code below, we set the range from 2 to 9. This is because we have 10 columns in C2CdT and we would like to test mtry() values somewhere in the middle between 1 and 10, trying to avoid values close to the ends.

When using grid_random(), it is suggested to use set.seed() for reproducability.

## Create a grid of hyperparameter values to test

set.seed(314)

rf_grid <- grid_random(mtry() %>% range_set(c(2, 9)),
                       trees(),
                       min_n(),
                       size = 10)

# View grid
rf_grid

## # A tibble: 10 x 3
##     mtry trees min_n
##    <int> <int> <int>
##  1     7  1644    34
##  2     9  1440    20
##  3     5  1549    31
##  4     4  1734    39
##  5     5   332    11
##  6     4  1064     3
##  7     8   218    37
##  8     3  1304    24
##  9     7   477    32
## 10     8  1621     6

Tuning Hyperparameters with tune_grid()

To find the optimal combination of hyperparameters from our tuning grid, we will use the tune_grid() function.

## Tune random forest workflow
set.seed(314)

rf_tuning <- rf_workflow %>% 
             tune_grid(resamples = C2CdT_folds,
                       grid = rf_grid)

To view the results of our hyperparameter tuning, we can use the show_best() function. We must pass the type of performance metric we would like to see into the show_best() function.

## Show the top 5 best models based on roc_auc metric
rf_tuning %>% show_best('roc_auc')

## # A tibble: 5 x 9
##    mtry trees min_n .metric .estimator  mean     n std_err .config              
##   <int> <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <fct>                
## 1     9  1440    20 roc_auc binary     0.855     5  0.0139 Preprocessor1_Model02
## 2     7   477    32 roc_auc binary     0.853     5  0.0167 Preprocessor1_Model09
## 3     5   332    11 roc_auc binary     0.852     5  0.0162 Preprocessor1_Model05
## 4     7  1644    34 roc_auc binary     0.851     5  0.0168 Preprocessor1_Model01
## 5     8   218    37 roc_auc binary     0.851     5  0.0182 Preprocessor1_Model07

We can use the select_best() model to select the model from our tuning results that had the best overall performance. In the code below, we specify to select the best performing model based on the roc_auc metric.

## Select best model based on roc_auc
best_rf <- rf_tuning %>% 
           select_best(metric = 'roc_auc')

# View the best parameters
best_rf

## # A tibble: 1 x 4
##    mtry trees min_n .config              
##   <int> <int> <int> <fct>                
## 1     9  1440    20 Preprocessor1_Model02

Finalize Workflow

The last step in hyperparameter tuning is to use finalize_workflow() to add our optimal model to our workflow object.

final_rf_workflow <- rf_workflow %>% 
                     finalize_workflow(best_rf)

Variable Importance

In order to visualize the variable importance scores of our random forest model, we will need to manually train our workflow object with the fit() function.

Fit the Model

Next we fit our workflow to the training data. This is done by passing our workflow object to the fit() function.

rf_wf_fit <- final_rf_workflow %>% 
             fit(data = C2CdT_training)

Once we have trained our model on our training data, we can study variable importance with the vip() function.

The first step is to extract the trained model from our workflow fit, rf_wf_fit. This can be done by passing rf_wf_fit to the pull_workflow_fit() function.

rf_fit <- rf_wf_fit %>% 
          pull_workflow_fit()

## Warning: `pull_workflow_fit()` was deprecated in workflows 0.2.3.
## Please use `extract_fit_parsnip()` instead.

Variable Importance

Next we pass rf_fit to the vip() function. This will return a ggplot object with the variable importance scores from our model. The importance scores are based on the randomly chosen predictor variables, through the mtry() hyperparameter, that had the greatest predictive power.

vip(rf_fit)

Similar to the decision tree model, the most important variables for predicting of impact on MR1 are PARTICIPATION, EDUCATION LEVEL, AGE RANGE, as well as Process Indicators on EQUIPMENT & LOGISTICS as well as PROGRAMME PERSONNEL.

Train and Evaluate With last_fit()

Next we fit our final model workflow to the training data and evaluate performance on the test data.

The last_fit() function will fit our workflow to the training data and generate predictions on the test data as defined by our C2CdT_split object.

rf_last_fit <- final_rf_workflow %>% 
               last_fit(C2CdT_split)

We can view our performance metrics on the test data

rf_last_fit %>% collect_metrics()

## # A tibble: 2 x 4
##   .metric  .estimator .estimate .config             
##   <chr>    <chr>          <dbl> <fct>               
## 1 accuracy binary         0.776 Preprocessor1_Model1
## 2 roc_auc  binary         0.822 Preprocessor1_Model1

ROC Curve

We can plot the ROC curve to visualize test set performance of our random forest model.

rf_last_fit %>% collect_predictions() %>% 
                roc_curve(truth = I_MR1, estimate = .pred_N) %>% 
                autoplot()

Confusion Matrix

We can see the number of false negatives and false positives in our random forest model, performed on our test data set, and whether it outperforms the decision tree model created earlier.

rf_predictions <- rf_last_fit %>% collect_predictions()

conf_mat(rf_predictions, truth = I_MR1, estimate = .pred_class)

##           Truth
## Prediction   N   Y
##          N 109  34
##          Y  20  78

We will now proceed to look into the use of the qualitative datasets for text analytics and sentiment analysis

Introduction to Text Mining and Sentiment Analysis in R

This last section in the Journal Club will introduce you to some basics of text mining and sentiment analysis in R. For this, we will be working with an extract of qualitative feedback data from participants of various CommunityᄅCampus events in the past year.

We will focus on the following tasks in this section

Reading text data into R
Data cleaning and transformation
Identification of prominent themes occurring in the text through word frequency tables & word clouds
Determining context of the prominent themes with Word association analysis using correlation
Generation of sentiment scores using the 4 main methods [syuzhet, bing, afinn, nrc]
Interpretation of the average sentiment [positivity or negativity] in the text
Implementation of emotion classification with NRC sentiment and creation of plots to to analyze and interpret emotions found in the text.

Loading the Data and Packages

This is the code for installation of Pacman which is used to load all packages for this section. You have used it in all preceding sections (01 to 04).

install.packages("pacman",repos = "http://cran.us.r-project.org")

## Warning: package 'pacman' is in use and will not be installed

Aside from the previous packages loaded previously (pacman, psych, rio, magrittr, ggplot2, tidyverse) you will also be using the following packages in this section.

tm : for text mining operations like removing numbers, special characters, punctuations and stop words (Stop words in any language are the most commonly occurring words that have very little value for NLP and should be filtered out. Examples of stop words in English are “the”, “is”, “are”.)
SnowballC : for stemming, which is the process of reducing words to their base or root form. For example, a stemming algorithm would reduce the words “fishing”, “fished” and “fisher” to the stem “fish”.
wordcloud : for generating the word cloud plot.
RColorBrewer : for color palettes used in various plots
syuzhet - for sentiment scores and emotion classification

pacman::p_load(pacman, psych, rio, magrittr, ggplot2, tidyverse, tm, SnowballC, wordcloud, RColorBrewer, syuzhet)

I have placed the source file for at Github for the purpose of conducting the text mining and sentiment analysis for this section at the link below. This time, the file is a .txt file instead of a .csv file which was used in the preceding sections.

https://raw.githubusercontent.com/aaron-chen-angus/community2campus/main/RawDataForTextAnalytics.txt

To load the file for analysis, we will use the readLines() function in base R.

# Read the text file from the file at GitHub
text <- readLines("https://raw.githubusercontent.com/aaron-chen-angus/community2campus/main/RawDataForTextAnalytics.txt")
# Load the data as a corpus
TextDoc <- Corpus(VectorSource(text))

Cleaning up Text Data

Cleaning the text data starts with making transformations like removing special characters from the text. This is done using the tm_map() function to replace special characters like /, @ and | with a space. The next step is to remove the unnecessary whitespace and convert the text to lower case.

Then remove the stopwords. They are the most commonly occurring words in a language and have very little value in terms of gaining useful information. They should be removed before performing further analysis. Examples of stopwords in English are “the, is, at, on”. There is no single universal list of stop words used by all NLP tools. stopwords in the tm_map() function supports several languages like English, French, German, Italian, and Spanish. Please note the language names are case sensitive. I will also demonstrate how to add your own list of stopwords, which is useful in this Community2Campus example for removing non-default stop words like “team”, “company”, “health”. Next, remove numbers and punctuation.

The last step is text stemming. It is the process of reducing the word to its root form. The stemming process simplifies the word to its common origin. For example, the stemming process reduces the words “fishing”, “fished” and “fisher” to its stem “fish”. Please note stemming uses the SnowballC package.

#Replacing "/", "@" and "|" with space
toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
TextDoc <- tm_map(TextDoc, toSpace, "/")

## Warning in tm_map.SimpleCorpus(TextDoc, toSpace, "/"): transformation drops
## documents

TextDoc <- tm_map(TextDoc, toSpace, "@")

## Warning in tm_map.SimpleCorpus(TextDoc, toSpace, "@"): transformation drops
## documents

TextDoc <- tm_map(TextDoc, toSpace, "\\|")

## Warning in tm_map.SimpleCorpus(TextDoc, toSpace, "\\|"): transformation drops
## documents

# Convert the text to lower case
TextDoc <- tm_map(TextDoc, content_transformer(tolower))

## Warning in tm_map.SimpleCorpus(TextDoc, content_transformer(tolower)):
## transformation drops documents

# Remove numbers
TextDoc <- tm_map(TextDoc, removeNumbers)

## Warning in tm_map.SimpleCorpus(TextDoc, removeNumbers): transformation drops
## documents

# Remove english common stopwords
TextDoc <- tm_map(TextDoc, removeWords, stopwords("english"))

## Warning in tm_map.SimpleCorpus(TextDoc, removeWords, stopwords("english")):
## transformation drops documents

# Remove your own stop word
# specify your custom stopwords as a character vector
TextDoc <- tm_map(TextDoc, removeWords, c("s", "company", "team"))

## Warning in tm_map.SimpleCorpus(TextDoc, removeWords, c("s", "company", "team")):
## transformation drops documents

# Remove punctuations
TextDoc <- tm_map(TextDoc, removePunctuation)

## Warning in tm_map.SimpleCorpus(TextDoc, removePunctuation): transformation drops
## documents

# Eliminate extra white spaces
TextDoc <- tm_map(TextDoc, stripWhitespace)

## Warning in tm_map.SimpleCorpus(TextDoc, stripWhitespace): transformation drops
## documents

# Text stemming - which reduces words to their root form
TextDoc <- tm_map(TextDoc, stemDocument)

## Warning in tm_map.SimpleCorpus(TextDoc, stemDocument): transformation drops
## documents

Building the term document matrix

After cleaning the text data, the next step is to count the occurrence of each word, to identify popular or trending topics. Using the function TermDocumentMatrix() from the text mining package, you can build a Document Matrix – a table containing the frequency of words.

The following codes will help provide at a glance the top 5 most frequently found words in your text.

# Build a term-document matrix
TextDoc_dtm <- TermDocumentMatrix(TextDoc)
dtm_m <- as.matrix(TextDoc_dtm)

# Sort by descearing value of frequency
dtm_v <- sort(rowSums(dtm_m),decreasing=TRUE)
dtm_d <- data.frame(word = names(dtm_v),freq=dtm_v)

# Display the top 5 most frequent words
head(dtm_d, 5)

##          word freq
## good     good  125
## work     work  119
## health health   92
## feel     feel   89
## improv improv   69

We can also visualise this by plotting the top 5 most frequent words using a bar chart.

# Plot the most frequent words
barplot(dtm_d[1:5,]$freq, las = 2, names.arg = dtm_d[1:5,]$word,
        col ="lightgreen", main ="Top 5 most frequent words",
        ylab = "Word frequencies")

We could interpret the following from the bar chart above :

The most frequently occurring word is “good”. Also notice that negative words like “not” don’t feature in the bar chart, which indicates there are no negative prefixes to change the context or meaning of the word “good” ( In short, this indicates most responses don’t mention negative phrases like “not good”).
“work”, “health” and “feel” are the next three most frequently occurring words, which indicate that most people feel good about their work and their team’s health.
Finally, the root “improv” for words like “improve”, “improvement”, “improving”, etc. is also on the chart, and you need further analysis to infer if its context is positive or negative

Generate the Word Cloud

A word cloud is one of the most popular ways to visualize and analyze qualitative data. It is an image composed of keywords found within a body of text, where the size of each word indicates its frequency in that body of text.

Use the word frequency data frame (table) created previously to generate the word cloud.

Below is a brief description of the arguments used in the word cloud function

words : words to be plotted
freq – frequencies of words
min.freq : words whose frequency is at or above this threshold value is plotted (in this case, I have set it to 5)
max.words : the maximum number of words to display on the plot (in the code below, I have set it 100)
random.order : I have set it to FALSE, so the words are plotted in order of decreasing frequency
rot.per : the percentage of words that are displayed as vertical text (with 90-degree rotation). I have set it 0.40 (40 %), please feel free to adjust this setting to suit your preferences
colors : changes word colors going from lowest to highest frequencies

You can see the resulting word cloud below.

#generate word cloud
set.seed(1234)
wordcloud(words = dtm_d$word, freq = dtm_d$freq, min.freq = 5,
          max.words=100, random.order=FALSE, rot.per=0.40, 
          colors=brewer.pal(8, "Dark2"))

The word cloud shows additional words that occur frequently and could be of interest for further analysis. Words like “need”, “support”, “issu” (root for “issue(s)”, etc. could provide more context around the most frequently occurring words and help to gain a better understanding of the main themes.

Word Association

Correlation is a statistical technique that can demonstrate whether, and how strongly, pairs of variables are related. This technique can be used effectively to analyze which words occur most often in association with the most frequently occurring words in the survey responses, which helps to see the context around these words

# Find associations 
findAssocs(TextDoc_dtm, terms = c("good","work","health"), corlimit = 0.25)

## $good
##  integr synergi 
##    0.28    0.28 
## 
## $work
## togeth 
##    0.4 
## 
## $health
##    declin    happen      noth      real sentiment    suppli      wors 
##      0.29      0.29      0.29      0.29      0.29      0.29      0.29

This script shows which words are most frequently associated with the top three terms (corlimit = 0.25 is the lower limit/threshold I have set. You can set it lower to see more words, or higher to see less). The output indicates that “integr” (which is the root for word “integrity”) and “synergi” (which is the root for words “synergy”, “synergies”, etc.) and occur 28% of the time with the word “good”. You can interpret this as the context around the most frequently occurring word (“good”) is positive. Similarly, the root of the word “together” is highly correlated with the word “work”. This indicates that most responses are saying that teams “work together” and can be interpreted in a positive context.

The script can also be modified to find terms associated with words that occur at least 50 times or more, instead of having to hard code the terms in your script.

# Find associations for words that occur at least 50 times
findAssocs(TextDoc_dtm, terms = findFreqTerms(TextDoc_dtm, lowfreq = 50), corlimit = 0.25)

## $work
## togeth 
##    0.4 
## 
## $good
##  integr synergi 
##    0.28    0.28 
## 
## $health
##    declin    happen      noth      real sentiment    suppli      wors 
##      0.29      0.29      0.29      0.29      0.29      0.29      0.29 
## 
## $overal
##  bad 
## 0.26 
## 
## $great
##   journey satisfact     march      goal     pursu    toward      hard 
##      0.52      0.52      0.36      0.35      0.28      0.26      0.26 
## 
## $feel
##   across    board    harsh   system somewhat 
##     0.33     0.32     0.32     0.32     0.29 
## 
## $improv
##    room perfect   propl    thik attitud 
##    0.41    0.35    0.35    0.35    0.32

Sentiment Scores

Sentiments can be classified as positive, neutral or negative. They can also be represented on a numeric scale, to better express the degree of positive or negative strength of the sentiment contained in a body of text.

This example uses the syuzhet package for generating sentiment scores, which has four sentiment dictionaries and offers a method for accessing the sentiment extraction tool developed in the NLP group at Stanford.

The get_sentiment function accepts two arguments: a character vector (of sentences or words) and a method. The selected method determines which of the four available sentiment extraction methods will be used. The four methods are syuzhet (this is the default), bing, afinn and nrc. Each method uses a different scale and hence returns slightly different results. Please note the outcome of nrc method is more than just a numeric score, requires additional interpretations and is out of scope for this article.

The descriptions of the get_sentiment function has been sourced from the R website : https://cran.r-project.org/web/packages/syuzhet/vignettes/syuzhet-vignette.html?

# regular sentiment score using get_sentiment() function and method of your choice
# please note that different methods may have different scales
syuzhet_vector <- get_sentiment(text, method="syuzhet")

# see the first row of the vector
head(syuzhet_vector)

## [1] 2.60 4.65 2.55 1.05 1.00 0.25

# see summary statistics of the vector
summary(syuzhet_vector)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -1.450   0.900   1.600   1.883   2.650   9.000

An inspection of the syuzhet vector shows the first element has the value of 2.60. It means the sum of the sentiment scores of all meaningful words in the first response(line) in the text file, adds up to 2.60. The scale for sentiment scores using the syuzhet method is decimal and ranges from -1(indicating most negative) to +1(indicating most positive). Note that the summary statistics of the suyzhet vector show a median value of 1.6, which is above zero and can be interpreted as the overall average sentiment across all the responses is positive.

Next, we will run the same analysis for the next two methods and inspect their respective vectors.

# bing
bing_vector <- get_sentiment(text, method="bing")
head(bing_vector)

## [1]  3  1  4 -1  1  1

summary(bing_vector)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -3.000   1.000   2.000   2.007   3.000   9.000

#affin
afinn_vector <- get_sentiment(text, method="afinn")
head(afinn_vector)

## [1] 4 8 6 5 6 2

summary(afinn_vector)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -6.00    2.00    4.00    4.42    7.00   18.00

Please note the scale of sentiment scores generated by:

bing – binary scale with -1 indicating negative and +1 indicating positive sentiment
afinn – integer scale ranging from -5 to +5

The summary statistics of bing and afinn vectors also show that the Median value of Sentiment scores is above 0 and can be interpreted as the overall average sentiment across the all the responses is positive.

Because these different methods use different scales, it’s better to convert their output to a common scale before comparing them.

This basic scale conversion can be done easily using R’s built-in sign function, which converts all positive number to 1, all negative numbers to -1 and all zeros remain 0.

#compare the first row of each vector using sign function
rbind(
  sign(head(syuzhet_vector)),
  sign(head(bing_vector)),
  sign(head(afinn_vector))
)

##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    1    1    1    1    1    1
## [2,]    1    1    1   -1    1    1
## [3,]    1    1    1    1    1    1

Note the first element of each row (vector) is 1, indicating that all three methods have calculated a positive sentiment score, for the first response (line) in the text.

Emotion Classification

Emotion classification is built on the NRC Word-Emotion Association Lexicon (aka EmoLex). The definition of “NRC Emotion Lexicon”, sourced from >http://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm> is “The NRC Emotion Lexicon is a list of English words and their associations with eight basic emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) and two sentiments (negative and positive). The annotations were manually done by crowdsourcing.”

To understand this, explore the get_nrc_sentiments function, which returns a data frame with each row representing a sentence from the original file.

The data frame has ten columns (one column for each of the eight emotions, one column for positive sentiment valence and one for negative sentiment valence).

The data in the columns (anger, anticipation, disgust, fear, joy, sadness, surprise, trust, negative, positive) can be accessed individually or in sets.

The definition of get_nrc_sentiments has been sourced from the R website: https://cran.r-project.org/web/packages/syuzhet/vignettes/syuzhet-vignette.html?

# run nrc sentiment analysis to return data frame with each row classified as one of the following emotions, rather than a score: 
# anger, anticipation, disgust, fear, joy, sadness, surprise, trust 
# It also counts the number of positive and negative emotions found in each row
d <- get_nrc_sentiment(text)

# head(d,10) - to see top 10 lines of the get_nrc_sentiment dataframe
head (d,10)

##    anger anticipation disgust fear joy sadness surprise trust negative positive
## 1      0            1       0    0   1       0        0     2        1        2
## 2      0            3       0    1   0       0        0     1        1        5
## 3      0            1       0    0   1       0        0     1        0        2
## 4      0            3       0    0   2       1        1     3        2        4
## 5      0            2       0    0   2       0        1     4        1        3
## 6      0            0       0    0   0       0        0     0        0        1
## 7      0            2       0    0   2       0        0     4        0        6
## 8      0            4       0    0   4       0        1     4        0        5
## 9      0            3       0    0   3       0        1     3        0        5
## 10     1            1       0    1   0       0        1     1        1        3

The output shows that the first line of text has;

Zero occurrences of words associated with emotions of anger, disgust, fear, sadness and surprise
One occurrence each of words associated with emotions of anticipation and joy
Two occurrences of words associated with emotions of trust
Total of one occurrence of words associated with negative emotions
Total of two occurrences of words associated with positive emotions

The next step is to create two plots charts to help visually analyze the emotions in this survey text. First, we can perform some data transformation and clean-up steps before plotting charts.

The first plot shows the total number of instances of words in the text, associated with each of the eight emotions.

#transpose
td<-data.frame(t(d))

#The function rowSums computes column sums across rows for each level of a grouping variable.
td_new <- data.frame(rowSums(td[2:253]))

#Transformation and cleaning
names(td_new)[1] <- "count"
td_new <- cbind("sentiment" = rownames(td_new), td_new)
rownames(td_new) <- NULL
td_new2<-td_new[1:8,]

#Plot One - count of words associated with each sentiment
quickplot(sentiment, data=td_new2, weight=count, geom="bar", fill=sentiment, ylab="count")+ggtitle("Survey sentiments")

This bar chart demonstrates that words associated with the positive emotion of “trust” occurred about five hundred times in the text, whereas words associated with the negative emotion of “disgust” occurred less than 25 times.

A deeper understanding of the overall emotions occurring in the survey response can be gained by comparing these number as a percentage of the total number of meaningful words.

#Plot Two - count of words associated with each sentiment, expressed as a percentage
barplot(
  sort(colSums(prop.table(d[, 1:8]))), 
  horiz = TRUE, 
  cex.names = 0.7, 
  las = 1, 
  main = "Emotions in Text", xlab="Percentage"
)

This bar plot allows for a quick and easy comparison of the proportion of words associated with each emotion in the text. The emotion “trust” has the longest bar and shows that words associated with this positive emotion constitute just over 35% of all the meaningful words in this text.

On the other hand, the emotion of “disgust” has the shortest bar and shows that words associated with this negative emotion constitute less than 2% of all the meaningful words in this text. Overall, words associated with the positive emotions of “trust” and “joy” account for almost 60% of the meaningful words in the text, which can be interpreted as a good sign of team health.

This brings us to the end of this Journal Club Presentation