S3729C Data Analytics Seminar

03 Data Modelling and Analysis

Class on 28 August 2021

Data Modelling and Analysis

In this section, we will continue to use the Community2Campus dataset from the previous section to carry out the following analyses.

Paired T-Test
Wilcoxon Signed-Rank Test
Two Sample T-Test
Mann-Whitney U-Test
One-Way and Two-Way ANOVA
Kruskal-Wallis H Test
Pearson’s Correlation Test
Spearman’s Rank Correlation Test

Enjoy!

Loading the Data and Packages

This is the code to install pacman, which is used to load all the packages for this section. You have used it in Sections 01 and 02 too.

install.packages("pacman",repos = "http://cran.us.r-project.org")

## package 'pacman' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\aaron_chen_angus\AppData\Local\Temp\RtmpCgRKXg\downloaded_packages

Load the packages required for this section

pacman: for loading/unloading packages
psych: for psychometric functions
rio: for importing data
tidyverse: for data wrangling and visualisation functions
dplyr: specifically for data wrangling
ggplot2: specifically for ggplot functions
ggpubr: for enhancement of ggplot outputs for publication
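AICcmodavg: for finding the best-fit ANOVA model
magrittr: for the %>% pipe operator
devtools: for package development tools
car: for companion functions to applied regression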

pacman::p_load(pacman, psych, rio, tidyverse, dplyr, magrittr, ggplot2, devtools, ggpubr, AICcmodavg, car)

## Error in get(genname, envir = envir) : object 'testthat_print' not found

You can now import the source data from the CSV file, which I have placed online at GitHub at the link below, using the read.csv function.

https://raw.githubusercontent.com/aaron-chen-angus/community2campus/main/C2CsurveyAggregated.csv

C2Cagg <- read.csv(file = "https://raw.githubusercontent.com/aaron-chen-angus/community2campus/main/C2CsurveyAggregated.csv", header = TRUE, sep = ",")

Check on the output by reading the column names

C2Cagg %>% colnames()

##   [1] "UNQ_ID"        "AGE.RANGE"     "GENDER"        "NCSS_B03A"    
##   [5] "NCSS_B03B"     "NCSS_B04"      "NCSS_B05"      "EVENT"        
##   [9] "PARTICIPATION" "PART_ROLE"     "AFFILIATION"   "Process_E_LS" 
##  [13] "Process_E_EX"  "Process_E_CT"  "Pre_MR1"       "Post_MR1"     
##  [17] "X_MR1"         "Pre_MR2"       "Post_MR2"      "X_MR2"        
##  [21] "Pre_MR3"       "Post_MR3"      "X_MR3"         "Pre_MR4"      
##  [25] "Post_MR4"      "X_MR4"         "Pre_MR5"       "Post_MR5"     
##  [29] "X_MR5"         "Pre_AT_PT1"    "Pre_AT_PT2"    "Pre_AT_PT3"   
##  [33] "Pre_AT_WT5"    "Pre_AT_CT1"    "Pre_AT_CT2"    "Pre_AT_CT3"   
##  [37] "Pre_AT_CT4"    "Pre_AT_LA1"    "Pre_AT_LA2"    "Pre_AT_LA3"   
##  [41] "Pre_AT_SC6"    "Pre_ST_SD1"    "Pre_ST_SD2"    "Pre_ST_SD3"   
##  [45] "Pre_ST_SD4"    "Pre_NCSS_P2"   "Pre_AT_WT2"    "Pre_AT_WT3"   
##  [49] "Pre_AT_WT4"    "Pre_AT_LA4"    "Pre_AT_WT1"    "Pre_AT_CT5"   
##  [53] "Pre_AT_LA5"    "Pre_AT_SC2"    "Pre_ST_SD5"    "Pre_ST_SR2"   
##  [57] "Pre_ST_SR1"    "Pre_ST_SR3"    "Pre_ST_SR4"    "Pre_ST_SR5"   
##  [61] "Pre_NCSS_P1"   "Pre_NCSS_P3"   "Pre_NCSS_P4"   "Pre_NCSS_P5"  
##  [65] "Pre_AT_SC1"    "Pre_AT_SC3"    "Pre_AT_SC4"    "Pre_AT_SC5"   
##  [69] "Post_AT_PT1"   "Post_AT_PT2"   "Post_AT_PT3"   "Post_AT_WT5"  
##  [73] "Post_AT_CT1"   "Post_AT_CT2"   "Post_AT_CT3"   "Post_AT_CT4"  
##  [77] "Post_AT_LA1"   "Post_AT_LA2"   "Post_AT_LA3"   "Post_ST_SD1"  
##  [81] "Post_ST_SD2"   "Post_ST_SD3"   "Post_ST_SD4"   "Post_AT_SC6"  
##  [85] "Post_NCSS_P2"  "Post_AT_WT2"   "Post_AT_WT3"   "Post_AT_WT4"  
##  [89] "Post_AT_LA4"   "Post_AT_WT1"   "Post_AT_CT5"   "Post_AT_LA5"  
##  [93] "Post_AT_SC2"   "Post_ST_SD5"   "Post_ST_SR2"   "Post_ST_SR1"  
##  [97] "Post_ST_SR3"   "Post_ST_SR4"   "Post_ST_SR5"   "Post_NCSS_P1" 
## [101] "Post_NCSS_P3"  "Post_NCSS_P4"  "Post_NCSS_P5"  "Post_AT_SC1"  
## [105] "Post_AT_SC3"   "Post_AT_SC4"   "Post_AT_SC5"

Paired T-Test

The paired T-Test is a statistical technique for comparing pre- and post- measurements taken on the same participants. It is applicable to parametric data, where the differences between the paired values are approximately normally distributed.

We will now carry out a paired T-Test (with the assumption of normality) on the Pre- vs Post- values for the Aggregated Factors in the C2C survey.
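A paired T-Test is equivalent to a one-sample T-Test on the within-participant differences. The sketch below illustrates this on simulated scores; the vectors pre and post are hypothetical stand-ins, not the C2C survey columns.

```r
# Simulated pre/post scores (hypothetical data, not the C2C survey)
set.seed(42)
n    <- 50
pre  <- rnorm(n, mean = 2.7, sd = 0.6)
post <- pre + rnorm(n, mean = 0.5, sd = 0.4)

# Paired t-test on the two columns
paired_res <- t.test(x = pre, y = post, paired = TRUE)

# Equivalent one-sample t-test on the differences
d        <- pre - post
diff_res <- t.test(d, mu = 0)

# The same statistic by hand: mean(d) / (sd(d) / sqrt(n))
t_manual <- mean(d) / (sd(d) / sqrt(n))

print(c(paired     = unname(paired_res$statistic),
        one_sample = unname(diff_res$statistic),
        manual     = t_manual))
```

All three values should agree, and the degrees of freedom equal the number of pairs minus one, which is why the outputs in this section report df = 961.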

Paired T-Test for MR1 Pre- and Post- Event

mean(C2Cagg$Pre_MR1)

## [1] 2.663851

mean(C2Cagg$Post_MR1)

## [1] 3.213682

t.test(x = C2Cagg$Pre_MR1, y = C2Cagg$Post_MR1, paired = TRUE)

## 
##  Paired t-test
## 
## data:  C2Cagg$Pre_MR1 and C2Cagg$Post_MR1
## t = -18.308, df = 961, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.6087675 -0.4908946
## sample estimates:
## mean of the differences 
##              -0.5498311

Paired T-Test for MR2 Pre- and Post- Event

mean(C2Cagg$Pre_MR2)

## [1] 2.834511

mean(C2Cagg$Post_MR2)

## [1] 2.859459

t.test(x = C2Cagg$Pre_MR2, y = C2Cagg$Post_MR2, paired = TRUE)

## 
##  Paired t-test
## 
## data:  C2Cagg$Pre_MR2 and C2Cagg$Post_MR2
## t = -1.7998, df = 961, p-value = 0.07221
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.052151106  0.002255056
## sample estimates:
## mean of the differences 
##             -0.02494802

Paired T-Test for MR3 Pre- and Post- Event

mean(C2Cagg$Pre_MR3)

## [1] 2.926819

mean(C2Cagg$Post_MR3)

## [1] 3.68711

t.test(x = C2Cagg$Pre_MR3, y = C2Cagg$Post_MR3, paired = TRUE)

## 
##  Paired t-test
## 
## data:  C2Cagg$Pre_MR3 and C2Cagg$Post_MR3
## t = -20.62, df = 961, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.8326508 -0.6879313
## sample estimates:
## mean of the differences 
##              -0.7602911

Paired T-Test for MR4 Pre- and Post- Event

mean(C2Cagg$Pre_MR4)

## [1] 2.898909

mean(C2Cagg$Post_MR4)

## [1] 3.015073

t.test(x = C2Cagg$Pre_MR4, y = C2Cagg$Post_MR4, paired = TRUE)

## 
##  Paired t-test
## 
## data:  C2Cagg$Pre_MR4 and C2Cagg$Post_MR4
## t = -6.9026, df = 961, p-value = 9.265e-12
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.14919031 -0.08313818
## sample estimates:
## mean of the differences 
##              -0.1161642

Paired T-Test for MR5 Pre- and Post- Event

mean(C2Cagg$Pre_MR5)

## [1] 2.683472

mean(C2Cagg$Post_MR5)

## [1] 2.97921

t.test(x = C2Cagg$Pre_MR5, y = C2Cagg$Post_MR5, paired = TRUE)

## 
##  Paired t-test
## 
## data:  C2Cagg$Pre_MR5 and C2Cagg$Post_MR5
## t = -12.255, df = 961, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.3430950 -0.2483811
## sample estimates:
## mean of the differences 
##               -0.295738

Wilcoxon Signed-Rank Test

The Wilcoxon Signed-Rank Test is used to compare pre- and post- data when we have non-parametric data.

We will now carry out a Wilcoxon Signed-Rank Test (with the assumption of non-normality) on the Pre- vs Post- values for the Aggregated Factors in the C2C survey.

Wilcoxon Signed-Rank Test for MR1 Pre- and Post- Event
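The V statistic reported by wilcox.test for paired data is the sum of the ranks of the absolute differences, taken over the positive differences only. A minimal sketch on simulated data (pre and post are hypothetical vectors with no ties or zero differences):

```r
# Simulated paired scores (hypothetical data, not the C2C survey)
set.seed(1)
pre  <- rnorm(30, mean = 2.7, sd = 0.6)
post <- pre + rnorm(30, mean = 0.3, sd = 0.5)

# Signed-rank test as run in this section
res <- wilcox.test(pre, post, paired = TRUE)

# V by hand: rank the absolute differences, then sum the ranks
# belonging to the positive differences
d        <- pre - post
r        <- rank(abs(d))
V_manual <- sum(r[d > 0])

print(c(V_from_test = unname(res$statistic), V_manual = V_manual))
```

Because the test works on ranks rather than raw values, it is less sensitive to skewed differences than the paired T-Test.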

mean(C2Cagg$Pre_MR1)

## [1] 2.663851

mean(C2Cagg$Post_MR1)

## [1] 3.213682

wilcox.test(C2Cagg$Pre_MR1, C2Cagg$Post_MR1, paired=TRUE)

## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  C2Cagg$Pre_MR1 and C2Cagg$Post_MR1
## V = 19053, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0

Wilcoxon Signed-Rank Test for MR2 Pre- and Post- Event

mean(C2Cagg$Pre_MR2)

## [1] 2.834511

mean(C2Cagg$Post_MR2)

## [1] 2.859459

wilcox.test(C2Cagg$Pre_MR2, C2Cagg$Post_MR2, paired=TRUE)

## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  C2Cagg$Pre_MR2 and C2Cagg$Post_MR2
## V = 52749, p-value = 0.02623
## alternative hypothesis: true location shift is not equal to 0

Wilcoxon Signed-Rank Test for MR3 Pre- and Post- Event

mean(C2Cagg$Pre_MR3)

## [1] 2.926819

mean(C2Cagg$Post_MR3)

## [1] 3.68711

wilcox.test(C2Cagg$Pre_MR3, C2Cagg$Post_MR3, paired=TRUE)

## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  C2Cagg$Pre_MR3 and C2Cagg$Post_MR3
## V = 11818, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0

Wilcoxon Signed-Rank Test for MR4 Pre- and Post- Event

mean(C2Cagg$Pre_MR4)

## [1] 2.898909

mean(C2Cagg$Post_MR4)

## [1] 3.015073

wilcox.test(C2Cagg$Pre_MR4, C2Cagg$Post_MR4, paired=TRUE)

## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  C2Cagg$Pre_MR4 and C2Cagg$Post_MR4
## V = 25499, p-value = 2.155e-12
## alternative hypothesis: true location shift is not equal to 0

Wilcoxon Signed-Rank Test for MR5 Pre- and Post- Event

mean(C2Cagg$Pre_MR5)

## [1] 2.683472

mean(C2Cagg$Post_MR5)

## [1] 2.97921

wilcox.test(C2Cagg$Pre_MR5, C2Cagg$Post_MR5, paired=TRUE)

## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  C2Cagg$Pre_MR5 and C2Cagg$Post_MR5
## V = 27777, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0

Two Sample T-Test

The two sample T-Test is used to compare the means of two independent samples for parametric data.

Comparison of the Means of the Impact on MR1 for 2 Events

We will compare the impact of two events with reference to the aggregated factor MR1 (Stigma).

Compare the means of the MR1 impact for both events

mean(C2Cagg$X_MR1[C2Cagg$EVENT=="12. What A Wonderful World"])

## [1] 1.599223

mean(C2Cagg$X_MR1[C2Cagg$EVENT=="2. Kindred Spirit Series Virtual Challenge"])

## [1] 0.9166667

Perform an F test to compare two variances

MR1.WAWW <- C2Cagg$X_MR1[C2Cagg$EVENT=="12. What A Wonderful World"]
MR1.Kindred <- C2Cagg$X_MR1[C2Cagg$EVENT=="2. Kindred Spirit Series Virtual Challenge"]
var.test(x = MR1.WAWW, y = MR1.Kindred)

## 
##  F test to compare two variances
## 
## data:  MR1.WAWW and MR1.Kindred
## F = 0.65682, num df = 176, denom df = 251, p-value = 0.003031
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.5014253 0.8661124
## sample estimates:
## ratio of variances 
##          0.6568221

Perform a two-sample t-Test for Unequal Variance (Welch)

t.test(x = MR1.WAWW, y = MR1.Kindred, var.equal = FALSE)

## 
##  Welch Two Sample t-test
## 
## data:  MR1.WAWW and MR1.Kindred
## t = 7.7723, df = 418.28, p-value = 6.035e-14
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.5099358 0.8551772
## sample estimates:
## mean of x mean of y 
## 1.5992232 0.9166667

If the Variances Were Equal, Perform this Test Instead (Student’s t-Test; R’s Default is Welch)

t.test(x = MR1.WAWW, y = MR1.Kindred, var.equal = TRUE)

## 
##  Two Sample t-test
## 
## data:  MR1.WAWW and MR1.Kindred
## t = 7.4952, df = 427, p-value = 3.853e-13
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.5035635 0.8615495
## sample estimates:
## mean of x mean of y 
## 1.5992232 0.9166667
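The two outputs above differ in their degrees of freedom (418.28 for Welch, 427 for the pooled test) because Welch's test estimates them from the two sample variances. A sketch of that calculation on simulated data follows; the vectors a and b are hypothetical, not the event subsets.

```r
# Two simulated independent samples with unequal variances (hypothetical)
set.seed(7)
a <- rnorm(40, mean = 1.6, sd = 0.8)
b <- rnorm(60, mean = 0.9, sd = 1.0)

# Squared standard error contributed by each sample
va <- var(a) / length(a)
vb <- var(b) / length(b)

# Welch t statistic and Welch-Satterthwaite degrees of freedom
t_welch  <- (mean(a) - mean(b)) / sqrt(va + vb)
df_welch <- (va + vb)^2 /
  (va^2 / (length(a) - 1) + vb^2 / (length(b) - 1))

welch   <- t.test(a, b, var.equal = FALSE)  # R's default behaviour
student <- t.test(a, b, var.equal = TRUE)   # pooled variance, df = n1 + n2 - 2

print(c(t_manual = t_welch, df_manual = df_welch))
print(c(t_welch = unname(welch$statistic), df_welch = unname(welch$parameter)))
print(c(df_student = unname(student$parameter)))
```

When the F test suggests unequal variances, as it does for MR1 here, the Welch result is generally the safer one to report.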

Mann-Whitney U-Test

The Mann-Whitney U-Test is used to compare two independent samples for non-parametric data. It works on the ranks of the observations rather than on the means.

Comparison of the Means of the Impact on MR3 for 2 Events

We will compare the impact of two events with reference to the aggregated factor MR3 (Health Advocacy).

Compare the means of the MR3 impact for both events

mean(C2Cagg$X_MR3[C2Cagg$EVENT=="12. What A Wonderful World"])

## [1] 2.155932

mean(C2Cagg$X_MR3[C2Cagg$EVENT=="2. Kindred Spirit Series Virtual Challenge"])

## [1] 1.268254

Declare the Variables in Preparation for Mann-Whitney U Test

MR3.WAWW <- C2Cagg$X_MR3[C2Cagg$EVENT=="12. What A Wonderful World"]
MR3.Kindred <- C2Cagg$X_MR3[C2Cagg$EVENT=="2. Kindred Spirit Series Virtual Challenge"]

Perform a Mann-Whitney U Test (basically an Unpaired Wilcoxon Test)

wilcox.test(MR3.WAWW, MR3.Kindred, paired=FALSE)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  MR3.WAWW and MR3.Kindred
## W = 31454, p-value = 3.636e-13
## alternative hypothesis: true location shift is not equal to 0
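The W statistic in this output can be read as the number of (x, y) pairs in which the first sample's value exceeds the second's. The sketch below checks this on simulated data without ties; x and y are hypothetical vectors.

```r
# Two simulated independent samples (hypothetical, no ties)
set.seed(3)
x <- rnorm(25, mean = 2.2)
y <- rnorm(35, mean = 1.3)

res <- wilcox.test(x, y, paired = FALSE)

# W from ranks: rank sum of x in the pooled sample minus its minimum value
r       <- rank(c(x, y))
W_ranks <- sum(r[seq_along(x)]) - length(x) * (length(x) + 1) / 2

# W as a count of pairs where x is larger than y
W_pairs <- sum(outer(x, y, ">"))

print(c(W_from_test = unname(res$statistic),
        W_ranks = W_ranks, W_pairs = W_pairs))
```

Dividing W by the number of pairs, length(x) * length(y), gives the proportion of pairs in which x is larger, which can be a useful effect-size summary.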

One-Way and Two-Way ANOVA

One-Way ANOVA

The one-way analysis of variance (ANOVA) is used to determine whether there are any statistically significant differences between the means of three or more independent (unrelated) groups. This is applicable for parametric data.

In this one-way ANOVA example, we will model the impact on MR1 (stigma) as a function of the type of event participated in.

one.way <- aov(C2Cagg$X_MR1 ~ C2Cagg$EVENT, data = C2Cagg)
summary(one.way)

##               Df Sum Sq Mean Sq F value Pr(>F)    
## C2Cagg$EVENT   6  397.2   66.21   144.8 <2e-16 ***
## Residuals    955  436.6    0.46                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The model summary first lists the independent variables being tested in the model (in this case we have only one, ‘C2Cagg$EVENT’ or EVENT) and the model residuals (‘Residuals’). All of the variation that is not explained by the independent variables is called residual variance.

The rest of the values in the output table describe the independent variable and the residuals:

The Df column displays the degrees of freedom for the independent variable (the number of levels in the variable minus 1), and the degrees of freedom for the residuals (the total number of observations minus one, minus the degrees of freedom used by the independent variables; here 962 - 1 - 6 = 955).
The Sum Sq column displays the sum of squares (a.k.a. the total variation between the group means and the overall mean).
The Mean Sq column is the mean of the sum of squares, calculated by dividing the sum of squares by the degrees of freedom for each parameter.
The F-value column is the test statistic from the F test. This is the mean square of each independent variable divided by the mean square of the residuals. The larger the F value, the more likely it is that the variation caused by the independent variable is real and not due to chance.
The Pr(>F) column is the p-value of the F-statistic. This shows how likely it is that the F-value calculated from the test would have occurred if the null hypothesis of no difference among group means were true.
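The columns described above can be reproduced by hand. The following sketch does so for a one-way ANOVA on simulated data; the data frame dat and its three groups are hypothetical.

```r
# Simulated outcome for three groups of unequal size (hypothetical data)
set.seed(123)
group <- factor(rep(c("A", "B", "C"), times = c(20, 25, 30)))
y     <- rnorm(75, mean = c(A = 1.0, B = 1.5, C = 2.2)[as.character(group)],
               sd = 0.7)
dat   <- data.frame(y = y, group = group)

fit <- aov(y ~ group, data = dat)
tab <- summary(fit)[[1]]   # the ANOVA table as a data frame

N     <- nrow(dat)
k     <- nlevels(dat$group)
grand <- mean(dat$y)
means <- tapply(dat$y, dat$group, mean)
sizes <- tapply(dat$y, dat$group, length)

# Sum Sq: between-group and residual (within-group) variation
ss_between <- sum(sizes * (means - grand)^2)
ss_within  <- sum((dat$y - means[as.character(dat$group)])^2)

# Df: levels minus one, and observations minus levels
df_between <- k - 1
df_within  <- N - k

# Mean Sq = Sum Sq / Df; F value = ratio of the mean squares
F_manual <- (ss_between / df_between) / (ss_within / df_within)
p_manual <- pf(F_manual, df_between, df_within, lower.tail = FALSE)

print(tab)
print(c(F_manual = F_manual, p_manual = p_manual))
```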

The p-value of the EVENT variable is low (p < 0.001), so it appears that the type of EVENT participated in has a real impact on the mental health stigma (MR1) levels.

Two-Way ANOVA

In this two-way ANOVA example, we will model the impact on MR1 (stigma) as a function of the type of event participated in, as well as the gender of the participants.

two.way <- aov(C2Cagg$X_MR1 ~ C2Cagg$EVENT + C2Cagg$GENDER, data = C2Cagg)
summary(two.way)

##                Df Sum Sq Mean Sq F value Pr(>F)    
## C2Cagg$EVENT    6  397.2   66.21 145.522 <2e-16 ***
## C2Cagg$GENDER   1    2.6    2.56   5.625 0.0179 *  
## Residuals     954  434.0    0.45                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Adding interactions between variables

Sometimes you have reason to think that two of your independent variables have an interaction effect rather than an additive effect.

interaction <- aov(C2Cagg$X_MR1 ~ C2Cagg$EVENT*C2Cagg$GENDER, data = C2Cagg)
summary(interaction)

##                             Df Sum Sq Mean Sq F value Pr(>F)    
## C2Cagg$EVENT                 6  397.2   66.21 145.642 <2e-16 ***
## C2Cagg$GENDER                1    2.6    2.56   5.630 0.0179 *  
## C2Cagg$EVENT:C2Cagg$GENDER   6    3.1    0.51   1.131 0.3420    
## Residuals                  948  430.9    0.45                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Finding the Best-Fit Model

There are now three different ANOVA models to explain the data. How do you decide which one to use? Usually you’ll want to use the ‘best-fit’ model – the model that best explains the variation in the dependent variable.

The Akaike information criterion (AIC) is a widely used measure of model fit. AIC calculates the information value of each model by balancing the variation explained against the number of parameters used.

In AIC model selection, we compare the information value of each model and choose the one with the lowest AIC value (a lower number means a better trade-off between fit and complexity). The aictab function used below reports AICc, a version of AIC corrected for small sample sizes.

library(AICcmodavg)
model.set <- list(one.way, two.way, interaction)
model.names <- c("one.way", "two.way", "interaction")
aictab(model.set, modnames = model.names)

## 
## Model selection based on AICc:
## 
##              K    AICc Delta_AICc AICcWt Cum.Wt      LL
## two.way      9 1982.57       0.00   0.81   0.81 -982.19
## one.way      8 1986.19       3.62   0.13   0.95 -985.02
## interaction 15 1988.02       5.46   0.05   1.00 -978.76

The model with the lowest AICc score (listed first in the table) is the best fit for the data, which in this case is the two.way model.
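The AICc, Delta_AICc and AICcWt columns can be recomputed from the K and LL columns of the table. The sketch below uses the standard formulas AIC = 2K - 2LL and AICc = AIC + 2K(K + 1) / (n - K - 1); it assumes n = 962 observations, inferred from df = 961 in the paired T-Tests, and takes K and LL as printed (rounded), so small rounding differences are expected.

```r
# Values copied from the aictab output; n is an assumption inferred
# from the paired t-test degrees of freedom (961 + 1)
n <- 962
models <- data.frame(
  model = c("two.way", "one.way", "interaction"),
  K     = c(9, 8, 15),
  LL    = c(-982.19, -985.02, -978.76)
)

# AIC and its small-sample correction
models$AIC  <- 2 * models$K - 2 * models$LL
models$AICc <- models$AIC +
  2 * models$K * (models$K + 1) / (n - models$K - 1)

# Difference from the best model, and Akaike weights
models$Delta  <- models$AICc - min(models$AICc)
models$Weight <- exp(-0.5 * models$Delta) / sum(exp(-0.5 * models$Delta))

print(models, digits = 6)
```

The interaction model has the highest log-likelihood but also the most parameters, which is why it ranks last once the penalty is applied.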

Check for homoscedasticity

To check whether the model fits the assumption of homoscedasticity, look at the model diagnostic plots in R using the plot() function:

par(mfrow=c(2,2))
plot(two.way)

par(mfrow=c(1,1))

The diagnostic plots show the unexplained variance (residuals) across the range of the observed data.

Each plot gives a specific piece of information about the model fit, but it’s enough to know that the red line representing the mean of the residuals should be horizontal and centered on zero (or on one, in the scale-location plot), meaning that there are no large outliers that would cause bias in the model.

The normal Q-Q plot compares the quantiles of the model’s standardised residuals with the quantiles of a theoretical normal distribution, so the closer the points lie to the straight reference line, the better. It checks the normality of the residuals rather than homoscedasticity. This Q-Q plot is very close, with only a bit of deviation.

From these diagnostic plots we can say that the model fits the assumption of homoscedasticity.

If your model doesn’t fit the assumption of homoscedasticity, you can try the Kruskal-Wallis test instead. This is discussed in the next section.

Do a post-hoc test

ANOVA tells us if there are differences among group means, but not what the differences are. To find out which groups are statistically different from one another, you can perform a Tukey’s Honestly Significant Difference (Tukey’s HSD) post-hoc test for pairwise comparisons:

tukey.two.way<-TukeyHSD(two.way)
tukey.two.way

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = C2Cagg$X_MR1 ~ C2Cagg$EVENT + C2Cagg$GENDER, data = C2Cagg)
## 
## $`C2Cagg$EVENT`
##                                                                                                        diff
## 11. Off Centre Watch Party & Aftershow-10. C2C Social Media Feature                             0.483262885
## 12. What A Wonderful World-10. C2C Social Media Feature                                         1.918302111
## 2. Kindred Spirit Series Virtual Challenge-10. C2C Social Media Feature                         1.235745614
## 3. Bagan 2020 International Friendship Virtual Race-10. C2C Social Media Feature                0.285951739
## 3. Next Hitmaker Dance Competition-10. C2C Social Media Feature                                 1.787828947
## 5. NBFA Sports Model Championships 2020-10. C2C Social Media Feature                            0.001771255
## 12. What A Wonderful World-11. Off Centre Watch Party & Aftershow                               1.435039226
## 2. Kindred Spirit Series Virtual Challenge-11. Off Centre Watch Party & Aftershow               0.752482729
## 3. Bagan 2020 International Friendship Virtual Race-11. Off Centre Watch Party & Aftershow     -0.197311146
## 3. Next Hitmaker Dance Competition-11. Off Centre Watch Party & Aftershow                       1.304566062
## 5. NBFA Sports Model Championships 2020-11. Off Centre Watch Party & Aftershow                 -0.481491630
## 2. Kindred Spirit Series Virtual Challenge-12. What A Wonderful World                          -0.682556497
## 3. Bagan 2020 International Friendship Virtual Race-12. What A Wonderful World                 -1.632350372
## 3. Next Hitmaker Dance Competition-12. What A Wonderful World                                  -0.130473164
## 5. NBFA Sports Model Championships 2020-12. What A Wonderful World                             -1.916530856
## 3. Bagan 2020 International Friendship Virtual Race-2. Kindred Spirit Series Virtual Challenge -0.949793875
## 3. Next Hitmaker Dance Competition-2. Kindred Spirit Series Virtual Challenge                   0.552083333
## 5. NBFA Sports Model Championships 2020-2. Kindred Spirit Series Virtual Challenge             -1.233974359
## 3. Next Hitmaker Dance Competition-3. Bagan 2020 International Friendship Virtual Race          1.501877208
## 5. NBFA Sports Model Championships 2020-3. Bagan 2020 International Friendship Virtual Race    -0.284180484
## 5. NBFA Sports Model Championships 2020-3. Next Hitmaker Dance Competition                     -1.786057692
##                                                                                                        lwr
## 11. Off Centre Watch Party & Aftershow-10. C2C Social Media Feature                             0.12957100
## 12. What A Wonderful World-10. C2C Social Media Feature                                         1.56199019
## 2. Kindred Spirit Series Virtual Challenge-10. C2C Social Media Feature                         0.88893172
## 3. Bagan 2020 International Friendship Virtual Race-10. C2C Social Media Feature               -0.05836414
## 3. Next Hitmaker Dance Competition-10. C2C Social Media Feature                                 0.91234443
## 5. NBFA Sports Model Championships 2020-10. C2C Social Media Feature                           -0.63856954
## 12. What A Wonderful World-11. Off Centre Watch Party & Aftershow                               1.22763141
## 2. Kindred Spirit Series Virtual Challenge-11. Off Centre Watch Party & Aftershow               0.56185300
## 3. Bagan 2020 International Friendship Virtual Race-11. Off Centre Watch Party & Aftershow     -0.38335750
## 3. Next Hitmaker Dance Competition-11. Off Centre Watch Party & Aftershow                       0.47841054
## 5. NBFA Sports Model Championships 2020-11. Off Centre Watch Party & Aftershow                 -1.05253992
## 2. Kindred Spirit Series Virtual Challenge-12. What A Wonderful World                          -0.87800453
## 3. Bagan 2020 International Friendship Virtual Race-12. What A Wonderful World                 -1.82333069
## 3. Next Hitmaker Dance Competition-12. What A Wonderful World                                  -0.95775376
## 5. NBFA Sports Model Championships 2020-12. What A Wonderful World                             -2.48920562
## 3. Bagan 2020 International Friendship Virtual Race-2. Kindred Spirit Series Virtual Challenge -1.12240666
## 3. Next Hitmaker Dance Competition-2. Kindred Spirit Series Virtual Challenge                  -0.27115107
## 5. NBFA Sports Model Championships 2020-2. Kindred Spirit Series Virtual Challenge             -1.80078833
## 3. Next Hitmaker Dance Competition-3. Bagan 2020 International Friendship Virtual Race          0.67969205
## 5. NBFA Sports Model Championships 2020-3. Bagan 2020 International Friendship Virtual Race    -0.84946946
## 5. NBFA Sports Model Championships 2020-3. Next Hitmaker Dance Competition                     -2.76965890
##                                                                                                        upr
## 11. Off Centre Watch Party & Aftershow-10. C2C Social Media Feature                             0.83695477
## 12. What A Wonderful World-10. C2C Social Media Feature                                         2.27461403
## 2. Kindred Spirit Series Virtual Challenge-10. C2C Social Media Feature                         1.58255951
## 3. Bagan 2020 International Friendship Virtual Race-10. C2C Social Media Feature                0.63026762
## 3. Next Hitmaker Dance Competition-10. C2C Social Media Feature                                 2.66331346
## 5. NBFA Sports Model Championships 2020-10. C2C Social Media Feature                            0.64211205
## 12. What A Wonderful World-11. Off Centre Watch Party & Aftershow                               1.64244705
## 2. Kindred Spirit Series Virtual Challenge-11. Off Centre Watch Party & Aftershow               0.94311246
## 3. Bagan 2020 International Friendship Virtual Race-11. Off Centre Watch Party & Aftershow     -0.01126479
## 3. Next Hitmaker Dance Competition-11. Off Centre Watch Party & Aftershow                       2.13072159
## 5. NBFA Sports Model Championships 2020-11. Off Centre Watch Party & Aftershow                  0.08955666
## 2. Kindred Spirit Series Virtual Challenge-12. What A Wonderful World                          -0.48710847
## 3. Bagan 2020 International Friendship Virtual Race-12. What A Wonderful World                 -1.44137005
## 3. Next Hitmaker Dance Competition-12. What A Wonderful World                                   0.69680744
## 5. NBFA Sports Model Championships 2020-12. What A Wonderful World                             -1.34385609
## 3. Bagan 2020 International Friendship Virtual Race-2. Kindred Spirit Series Virtual Challenge -0.77718109
## 3. Next Hitmaker Dance Competition-2. Kindred Spirit Series Virtual Challenge                   1.37531774
## 5. NBFA Sports Model Championships 2020-2. Kindred Spirit Series Virtual Challenge             -0.66716039
## 3. Next Hitmaker Dance Competition-3. Bagan 2020 International Friendship Virtual Race          2.32406237
## 5. NBFA Sports Model Championships 2020-3. Bagan 2020 International Friendship Virtual Race     0.28110849
## 5. NBFA Sports Model Championships 2020-3. Next Hitmaker Dance Competition                     -0.80245648
##                                                                                                    p adj
## 11. Off Centre Watch Party & Aftershow-10. C2C Social Media Feature                            0.0011425
## 12. What A Wonderful World-10. C2C Social Media Feature                                        0.0000000
## 2. Kindred Spirit Series Virtual Challenge-10. C2C Social Media Feature                        0.0000000
## 3. Bagan 2020 International Friendship Virtual Race-10. C2C Social Media Feature               0.1776993
## 3. Next Hitmaker Dance Competition-10. C2C Social Media Feature                                0.0000000
## 5. NBFA Sports Model Championships 2020-10. C2C Social Media Feature                           1.0000000
## 12. What A Wonderful World-11. Off Centre Watch Party & Aftershow                              0.0000000
## 2. Kindred Spirit Series Virtual Challenge-11. Off Centre Watch Party & Aftershow              0.0000000
## 3. Bagan 2020 International Friendship Virtual Race-11. Off Centre Watch Party & Aftershow     0.0293993
## 3. Next Hitmaker Dance Competition-11. Off Centre Watch Party & Aftershow                      0.0000718
## 5. NBFA Sports Model Championships 2020-11. Off Centre Watch Party & Aftershow                 0.1635139
## 2. Kindred Spirit Series Virtual Challenge-12. What A Wonderful World                          0.0000000
## 3. Bagan 2020 International Friendship Virtual Race-12. What A Wonderful World                 0.0000000
## 3. Next Hitmaker Dance Competition-12. What A Wonderful World                                  0.9992353
## 5. NBFA Sports Model Championships 2020-12. What A Wonderful World                             0.0000000
## 3. Bagan 2020 International Friendship Virtual Race-2. Kindred Spirit Series Virtual Challenge 0.0000000
## 3. Next Hitmaker Dance Competition-2. Kindred Spirit Series Virtual Challenge                  0.4271694
## 5. NBFA Sports Model Championships 2020-2. Kindred Spirit Series Virtual Challenge             0.0000000
## 3. Next Hitmaker Dance Competition-3. Bagan 2020 International Friendship Virtual Race         0.0000018
## 5. NBFA Sports Model Championships 2020-3. Bagan 2020 International Friendship Virtual Race    0.7536613
## 5. NBFA Sports Model Championships 2020-3. Next Hitmaker Dance Competition                     0.0000021
## 
## $`C2Cagg$GENDER`
##                   diff        lwr         upr     p adj
## Male-Female -0.1019972 -0.1881175 -0.01587687 0.0203216
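With many pairwise comparisons, it can help to keep only the rows whose adjusted p-value falls below 0.05. A small sketch on simulated data, where dat and its four groups are hypothetical:

```r
# Simulated outcome for four groups; A and B share the same true mean
set.seed(2021)
dat   <- data.frame(group = factor(rep(c("A", "B", "C", "D"), each = 30)))
dat$y <- rnorm(120,
               mean = c(A = 1, B = 1, C = 2, D = 3)[as.character(dat$group)],
               sd = 0.5)

fit <- aov(y ~ group, data = dat)
tk  <- TukeyHSD(fit)

# TukeyHSD returns one matrix per factor, with columns
# diff, lwr, upr and "p adj"
tk_mat <- tk$group
sig    <- tk_mat[tk_mat[, "p adj"] < 0.05, , drop = FALSE]

print(round(sig, 4))
```

The same filter could be applied to the EVENT element of tukey.two.way, whose name contains a dollar sign and therefore needs backticks or double square brackets to access.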

Plot the results in a graph

tukey.plot.aov<-aov(C2Cagg$X_MR1 ~ C2Cagg$EVENT:C2Cagg$GENDER, data = C2Cagg)

Instead of printing the TukeyHSD results in a table, we’ll do it in a graph.

tukey.plot.test<-TukeyHSD(tukey.plot.aov)
plot(tukey.plot.test, las = 1)

Kruskal-Wallis H Test

The Kruskal-Wallis test is a non-parametric method to test whether multiple groups are identically distributed or not.

We will use this example to model the impact on MR3 (advocacy) as a function of the type of event participated in.

kruskal.test(C2Cagg$X_MR3 ~ C2Cagg$EVENT, data = C2Cagg)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  C2Cagg$X_MR3 by C2Cagg$EVENT
## Kruskal-Wallis chi-squared = 489.26, df = 6, p-value < 2.2e-16
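The chi-squared value reported above is the H statistic, computed from the rank sums of each group. A sketch of the calculation on simulated data without ties (with ties, kruskal.test applies an additional correction); y and group are hypothetical.

```r
# Simulated outcome for three groups (hypothetical, no ties)
set.seed(99)
group <- factor(rep(c("A", "B", "C"), times = c(15, 20, 25)))
y     <- rnorm(60, mean = c(A = 1.0, B = 1.4, C = 2.0)[as.character(group)])

kw <- kruskal.test(x = y, g = group)

# H = 12 / (N (N + 1)) * sum(R_i^2 / n_i) - 3 (N + 1),
# where R_i is the rank sum and n_i the size of group i
N         <- length(y)
r         <- rank(y)
rank_sums <- tapply(r, group, sum)
sizes     <- tapply(r, group, length)
H         <- 12 / (N * (N + 1)) * sum(rank_sums^2 / sizes) - 3 * (N + 1)

# p-value from a chi-squared distribution with (groups - 1) df
p <- pchisq(H, df = nlevels(group) - 1, lower.tail = FALSE)

print(c(H_from_test = unname(kw$statistic), H_manual = H))
print(c(p_from_test = kw$p.value, p_manual = p))
```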

We can now carry out a post-hoc test by applying a pairwise Wilcoxon rank sum test.

pairwise.wilcox.test(C2Cagg$X_MR3, C2Cagg$EVENT, exact = FALSE)

## 
##  Pairwise comparisons using Wilcoxon rank sum test 
## 
## data:  C2Cagg$X_MR3 and C2Cagg$EVENT 
## 
##                                                     10. C2C Social Media Feature
## 11. Off Centre Watch Party & Aftershow              0.01154                     
## 12. What A Wonderful World                          < 2e-16                     
## 2. Kindred Spirit Series Virtual Challenge          1.8e-09                     
## 3. Bagan 2020 International Friendship Virtual Race 0.51179                     
## 3. Next Hitmaker Dance Competition                  0.00128                     
## 5. NBFA Sports Model Championships 2020             1.00000                     
##                                                     11. Off Centre Watch Party & Aftershow
## 11. Off Centre Watch Party & Aftershow              -                                     
## 12. What A Wonderful World                          < 2e-16                               
## 2. Kindred Spirit Series Virtual Challenge          < 2e-16                               
## 3. Bagan 2020 International Friendship Virtual Race 9.1e-07                               
## 3. Next Hitmaker Dance Competition                  0.00058                               
## 5. NBFA Sports Model Championships 2020             0.24189                               
##                                                     12. What A Wonderful World
## 11. Off Centre Watch Party & Aftershow              -                         
## 12. What A Wonderful World                          -                         
## 2. Kindred Spirit Series Virtual Challenge          5.8e-12                   
## 3. Bagan 2020 International Friendship Virtual Race < 2e-16                   
## 3. Next Hitmaker Dance Competition                  0.34067                   
## 5. NBFA Sports Model Championships 2020             4.2e-08                   
##                                                     2. Kindred Spirit Series Virtual Challenge
## 11. Off Centre Watch Party & Aftershow              -                                         
## 12. What A Wonderful World                          -                                         
## 2. Kindred Spirit Series Virtual Challenge          -                                         
## 3. Bagan 2020 International Friendship Virtual Race < 2e-16                                   
## 3. Next Hitmaker Dance Competition                  1.00000                                   
## 5. NBFA Sports Model Championships 2020             0.00064                                   
##                                                     3. Bagan 2020 International Friendship Virtual Race
## 11. Off Centre Watch Party & Aftershow              -                                                  
## 12. What A Wonderful World                          -                                                  
## 2. Kindred Spirit Series Virtual Challenge          -                                                  
## 3. Bagan 2020 International Friendship Virtual Race -                                                  
## 3. Next Hitmaker Dance Competition                  1.4e-07                                            
## 5. NBFA Sports Model Championships 2020             0.75094                                            
##                                                     3. Next Hitmaker Dance Competition
## 11. Off Centre Watch Party & Aftershow              -                                 
## 12. What A Wonderful World                          -                                 
## 2. Kindred Spirit Series Virtual Challenge          -                                 
## 3. Bagan 2020 International Friendship Virtual Race -                                 
## 3. Next Hitmaker Dance Competition                  -                                 
## 5. NBFA Sports Model Championships 2020             0.00784                           
## 
## P value adjustment method: holm
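The p-values in this table are Holm-adjusted, which is the default for pairwise.wilcox.test. The sketch below reproduces the adjustment by hand on a set of hypothetical raw p-values and compares it with p.adjust.

```r
# Hypothetical raw p-values from five pairwise comparisons
p_raw <- c(0.010, 0.040, 0.030, 0.005, 0.200)
m     <- length(p_raw)

# Holm: sort ascending, multiply the i-th smallest by (m - i + 1),
# enforce monotonicity with a running maximum, and cap at 1
o          <- order(p_raw)
adj_sorted <- pmin(1, cummax((m - seq_len(m) + 1) * p_raw[o]))

# Put the adjusted values back in the original order
p_holm    <- numeric(m)
p_holm[o] <- adj_sorted

print(rbind(raw = p_raw, holm = p_holm,
            p.adjust = p.adjust(p_raw, method = "holm")))
```

A different method can be requested through the p.adjust.method argument of pairwise.wilcox.test, for example "bonferroni" or "BH".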

Pearson’s Correlation Test

Pearson correlation (r) measures the linear dependence between two variables (x and y). It is also known as a parametric correlation test because it depends on the distribution of the data. It should be used only when x and y are from a normal distribution.

We will look at whether there is a correlation between MR1 and MR3 impact.

Q-Q Plot Visualisation

library(ggpubr)
x <- C2Cagg$X_MR1
y <- C2Cagg$X_MR3
# Base R two-sample Q-Q plot: quantiles of MR1 against quantiles of MR3
qqplot(x, y, xlab = "MR1", ylab = "MR3", main = "Q-Q Plot")
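The qqplot call compares the quantiles of the two variables with each other. To judge a single variable against a normal distribution, a normal Q-Q plot is one option. A minimal base R sketch on simulated data follows; x_sim is a hypothetical right-skewed stand-in, not a survey column.

```r
# Simulated right-skewed variable (hypothetical stand-in for a survey score)
set.seed(11)
x_sim <- rexp(200)

# Normal Q-Q plot: sample quantiles against theoretical normal quantiles
qq <- qqnorm(x_sim, main = "Normal Q-Q Plot (simulated)")

# Reference line through the first and third quartiles
qqline(x_sim, col = "red")
```

Points that bend away from the reference line at the ends, as they should for skewed data like this, suggest a departure from normality. The same two calls could be applied to C2Cagg$X_MR1 and C2Cagg$X_MR3 in turn.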

Shapiro-Wilk test for Normality

shapiro.test(C2Cagg$X_MR1)

## 
##  Shapiro-Wilk normality test
## 
## data:  C2Cagg$X_MR1
## W = 0.85049, p-value < 2.2e-16

shapiro.test(C2Cagg$X_MR3)

## 
##  Shapiro-Wilk normality test
## 
## data:  C2Cagg$X_MR3
## W = 0.81902, p-value < 2.2e-16

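Pearson Correlation Test for Normally Distributed Data

Both Shapiro-Wilk p-values above are far below 0.05, so normality is rejected for MR1 and MR3. The Pearson test is run here for illustration; the Spearman test in the next section is the more appropriate choice for these data.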

cor.test(C2Cagg$X_MR1, C2Cagg$X_MR3, method = "pearson")

## 
##  Pearson's product-moment correlation
## 
## data:  C2Cagg$X_MR1 and C2Cagg$X_MR3
## t = 49.635, df = 960, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8295650 0.8651133
## sample estimates:
##       cor 
## 0.8482922
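The t statistic and confidence interval in this output follow directly from r and the sample size. The sketch below recomputes them from the printed values (r = 0.8482922, df = 960) and then checks the same formula against cor.test on simulated data; xs and ys are hypothetical.

```r
# Values copied from the output above
r  <- 0.8482922
df <- 960                      # n - 2, so n = 962

# t = r * sqrt(n - 2) / sqrt(1 - r^2)
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# 95% confidence interval via Fisher's z transformation
z  <- atanh(r)
se <- 1 / sqrt((df + 2) - 3)   # 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

print(c(t = t_stat, lower = ci[1], upper = ci[2]))

# Check of the same formula on simulated data (hypothetical vectors)
set.seed(8)
xs <- rnorm(100)
ys <- 0.6 * xs + rnorm(100, sd = 0.8)
ct <- cor.test(xs, ys, method = "pearson")
t_sim <- unname(ct$estimate) * sqrt(98) / sqrt(1 - unname(ct$estimate)^2)
print(c(t_from_test = unname(ct$statistic), t_manual = t_sim))
```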

Spearman’s Rank Correlation Test

Spearman Rank Correlation is used to measure the correlation between two ranked variables, and can be applied when the normality test in the previous section fails.

Let’s try this out on the same variables from the previous section, namely MR1 and MR3.

cor.test(C2Cagg$X_MR1, C2Cagg$X_MR3, method = "spearman")

## Warning in cor.test.default(C2Cagg$X_MR1, C2Cagg$X_MR3, method = "spearman"):
## Cannot compute exact p-value with ties

## 
##  Spearman's rank correlation rho
## 
## data:  C2Cagg$X_MR1 and C2Cagg$X_MR3
## S = 32783963, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.7790531
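Spearman's rho is simply Pearson's correlation computed on the ranks of the data, and without ties the S statistic is the sum of squared rank differences. A sketch on simulated data with a monotonic but non-linear relationship (xm and ym are hypothetical):

```r
# Simulated monotonic, non-linear relationship (hypothetical, no ties)
set.seed(5)
xm <- rnorm(40)
ym <- xm^3 + rnorm(40, sd = 0.5)

sp <- cor.test(xm, ym, method = "spearman")

# rho as Pearson correlation of the ranks
rho_ranks <- cor(rank(xm), rank(ym), method = "pearson")

# rho from the rank differences: 1 - 6 * sum(d^2) / (n (n^2 - 1))
d           <- rank(xm) - rank(ym)
n           <- length(xm)
S_manual    <- sum(d^2)
rho_formula <- 1 - 6 * S_manual / (n * (n^2 - 1))

print(c(rho_from_test = unname(sp$estimate),
        rho_ranks = rho_ranks, rho_formula = rho_formula))
print(c(S_from_test = unname(sp$statistic), S_manual = S_manual))
```

The survey data contain ties, which is why the output above carries a warning that an exact p-value cannot be computed.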

Congratulations!

You have completed session 3 of the S3729C Data Analytics Seminar.