03 Data Modelling and Analysis
Class on 28 August 2021
In this section, we will continue to use the Community2Campus dataset which was used in the previous section to carry out the following analyses.
Enjoy !
This is the code that installs pacman, which is used to load all the packages for this section. You have already used it in Sections 01 and 02.
## package 'pacman' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\aaron_chen_angus\AppData\Local\Temp\RtmpCgRKXg\downloaded_packages
Load the packages required for this section
## Error in get(genname, envir = envir) : object 'testthat_print' not found
You can now import the source data with the read.csv command from the CSV file that I have placed on GitHub at the link below.
https://raw.githubusercontent.com/aaron-chen-angus/community2campus/main/C2CsurveyAggregated.csv
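The import chunk itself is not shown in this handout. Here is a minimal sketch of what it could look like, assuming the data frame is called `C2Cagg` (the name that appears in the test outputs later in this section). The helper name `load_c2c` is hypothetical.

```r
# URL of the aggregated survey file (copied from the link above)
c2c_url <- "https://raw.githubusercontent.com/aaron-chen-angus/community2campus/main/C2CsurveyAggregated.csv"

# Hypothetical helper: read.csv() accepts a URL as well as a local file path
load_c2c <- function(path = c2c_url) {
  read.csv(path)
}
```

With an internet connection, `C2Cagg <- load_c2c()` followed by `names(C2Cagg)` should give a column listing like the one shown in the next step.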
Check on the output by reading the column names
## [1] "UNQ_ID" "AGE.RANGE" "GENDER" "NCSS_B03A"
## [5] "NCSS_B03B" "NCSS_B04" "NCSS_B05" "EVENT"
## [9] "PARTICIPATION" "PART_ROLE" "AFFILIATION" "Process_E_LS"
## [13] "Process_E_EX" "Process_E_CT" "Pre_MR1" "Post_MR1"
## [17] "X_MR1" "Pre_MR2" "Post_MR2" "X_MR2"
## [21] "Pre_MR3" "Post_MR3" "X_MR3" "Pre_MR4"
## [25] "Post_MR4" "X_MR4" "Pre_MR5" "Post_MR5"
## [29] "X_MR5" "Pre_AT_PT1" "Pre_AT_PT2" "Pre_AT_PT3"
## [33] "Pre_AT_WT5" "Pre_AT_CT1" "Pre_AT_CT2" "Pre_AT_CT3"
## [37] "Pre_AT_CT4" "Pre_AT_LA1" "Pre_AT_LA2" "Pre_AT_LA3"
## [41] "Pre_AT_SC6" "Pre_ST_SD1" "Pre_ST_SD2" "Pre_ST_SD3"
## [45] "Pre_ST_SD4" "Pre_NCSS_P2" "Pre_AT_WT2" "Pre_AT_WT3"
## [49] "Pre_AT_WT4" "Pre_AT_LA4" "Pre_AT_WT1" "Pre_AT_CT5"
## [53] "Pre_AT_LA5" "Pre_AT_SC2" "Pre_ST_SD5" "Pre_ST_SR2"
## [57] "Pre_ST_SR1" "Pre_ST_SR3" "Pre_ST_SR4" "Pre_ST_SR5"
## [61] "Pre_NCSS_P1" "Pre_NCSS_P3" "Pre_NCSS_P4" "Pre_NCSS_P5"
## [65] "Pre_AT_SC1" "Pre_AT_SC3" "Pre_AT_SC4" "Pre_AT_SC5"
## [69] "Post_AT_PT1" "Post_AT_PT2" "Post_AT_PT3" "Post_AT_WT5"
## [73] "Post_AT_CT1" "Post_AT_CT2" "Post_AT_CT3" "Post_AT_CT4"
## [77] "Post_AT_LA1" "Post_AT_LA2" "Post_AT_LA3" "Post_ST_SD1"
## [81] "Post_ST_SD2" "Post_ST_SD3" "Post_ST_SD4" "Post_AT_SC6"
## [85] "Post_NCSS_P2" "Post_AT_WT2" "Post_AT_WT3" "Post_AT_WT4"
## [89] "Post_AT_LA4" "Post_AT_WT1" "Post_AT_CT5" "Post_AT_LA5"
## [93] "Post_AT_SC2" "Post_ST_SD5" "Post_ST_SR2" "Post_ST_SR1"
## [97] "Post_ST_SR3" "Post_ST_SR4" "Post_ST_SR5" "Post_NCSS_P1"
## [101] "Post_NCSS_P3" "Post_NCSS_P4" "Post_NCSS_P5" "Post_AT_SC1"
## [105] "Post_AT_SC3" "Post_AT_SC4" "Post_AT_SC5"
The paired T-Test is a useful statistical technique for comparing pre- and post- data measured on the same participants. It is applicable when we have parametric data, that is, when the paired differences are approximately normally distributed.
We will now carry out a paired T-Test (with the assumption of normality) on the Pre- vs Post- values for the Aggregated Factors in the C2C survey.
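The chunks that produced the outputs in this part are not shown. Judging by the `data:` lines in those outputs, each one is a call of the form `t.test(C2Cagg$Pre_MR1, C2Cagg$Post_MR1, paired = TRUE)`, preceded by the two means. The sketch below wraps that pattern in a hypothetical helper, `paired_t`, and assumes the columns follow the `Pre_<factor>` / `Post_<factor>` naming seen in the column list.

```r
# Hypothetical helper: paired t-test for one aggregated factor, e.g. "MR1".
# Assumes df has columns named Pre_<factor> and Post_<factor>.
paired_t <- function(df, factor_name) {
  pre  <- df[[paste0("Pre_", factor_name)]]
  post <- df[[paste0("Post_", factor_name)]]
  list(
    mean_pre  = mean(pre, na.rm = TRUE),   # mean before the event
    mean_post = mean(post, na.rm = TRUE),  # mean after the event
    test      = t.test(pre, post, paired = TRUE)
  )
}
```

For example, `paired_t(C2Cagg, "MR1")$test` would print the paired t-test for MR1. A negative mean of the differences means the Post- scores are higher than the Pre- scores.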
Paired T-Test for MR1 Pre- and Post- Event
## [1] 2.663851
## [1] 3.213682
##
## Paired t-test
##
## data: C2Cagg$Pre_MR1 and C2Cagg$Post_MR1
## t = -18.308, df = 961, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.6087675 -0.4908946
## sample estimates:
## mean of the differences
## -0.5498311
Paired T-Test for MR2 Pre- and Post- Event
## [1] 2.834511
## [1] 2.859459
##
## Paired t-test
##
## data: C2Cagg$Pre_MR2 and C2Cagg$Post_MR2
## t = -1.7998, df = 961, p-value = 0.07221
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.052151106 0.002255056
## sample estimates:
## mean of the differences
## -0.02494802
Paired T-Test for MR3 Pre- and Post- Event
## [1] 2.926819
## [1] 3.68711
##
## Paired t-test
##
## data: C2Cagg$Pre_MR3 and C2Cagg$Post_MR3
## t = -20.62, df = 961, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.8326508 -0.6879313
## sample estimates:
## mean of the differences
## -0.7602911
Paired T-Test for MR4 Pre- and Post- Event
## [1] 2.898909
## [1] 3.015073
##
## Paired t-test
##
## data: C2Cagg$Pre_MR4 and C2Cagg$Post_MR4
## t = -6.9026, df = 961, p-value = 9.265e-12
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.14919031 -0.08313818
## sample estimates:
## mean of the differences
## -0.1161642
Paired T-Test for MR5 Pre- and Post- Event
## [1] 2.683472
## [1] 2.97921
##
## Paired t-test
##
## data: C2Cagg$Pre_MR5 and C2Cagg$Post_MR5
## t = -12.255, df = 961, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.3430950 -0.2483811
## sample estimates:
## mean of the differences
## -0.295738
The Wilcoxon Signed-Rank Test is used to compare pre- and post- data when we have non-parametric data.
We will now carry out a Wilcoxon Signed-Rank Test (which does not assume normality) on the Pre- vs Post- values for the Aggregated Factors in the C2C survey.
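As with the paired T-Test, the chunks are not shown, but the outputs point to `wilcox.test()` called on the Pre- and Post- columns with `paired = TRUE`. A minimal sketch, using a hypothetical helper name `wilcoxon_paired` and assuming the same `Pre_<factor>` / `Post_<factor>` column naming:

```r
# Hypothetical helper: paired Wilcoxon test for one aggregated factor
wilcoxon_paired <- function(df, factor_name) {
  pre  <- df[[paste0("Pre_", factor_name)]]
  post <- df[[paste0("Post_", factor_name)]]
  # paired = TRUE selects the signed-rank test (not the rank-sum test)
  wilcox.test(pre, post, paired = TRUE)
}
```

For example, `wilcoxon_paired(C2Cagg, "MR2")` would test MR2. With tied or zero differences, R falls back to a normal approximation with continuity correction, which is the variant named in the outputs here.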
Wilcoxon Signed-Rank Test for MR1 Pre- and Post- Event
## [1] 2.663851
## [1] 3.213682
##
## Wilcoxon signed rank test with continuity correction
##
## data: C2Cagg$Pre_MR1 and C2Cagg$Post_MR1
## V = 19053, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
Wilcoxon Signed-Rank Test for MR2 Pre- and Post- Event
## [1] 2.834511
## [1] 2.859459
##
## Wilcoxon signed rank test with continuity correction
##
## data: C2Cagg$Pre_MR2 and C2Cagg$Post_MR2
## V = 52749, p-value = 0.02623
## alternative hypothesis: true location shift is not equal to 0
Wilcoxon Signed-Rank Test for MR3 Pre- and Post- Event
## [1] 2.926819
## [1] 3.68711
##
## Wilcoxon signed rank test with continuity correction
##
## data: C2Cagg$Pre_MR3 and C2Cagg$Post_MR3
## V = 11818, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
Wilcoxon Signed-Rank Test for MR4 Pre- and Post- Event
## [1] 2.898909
## [1] 3.015073
##
## Wilcoxon signed rank test with continuity correction
##
## data: C2Cagg$Pre_MR4 and C2Cagg$Post_MR4
## V = 25499, p-value = 2.155e-12
## alternative hypothesis: true location shift is not equal to 0
Wilcoxon Signed-Rank Test for MR5 Pre- and Post- Event
## [1] 2.683472
## [1] 2.97921
##
## Wilcoxon signed rank test with continuity correction
##
## data: C2Cagg$Pre_MR5 and C2Cagg$Post_MR5
## V = 27777, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
The two sample T-Test is used to compare the means of two samples for parametric data.
Comparison of the Means of the Impact on MR1 for 2 Events
We will compare the impact of two events with reference to the aggregated factor MR1 (Stigma).
Compare the means of the MR1 impact for both events
## [1] 1.599223
## [1] 0.9166667
Perform an F test to compare two variances
MR1.WAWW <- C2Cagg$X_MR1[C2Cagg$EVENT=="12. What A Wonderful World"]
MR1.Kindred <- C2Cagg$X_MR1[C2Cagg$EVENT=="2. Kindred Spirit Series Virtual Challenge"]
var.test(x = MR1.WAWW, y = MR1.Kindred)
##
## F test to compare two variances
##
## data: MR1.WAWW and MR1.Kindred
## F = 0.65682, num df = 176, denom df = 251, p-value = 0.003031
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.5014253 0.8661124
## sample estimates:
## ratio of variances
## 0.6568221
Perform a two-sample t-Test for Unequal Variance (Welch)
##
## Welch Two Sample t-test
##
## data: MR1.WAWW and MR1.Kindred
## t = 7.7723, df = 418.28, p-value = 6.035e-14
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.5099358 0.8551772
## sample estimates:
## mean of x mean of y
## 1.5992232 0.9166667
If the Variances Were Equal, Perform This Test Instead (var.equal = TRUE)
##
## Two Sample t-test
##
## data: MR1.WAWW and MR1.Kindred
## t = 7.4952, df = 427, p-value = 3.853e-13
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.5035635 0.8615495
## sample estimates:
## mean of x mean of y
## 1.5992232 0.9166667
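The workflow in this part (F test first, then the matching t-test) can be sketched as a single function. This is only an illustration of the steps described here: the helper name `compare_two_samples` and the `alpha` cut-off of 0.05 are assumptions, not part of the original chunks.

```r
# Hypothetical helper: F test for equal variances, then the matching t-test
compare_two_samples <- function(x, y, alpha = 0.05) {
  # A small p-value from var.test() suggests the variances differ
  equal_var <- var.test(x, y)$p.value >= alpha
  # var.equal = FALSE gives Welch's test, which is also t.test()'s default;
  # var.equal = TRUE gives the pooled-variance two-sample t-test
  t.test(x, y, var.equal = equal_var)
}
```

With the vectors defined earlier, `compare_two_samples(MR1.WAWW, MR1.Kindred)` would pick Welch's test, because the F test above gives p = 0.003031.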
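The Mann-Whitney U-Test is used to compare two independent samples for non-parametric data. It works on the ranks of the observations and tests for a location shift between the two samples rather than comparing their means directly.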
Comparison of the Means of the Impact on MR3 for 2 Events
We will compare the impact of two events with reference to the aggregated factor MR3 (Health Advocacy).
Compare the means of the MR3 impact for both events
## [1] 2.155932
## [1] 1.268254
Declare the Variables in Preparation for Mann-Whitney U Test
MR3.WAWW <- C2Cagg$X_MR3[C2Cagg$EVENT=="12. What A Wonderful World"]
MR3.Kindred <- C2Cagg$X_MR3[C2Cagg$EVENT=="2. Kindred Spirit Series Virtual Challenge"]
Perform a Mann-Whitney U Test (basically an Unpaired Wilcoxon Test)
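The test call is not shown, but the output header ("Wilcoxon rank sum test") indicates `wilcox.test()` on the two vectors without `paired = TRUE`. A minimal sketch that also does the subsetting, using a hypothetical helper name `mann_whitney_by_event`:

```r
# Hypothetical helper: Mann-Whitney U test on one column for two events
mann_whitney_by_event <- function(df, column, event_a, event_b) {
  x <- df[[column]][df$EVENT == event_a]
  y <- df[[column]][df$EVENT == event_b]
  # Unpaired by default: the Wilcoxon rank-sum test is the Mann-Whitney U test
  wilcox.test(x, y)
}
```

With the vectors already declared above, the equivalent direct call is `wilcox.test(MR3.WAWW, MR3.Kindred)`.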
##
## Wilcoxon rank sum test with continuity correction
##
## data: MR3.WAWW and MR3.Kindred
## W = 31454, p-value = 3.636e-13
## alternative hypothesis: true location shift is not equal to 0
One-Way ANOVA
The one-way analysis of variance (ANOVA) is used to determine whether there are any statistically significant differences between the means of three or more independent (unrelated) groups. This is applicable for parametric data.
In this one-way ANOVA example, we will model the impact on MR1 (stigma) as a function of the type of event participated in.
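A minimal sketch of the model fit, assuming the impact score is the `X_MR1` column and the grouping variable is `EVENT`. The helper name `fit_one_way` is hypothetical. The term label `C2Cagg$EVENT` in the summary table suggests the original formula was written with the `C2Cagg$` prefix; using `data =` fits the same model with shorter labels.

```r
# Hypothetical helper: one-way ANOVA of MR1 impact by event
fit_one_way <- function(df) {
  aov(X_MR1 ~ EVENT, data = df)
}
```

For example, `one.way <- fit_one_way(C2Cagg)` followed by `summary(one.way)` would print an ANOVA table of this form.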
## Df Sum Sq Mean Sq F value Pr(>F)
## C2Cagg$EVENT 6 397.2 66.21 144.8 <2e-16 ***
## Residuals 955 436.6 0.46
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
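The model summary first lists the independent variables being tested in the model (in this case we have only one, ‘C2Cagg$EVENT’ or EVENT) and the model residuals (‘Residuals’). All of the variation that is not explained by the independent variables is called residual variance.
The rest of the values in the output table describe the independent variable and the residuals: Df is the degrees of freedom, Sum Sq is the sum of squares, Mean Sq is the sum of squares divided by the degrees of freedom, F value is the mean square of the variable divided by the mean square of the residuals, and Pr(>F) is the p-value of the F statistic.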
The p-value of the EVENT variable is low (p < 0.001), so it appears that the type of EVENT participated in has a real impact on the mental health stigma (MR1) levels.
Two-Way ANOVA
In this two-way ANOVA example, we will model the impact on MR1 (stigma) as a function of the type of event participated in, as well as the gender of the participants.
## Df Sum Sq Mean Sq F value Pr(>F)
## C2Cagg$EVENT 6 397.2 66.21 145.522 <2e-16 ***
## C2Cagg$GENDER 1 2.6 2.56 5.625 0.0179 *
## Residuals 954 434.0 0.45
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Adding interactions between variables
Sometimes you have reason to think that two of your independent variables have an interaction effect rather than an additive effect.
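In R's formula syntax, `+` adds a variable as an additive effect, while `*` adds both main effects and their interaction. A minimal sketch of the two models, assuming the `X_MR1`, `EVENT` and `GENDER` columns; the helper names are hypothetical.

```r
# Additive model: EVENT and GENDER each shift the mean independently
fit_two_way <- function(df) {
  aov(X_MR1 ~ EVENT + GENDER, data = df)
}

# Interaction model: EVENT * GENDER expands to EVENT + GENDER + EVENT:GENDER
fit_interaction <- function(df) {
  aov(X_MR1 ~ EVENT * GENDER, data = df)
}
```

For example, `two.way <- fit_two_way(C2Cagg)` and `interaction <- fit_interaction(C2Cagg)`, each followed by `summary()`, would print tables like the ones in this part.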
## Df Sum Sq Mean Sq F value Pr(>F)
## C2Cagg$EVENT 6 397.2 66.21 145.642 <2e-16 ***
## C2Cagg$GENDER 1 2.6 2.56 5.630 0.0179 *
## C2Cagg$EVENT:C2Cagg$GENDER 6 3.1 0.51 1.131 0.3420
## Residuals 948 430.9 0.45
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Finding the Best-Fit Model
There are now three different ANOVA models to explain the data. How do you decide which one to use? Usually you’ll want to use the ‘best-fit’ model – the model that best explains the variation in the dependent variable.
The Akaike information criterion (AIC) is a good test for model fit. AIC calculates the information value of each model by balancing the variation explained against the number of parameters used.
In AIC model selection, we compare the information value of each model and choose the one with the lowest AIC value (a lower number means more information explained!)
library(AICcmodavg)
model.set <- list(one.way, two.way, interaction)
model.names <- c("one.way", "two.way", "interaction")
aictab(model.set, modnames = model.names)
##
## Model selection based on AICc:
##
## K AICc Delta_AICc AICcWt Cum.Wt LL
## two.way 9 1982.57 0.00 0.81 0.81 -982.19
## one.way 8 1986.19 3.62 0.13 0.95 -985.02
## interaction 15 1988.02 5.46 0.05 1.00 -978.76
The model with the lowest AIC score (listed first in the table) is the best fit for the data, which in this case would be the two.way model.
Check for homoscedasticity
To check whether the model fits the assumption of homoscedasticity, look at the model diagnostic plots in R using the plot() function:
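A minimal sketch of the plotting step, assuming you pass in a fitted model such as `two.way`. The helper name `plot_diagnostics` is hypothetical; the 2 x 2 layout is a common convention so that all four default plots appear on one page.

```r
# Hypothetical helper: show the four default diagnostic plots on one page
plot_diagnostics <- function(model) {
  old <- par(mfrow = c(2, 2))  # 2 x 2 grid for the four plots
  on.exit(par(old))            # restore the previous layout afterwards
  plot(model)
  invisible(model)
}
```

For example, `plot_diagnostics(two.way)` draws the residuals vs fitted, normal Q-Q, scale-location and residuals vs leverage plots.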
The diagnostic plots show the unexplained variance (residuals) across the range of the observed data.
Each plot gives a specific piece of information about the model fit, but it’s enough to know that the red line representing the mean of the residuals should be horizontal and centered on zero (or on one, in the scale-location plot), meaning that there are no large outliers that would cause bias in the model.
The normal Q-Q plot compares the standardized residuals of your model with the quantiles expected under a normal distribution, so the closer the points lie to the straight reference line, the better. Strictly speaking, this plot checks the normality of the residuals rather than homoscedasticity. This Q-Q plot is very close, with only a bit of deviation.
From these diagnostic plots we can say that the model fits the assumption of homoscedasticity.
If your model doesn’t fit the assumption of homoscedasticity, you can try the Kruskal-Wallis test instead. This is discussed in the next section.
Do a post-hoc test
ANOVA tells us if there are differences among group means, but not what the differences are. To find out which groups are statistically different from one another, you can perform a Tukey’s Honestly Significant Difference (Tukey’s HSD) post-hoc test for pairwise comparisons:
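The full Tukey table is long, so it can help to keep only the pairs that differ significantly. A minimal sketch, assuming a fitted `aov` model; the helper name `significant_pairs` and the 0.05 cut-off are assumptions for illustration.

```r
# Hypothetical helper: Tukey HSD rows with adjusted p-value below alpha
significant_pairs <- function(model, term, alpha = 0.05) {
  tab <- TukeyHSD(model)[[term]]  # matrix with columns diff, lwr, upr, p adj
  tab[tab[, "p adj"] < alpha, , drop = FALSE]
}
```

`TukeyHSD(two.way)` on its own prints the complete table. Note that `term` must match the label used in the fit, which for the model in this handout is `"C2Cagg$EVENT"`.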
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = C2Cagg$X_MR1 ~ C2Cagg$EVENT + C2Cagg$GENDER, data = C2Cagg)
##
## $`C2Cagg$EVENT`
## diff
## 11. Off Centre Watch Party & Aftershow-10. C2C Social Media Feature 0.483262885
## 12. What A Wonderful World-10. C2C Social Media Feature 1.918302111
## 2. Kindred Spirit Series Virtual Challenge-10. C2C Social Media Feature 1.235745614
## 3. Bagan 2020 International Friendship Virtual Race-10. C2C Social Media Feature 0.285951739
## 3. Next Hitmaker Dance Competition-10. C2C Social Media Feature 1.787828947
## 5. NBFA Sports Model Championships 2020-10. C2C Social Media Feature 0.001771255
## 12. What A Wonderful World-11. Off Centre Watch Party & Aftershow 1.435039226
## 2. Kindred Spirit Series Virtual Challenge-11. Off Centre Watch Party & Aftershow 0.752482729
## 3. Bagan 2020 International Friendship Virtual Race-11. Off Centre Watch Party & Aftershow -0.197311146
## 3. Next Hitmaker Dance Competition-11. Off Centre Watch Party & Aftershow 1.304566062
## 5. NBFA Sports Model Championships 2020-11. Off Centre Watch Party & Aftershow -0.481491630
## 2. Kindred Spirit Series Virtual Challenge-12. What A Wonderful World -0.682556497
## 3. Bagan 2020 International Friendship Virtual Race-12. What A Wonderful World -1.632350372
## 3. Next Hitmaker Dance Competition-12. What A Wonderful World -0.130473164
## 5. NBFA Sports Model Championships 2020-12. What A Wonderful World -1.916530856
## 3. Bagan 2020 International Friendship Virtual Race-2. Kindred Spirit Series Virtual Challenge -0.949793875
## 3. Next Hitmaker Dance Competition-2. Kindred Spirit Series Virtual Challenge 0.552083333
## 5. NBFA Sports Model Championships 2020-2. Kindred Spirit Series Virtual Challenge -1.233974359
## 3. Next Hitmaker Dance Competition-3. Bagan 2020 International Friendship Virtual Race 1.501877208
## 5. NBFA Sports Model Championships 2020-3. Bagan 2020 International Friendship Virtual Race -0.284180484
## 5. NBFA Sports Model Championships 2020-3. Next Hitmaker Dance Competition -1.786057692
## lwr
## 11. Off Centre Watch Party & Aftershow-10. C2C Social Media Feature 0.12957100
## 12. What A Wonderful World-10. C2C Social Media Feature 1.56199019
## 2. Kindred Spirit Series Virtual Challenge-10. C2C Social Media Feature 0.88893172
## 3. Bagan 2020 International Friendship Virtual Race-10. C2C Social Media Feature -0.05836414
## 3. Next Hitmaker Dance Competition-10. C2C Social Media Feature 0.91234443
## 5. NBFA Sports Model Championships 2020-10. C2C Social Media Feature -0.63856954
## 12. What A Wonderful World-11. Off Centre Watch Party & Aftershow 1.22763141
## 2. Kindred Spirit Series Virtual Challenge-11. Off Centre Watch Party & Aftershow 0.56185300
## 3. Bagan 2020 International Friendship Virtual Race-11. Off Centre Watch Party & Aftershow -0.38335750
## 3. Next Hitmaker Dance Competition-11. Off Centre Watch Party & Aftershow 0.47841054
## 5. NBFA Sports Model Championships 2020-11. Off Centre Watch Party & Aftershow -1.05253992
## 2. Kindred Spirit Series Virtual Challenge-12. What A Wonderful World -0.87800453
## 3. Bagan 2020 International Friendship Virtual Race-12. What A Wonderful World -1.82333069
## 3. Next Hitmaker Dance Competition-12. What A Wonderful World -0.95775376
## 5. NBFA Sports Model Championships 2020-12. What A Wonderful World -2.48920562
## 3. Bagan 2020 International Friendship Virtual Race-2. Kindred Spirit Series Virtual Challenge -1.12240666
## 3. Next Hitmaker Dance Competition-2. Kindred Spirit Series Virtual Challenge -0.27115107
## 5. NBFA Sports Model Championships 2020-2. Kindred Spirit Series Virtual Challenge -1.80078833
## 3. Next Hitmaker Dance Competition-3. Bagan 2020 International Friendship Virtual Race 0.67969205
## 5. NBFA Sports Model Championships 2020-3. Bagan 2020 International Friendship Virtual Race -0.84946946
## 5. NBFA Sports Model Championships 2020-3. Next Hitmaker Dance Competition -2.76965890
## upr
## 11. Off Centre Watch Party & Aftershow-10. C2C Social Media Feature 0.83695477
## 12. What A Wonderful World-10. C2C Social Media Feature 2.27461403
## 2. Kindred Spirit Series Virtual Challenge-10. C2C Social Media Feature 1.58255951
## 3. Bagan 2020 International Friendship Virtual Race-10. C2C Social Media Feature 0.63026762
## 3. Next Hitmaker Dance Competition-10. C2C Social Media Feature 2.66331346
## 5. NBFA Sports Model Championships 2020-10. C2C Social Media Feature 0.64211205
## 12. What A Wonderful World-11. Off Centre Watch Party & Aftershow 1.64244705
## 2. Kindred Spirit Series Virtual Challenge-11. Off Centre Watch Party & Aftershow 0.94311246
## 3. Bagan 2020 International Friendship Virtual Race-11. Off Centre Watch Party & Aftershow -0.01126479
## 3. Next Hitmaker Dance Competition-11. Off Centre Watch Party & Aftershow 2.13072159
## 5. NBFA Sports Model Championships 2020-11. Off Centre Watch Party & Aftershow 0.08955666
## 2. Kindred Spirit Series Virtual Challenge-12. What A Wonderful World -0.48710847
## 3. Bagan 2020 International Friendship Virtual Race-12. What A Wonderful World -1.44137005
## 3. Next Hitmaker Dance Competition-12. What A Wonderful World 0.69680744
## 5. NBFA Sports Model Championships 2020-12. What A Wonderful World -1.34385609
## 3. Bagan 2020 International Friendship Virtual Race-2. Kindred Spirit Series Virtual Challenge -0.77718109
## 3. Next Hitmaker Dance Competition-2. Kindred Spirit Series Virtual Challenge 1.37531774
## 5. NBFA Sports Model Championships 2020-2. Kindred Spirit Series Virtual Challenge -0.66716039
## 3. Next Hitmaker Dance Competition-3. Bagan 2020 International Friendship Virtual Race 2.32406237
## 5. NBFA Sports Model Championships 2020-3. Bagan 2020 International Friendship Virtual Race 0.28110849
## 5. NBFA Sports Model Championships 2020-3. Next Hitmaker Dance Competition -0.80245648
## p adj
## 11. Off Centre Watch Party & Aftershow-10. C2C Social Media Feature 0.0011425
## 12. What A Wonderful World-10. C2C Social Media Feature 0.0000000
## 2. Kindred Spirit Series Virtual Challenge-10. C2C Social Media Feature 0.0000000
## 3. Bagan 2020 International Friendship Virtual Race-10. C2C Social Media Feature 0.1776993
## 3. Next Hitmaker Dance Competition-10. C2C Social Media Feature 0.0000000
## 5. NBFA Sports Model Championships 2020-10. C2C Social Media Feature 1.0000000
## 12. What A Wonderful World-11. Off Centre Watch Party & Aftershow 0.0000000
## 2. Kindred Spirit Series Virtual Challenge-11. Off Centre Watch Party & Aftershow 0.0000000
## 3. Bagan 2020 International Friendship Virtual Race-11. Off Centre Watch Party & Aftershow 0.0293993
## 3. Next Hitmaker Dance Competition-11. Off Centre Watch Party & Aftershow 0.0000718
## 5. NBFA Sports Model Championships 2020-11. Off Centre Watch Party & Aftershow 0.1635139
## 2. Kindred Spirit Series Virtual Challenge-12. What A Wonderful World 0.0000000
## 3. Bagan 2020 International Friendship Virtual Race-12. What A Wonderful World 0.0000000
## 3. Next Hitmaker Dance Competition-12. What A Wonderful World 0.9992353
## 5. NBFA Sports Model Championships 2020-12. What A Wonderful World 0.0000000
## 3. Bagan 2020 International Friendship Virtual Race-2. Kindred Spirit Series Virtual Challenge 0.0000000
## 3. Next Hitmaker Dance Competition-2. Kindred Spirit Series Virtual Challenge 0.4271694
## 5. NBFA Sports Model Championships 2020-2. Kindred Spirit Series Virtual Challenge 0.0000000
## 3. Next Hitmaker Dance Competition-3. Bagan 2020 International Friendship Virtual Race 0.0000018
## 5. NBFA Sports Model Championships 2020-3. Bagan 2020 International Friendship Virtual Race 0.7536613
## 5. NBFA Sports Model Championships 2020-3. Next Hitmaker Dance Competition 0.0000021
##
## $`C2Cagg$GENDER`
## diff lwr upr p adj
## Male-Female -0.1019972 -0.1881175 -0.01587687 0.0203216
Plot the results in a graph
Instead of printing the TukeyHSD results in a table, we’ll do it in a graph.
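A minimal sketch of the plotting step, assuming a fitted `aov` model. The helper name `plot_tukey` is hypothetical.

```r
# Hypothetical helper: plot Tukey HSD confidence intervals for one term
plot_tukey <- function(model, term) {
  tk <- TukeyHSD(model, which = term)
  plot(tk, las = 1)  # las = 1 keeps the pair labels horizontal
  invisible(tk)
}
```

In the resulting graph, any interval that does not cross zero corresponds to a pair of groups that differ significantly at the 95% family-wise confidence level.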
The Kruskal-Wallis test is a non-parametric method for testing whether multiple groups come from the same distribution.
We will use this example to model the impact on MR3 (advocacy) as a function of the type of event participated in.
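A minimal sketch of the call, assuming the `X_MR3` and `EVENT` columns. The helper name `kruskal_by_event` is hypothetical.

```r
# Hypothetical helper: Kruskal-Wallis test of one column across events
kruskal_by_event <- function(df, column = "X_MR3") {
  # First argument: the values; second argument: the grouping factor
  kruskal.test(df[[column]], as.factor(df$EVENT))
}
```

The direct equivalent is `kruskal.test(C2Cagg$X_MR3, C2Cagg$EVENT)`. The degrees of freedom equal the number of groups minus one.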
##
## Kruskal-Wallis rank sum test
##
## data: C2Cagg$X_MR3 by C2Cagg$EVENT
## Kruskal-Wallis chi-squared = 489.26, df = 6, p-value < 2.2e-16
We can now carry out a post-hoc test by applying a pairwise Wilcoxon rank test.
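A minimal sketch of the post-hoc step, assuming the same `X_MR3` and `EVENT` columns. The helper name `pairwise_wilcoxon` is hypothetical; `"holm"` is R's default adjustment and matches the method reported in the output.

```r
# Hypothetical helper: pairwise rank-sum tests with Holm-adjusted p-values
pairwise_wilcoxon <- function(df, column = "X_MR3") {
  pairwise.wilcox.test(df[[column]], df$EVENT, p.adjust.method = "holm")
}
```

The result is a lower-triangular matrix of adjusted p-values, one per pair of events. The adjustment controls the family-wise error rate across all the pairwise comparisons.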
##
## Pairwise comparisons using Wilcoxon rank sum test
##
## data: C2Cagg$X_MR3 and C2Cagg$EVENT
##
## 10. C2C Social Media Feature
## 11. Off Centre Watch Party & Aftershow 0.01154
## 12. What A Wonderful World < 2e-16
## 2. Kindred Spirit Series Virtual Challenge 1.8e-09
## 3. Bagan 2020 International Friendship Virtual Race 0.51179
## 3. Next Hitmaker Dance Competition 0.00128
## 5. NBFA Sports Model Championships 2020 1.00000
## 11. Off Centre Watch Party & Aftershow
## 11. Off Centre Watch Party & Aftershow -
## 12. What A Wonderful World < 2e-16
## 2. Kindred Spirit Series Virtual Challenge < 2e-16
## 3. Bagan 2020 International Friendship Virtual Race 9.1e-07
## 3. Next Hitmaker Dance Competition 0.00058
## 5. NBFA Sports Model Championships 2020 0.24189
## 12. What A Wonderful World
## 11. Off Centre Watch Party & Aftershow -
## 12. What A Wonderful World -
## 2. Kindred Spirit Series Virtual Challenge 5.8e-12
## 3. Bagan 2020 International Friendship Virtual Race < 2e-16
## 3. Next Hitmaker Dance Competition 0.34067
## 5. NBFA Sports Model Championships 2020 4.2e-08
## 2. Kindred Spirit Series Virtual Challenge
## 11. Off Centre Watch Party & Aftershow -
## 12. What A Wonderful World -
## 2. Kindred Spirit Series Virtual Challenge -
## 3. Bagan 2020 International Friendship Virtual Race < 2e-16
## 3. Next Hitmaker Dance Competition 1.00000
## 5. NBFA Sports Model Championships 2020 0.00064
## 3. Bagan 2020 International Friendship Virtual Race
## 11. Off Centre Watch Party & Aftershow -
## 12. What A Wonderful World -
## 2. Kindred Spirit Series Virtual Challenge -
## 3. Bagan 2020 International Friendship Virtual Race -
## 3. Next Hitmaker Dance Competition 1.4e-07
## 5. NBFA Sports Model Championships 2020 0.75094
## 3. Next Hitmaker Dance Competition
## 11. Off Centre Watch Party & Aftershow -
## 12. What A Wonderful World -
## 2. Kindred Spirit Series Virtual Challenge -
## 3. Bagan 2020 International Friendship Virtual Race -
## 3. Next Hitmaker Dance Competition -
## 5. NBFA Sports Model Championships 2020 0.00784
##
## P value adjustment method: holm
Pearson correlation (r) measures the linear dependence between two variables (x and y). It is also known as a parametric correlation test because it depends on the distribution of the data. It should be used only when x and y are from a normal distribution.
We will look at whether there is a correlation between MR1 and MR3 impact.
Q-Q Plots Visualisation with ggpubr
library(ggpubr)
x = C2Cagg$X_MR1
y = C2Cagg$X_MR3
qqplot(x, y, xlab = "MR1", ylab = "MR3", main = "Q-Q Plot")
Shapiro-Wilk test for Normality
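A minimal sketch of the normality check using `shapiro.test()`. The helper name `check_normality` and the 0.05 cut-off are assumptions for illustration.

```r
# Hypothetical helper: Shapiro-Wilk test with a simple yes/no summary
check_normality <- function(x, alpha = 0.05) {
  sw <- shapiro.test(x)
  list(
    W            = unname(sw$statistic),
    p.value      = sw$p.value,
    # The null hypothesis is normality, so a small p-value argues against it
    looks_normal = sw$p.value >= alpha
  )
}
```

For example, `check_normality(C2Cagg$X_MR1)` and `check_normality(C2Cagg$X_MR3)`. Keep in mind that with large samples the test can flag even small departures from normality.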
##
## Shapiro-Wilk normality test
##
## data: C2Cagg$X_MR1
## W = 0.85049, p-value < 2.2e-16
##
## Shapiro-Wilk normality test
##
## data: C2Cagg$X_MR3
## W = 0.81902, p-value < 2.2e-16
Pearson Correlation Test for Normally Distributed Data
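A minimal sketch of the correlation test, assuming the `X_MR1` and `X_MR3` columns. The helper name `pearson_cor` is hypothetical.

```r
# Hypothetical helper: Pearson correlation test between two columns
pearson_cor <- function(df, a = "X_MR1", b = "X_MR3") {
  cor.test(df[[a]], df[[b]], method = "pearson")
}
```

The direct equivalent is `cor.test(C2Cagg$X_MR1, C2Cagg$X_MR3, method = "pearson")`. The degrees of freedom are the number of complete pairs minus two.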
##
## Pearson's product-moment correlation
##
## data: C2Cagg$X_MR1 and C2Cagg$X_MR3
## t = 49.635, df = 960, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8295650 0.8651133
## sample estimates:
## cor
## 0.8482922
Spearman Rank Correlation is used to measure the correlation between two ranked variables, and can be applied when the normality test in the previous section fails.
Let’s try this out on the same dataset from the previous section, namely MR1 and MR3.
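A minimal sketch of the Spearman test, assuming the same two columns. The helper name `spearman_cor` is hypothetical. Survey scores usually contain many tied values, so the sketch requests the asymptotic p-value directly; this is a choice made for illustration and is not necessarily what the original chunk did.

```r
# Hypothetical helper: Spearman rank correlation test between two columns
spearman_cor <- function(df, a = "X_MR1", b = "X_MR3") {
  # exact = FALSE asks for the asymptotic p-value up front, so tied ranks
  # do not trigger the "Cannot compute exact p-value with ties" warning
  cor.test(df[[a]], df[[b]], method = "spearman", exact = FALSE)
}
```

Without `exact = FALSE`, R attempts an exact p-value, finds ties, warns, and then falls back to the same approximation, so the reported rho is unchanged.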
## Warning in cor.test.default(C2Cagg$X_MR1, C2Cagg$X_MR3, method = "spearman"):
## Cannot compute exact p-value with ties
##
## Spearman's rank correlation rho
##
## data: C2Cagg$X_MR1 and C2Cagg$X_MR3
## S = 32783963, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.7790531
Congratulations !
You have completed session 3 of the S3729C Data Analytics Seminar.