1. Introduction
2. Missing Data Analysis
3. Results
- 3.1 Plots of Categorical Variables
- 3.2 Plots of Continuous Variables

1. Introduction

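library(dplyr)  # provides %>%, select(), mutate() and recode()

# Load the data, drop the ID column and store CONDITION as a character label
RA <- read.csv("RA_data.csv") %>%
  dplyr::select(-Subject_ID) %>%
  mutate(CONDITION = recode(CONDITION, `1` = "1", `2` = "2", `3` = "3", `4` = "4"))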

A total of 237 users participated in the research. Each person was assigned to one of four condition groups (three experimental conditions, each with a different manipulation, and one control condition) and was assessed on 26 dependent variables.

The goal of this analysis is to examine how the four conditions affect participants on the 26 measurements. A closer look at the dataset shows that variables such as \(Fear\), \(Anger\), \(Hope\), and \(Empathy\), to name a few, are Likert data, on which individuals rate a statement by level of agreement or disagreement. Variables such as \(GC\_level\_3\), \(GC\_level\_4\), \(GC\_level\_beta\), \(Social\_Distance\_SCL\) and \(Resource\_allocation\_SCL\), on the other hand, are numerical. We first examine the categorical dependent variables, using bar graphs, a Cleveland dot plot, a diverging stacked bar chart, etc. to visualize the differences between the four conditions. Then, we use ridgeline plots, parallel coordinate plots and scatterplots to understand the relationships between the numerical dependent variables.

2. Missing Data Analysis

visna(RA, sort = "b")

We visualized the missing-data patterns using the visna (VISualize NA) function. Each row represents one missing pattern, and each column represents a variable and summarizes its missing values.

A simple sum over each column gives the extent of missing values per variable. Of the 237 observations, 36 are missing the self-report emotion ratings such as Empathy, Despair and Hope. In particular, \(Surprise\) contains 25.74% NAs, and there is no \(Surprise\) data at all for \(CONDITION\) 4. This may be due to lost responses, or the survey may not have contained those questions when the first 36 participants were recruited; either way, it is a point to investigate further.
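The per-variable counts described above can be reproduced with base R. The following is a minimal sketch, assuming a small made-up data frame (`toy`, with hypothetical values) in place of `RA`:

```r
# Toy stand-in for RA (hypothetical values, not the study data)
toy <- data.frame(
  CONDITION = c("1", "1", "2", "2", "3", "3", "4", "4"),
  Surprise  = c(3, NA, 4, 2, 5, 1, NA, NA),
  Hope      = c(NA, 2, 3, 4, 5, 6, 1, 2)
)

# Number and percentage of missing values per column
na_count   <- colSums(is.na(toy))
na_percent <- round(100 * na_count / nrow(toy), 2)

# Missing values of Surprise within each condition
na_by_condition <- tapply(is.na(toy$Surprise), toy$CONDITION, sum)

print(na_count)
print(na_percent)
print(na_by_condition)
```

Splitting the count by \(CONDITION\) is what reveals a group with no recorded values at all, which a column total alone would hide.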

In the following report, only some representative categorical and numerical variables are used in each plot, but the same ideas are applicable and reproducible for a more inclusive analysis if time permits.

3. Results

3.1 Plots of Categorical Variables

3.1.1 Cleveland Dot Plot

Before digging into the details of the differences between condition groups, let’s first look at the mean values of the 26 dependent variables by \(CONDITION\). All the variables are plotted in a frame that allows one to zoom in and out, with one marker per dependent variable and condition. Hovering over a marker displays the corresponding mean value, variable name, and \(CONDITION\) group. The variables are sorted (highest on top) by the mean value for \(CONDITION\) 4 (upper panel). \(Surprise\) ranks highest only because it has no data for \(CONDITION\) 4, so its mean for that group is undefined.
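Why a group with no recorded values produces an unusable mean can be sketched in base R. The data frame `toy` below is hypothetical and only mimics a condition whose \(Surprise\) ratings are all missing:

```r
# Hypothetical ratings: condition "4" has no recorded Surprise values
toy <- data.frame(
  CONDITION = c("1", "1", "4", "4"),
  Surprise  = c(2, 4, NA, NA)
)

# Mean per condition, ignoring NAs
group_means <- tapply(toy$Surprise, toy$CONDITION, mean, na.rm = TRUE)

# The mean for condition "4" is NaN because no values remain after dropping NAs
print(group_means)
```

Any ordering based on such a value reflects the missing data rather than the participants' responses.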

Mean.depvar<-RA %>%
  group_by(CONDITION) %>%
  summarise_all(mean, na.rm=TRUE) %>%
  gather(key = "depvar", value = "Mean", GC_level_3:Policy_SCL)

cd1 <- ggplot(Mean.depvar, aes(x = Mean,
                               y = fct_reorder2(depvar, CONDITION, -Mean),
                               colour = CONDITION)) +
  geom_point() + ylab("") +
  ggtitle("Mean Value for 26 Dependent Variables by Condition")


Scale.RA<-RA
Scale.RA[,-1]<-scale(RA[,-1])

Scale.RA <- Scale.RA %>% 
  group_by(CONDITION) %>%
  summarize_all(mean,na.rm=TRUE)%>%
  gather( "depvar", "Mean",GC_level_3:Policy_SCL)

cd2 <- ggplot(Scale.RA, aes(x = Mean,
                            y = fct_reorder2(depvar, CONDITION, -Mean),
                            color = CONDITION)) +
  geom_point() + ylab("") +
  ggtitle("Scaled Mean Value Sorted by Condition 4")

ggplotly(cd1)

ggplotly(cd2)

Since those variables have very different scales, we now standardize the values for each variable in the lower panel.

The ordering of the variables differs between the upper and lower panels. For example, \(GC\_level\_4\) and \(GC\_level\_3\) rank second and third in the upper panel but fall to the bottom and the middle, respectively, in the lower panel. The upper panel compares the raw means of each dependent variable, while the lower panel shows how far each condition's mean lies from the variable's overall mean, in units of its standard deviation.
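The standardization step can be illustrated on a small example. This is a sketch with a made-up data frame (`toy`, hypothetical values) rather than the study data; it follows the same `scale()` approach as the code above and then averages the standardized values by condition:

```r
# Hypothetical data: two variables on very different scales
toy <- data.frame(
  CONDITION = c("1", "1", "2", "2"),
  small     = c(1, 2, 3, 4),
  large     = c(100, 300, 200, 400)
)

# z-score each numeric column: subtract its mean, divide by its standard deviation
z <- toy
z[, -1] <- scale(toy[, -1])

# Mean of the standardized values within each condition
z_means <- aggregate(. ~ CONDITION, data = z, FUN = mean)
print(z_means)
```

After standardizing, every variable has mean 0 and standard deviation 1, so the condition means of `small` and `large` become directly comparable even though their raw units differ by two orders of magnitude.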

More importantly, users in \(CONDITION\) 4 are more neutral on each measurement: their standardized means are close to zero. Users in \(CONDITION\) 1 and \(CONDITION\) 3, however, show opposing values: for the variables from \(Surprise\) to \(GC\_level\_4\), \(CONDITION\) 1 users are low while \(CONDITION\) 3 users are high.

3.1.2 Box Plot + Bar Graph

To look more closely at how a variable is distributed within each condition, I combined a box plot and a bar graph (a histogram per condition) in the same frame.

##reorder the multiple box plots by median
box1 <-
  ggplot(RA, aes(
  x = reorder(CONDITION, -Simon_Dual_identity_SCL, median),
  y = Simon_Dual_identity_SCL
  )) +
  geom_boxplot(fill = "lightBlue") +
  #labs(x = "") +
  ggtitle("Identification with both groups")

##add overall median line
simonorderdesc<- RA %>% group_by(CONDITION)%>% summarise(count=n())

gb<-box1 + geom_hline(yintercept = median(RA$Simon_Dual_identity_SCL),
                  color = "red")+
  annotate("text", x=1:4, y = 6, 
             label = simonorderdesc$count, color = "blue",
             size = 6) +
    ggtitle("count:") + theme_grey(14) +
    theme(plot.title = element_text(color = "blue"))+
  coord_flip()


box2 <-
  ggplot(RA, aes(
  x = reorder(CONDITION, -Angst_SCL, median),
  y = Angst_SCL
  )) +
  geom_boxplot(fill = "lightBlue") +
  labs(x = "") +
  ggtitle("Anxiety from group encounters")

##add overall median line
#box2 + geom_hline(yintercept = median(RA$Angst_SCL),
#                  color = "red")

gh <- ggplot(transform(RA, CONDITION=factor(CONDITION, levels=c("4", "3", "2", "1")))) + 
    geom_histogram(aes(x = Simon_Dual_identity_SCL, y = ..density..),
                   color = "blue", fill = "lightblue") +
facet_wrap(~CONDITION, nrow = 4, strip.position = "top") +
  theme(strip.placement = "outside",
        strip.background = element_blank(),
        strip.text = element_text(face = "bold"))

#reorder(CONDITION, -Simon_Dual_identity_SCL, median)
#grid.arrange(gb, gh, nrow=1)
g<-subplot(gb, gh)
g %>% layout(annotations = list(
  list(x = 0.5 , y = 1.1, text = "Simon_Dual_identity_SCL", showarrow = F, xref='paper', yref='paper'),
  list(x = 0.8 , y = 1.05, text = "", showarrow = F, xref='paper', yref='paper'))
)

Taking the variable \(Simon\_Dual\_identity\_SCL\) as an example, we found:

- The median of \(CONDITION\) 4 is noticeably lower than that of the other three groups.
- The medians of \(CONDITION\) 1-3 are very close to the overall median of \(Simon\_Dual\_identity\_SCL\) across all groups.
- \(CONDITION\) 1 has the smallest spread, with 6 outliers, while \(CONDITION\) 3 has the largest spread.
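The quantities read off the box plots (group medians, spread and outlier counts) can also be computed directly. A minimal sketch in base R, using hypothetical scores in a made-up data frame `toy`:

```r
# Hypothetical scores for two conditions (not the study data)
toy <- data.frame(
  CONDITION = rep(c("1", "4"), each = 6),
  score     = c(4, 4, 5, 5, 5, 12,   # condition 1: tight cluster plus one extreme value
                1, 2, 3, 3, 4, 5)    # condition 4: lower and more spread out
)

# Median and interquartile range per condition
group_median <- tapply(toy$score, toy$CONDITION, median)
group_iqr    <- tapply(toy$score, toy$CONDITION, IQR)

# Points a box plot would draw as outliers (beyond 1.5 times the box length)
n_outliers <- tapply(toy$score, toy$CONDITION,
                     function(x) length(boxplot.stats(x)$out))

print(group_median)
print(group_iqr)
print(n_outliers)
```

Here `boxplot.stats()` applies the same 1.5 times box-length rule that box plots use by default, so the counts match the points drawn beyond the whiskers.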

3.1.3 Diverging Stacked Bar Chart

Participants rated each of five emotions, \(Fear\), \(Anger\), \(Hope\), \(Empathy\) and \(Despair\), on a six-level agreement/disagreement scale, so I am interested in comparing the percentage of agreement and disagreement for each emotion across the condition groups.

lbar<-likert(CONDITION~.|group, new_df, layout=c(1, 5), 
           scales=list(y=list(relation="free")),between=list(y=1),
           strip.left=strip.custom(bg="gray90"),strip=FALSE,
           par.strip.text=list(cex=1, lines=5), #ylab=NULL, cex=1.5,
           #positive.order = TRUE,
           as.percent = TRUE, 
           main = "Five Emotion Assessments of 4 Conditions", 
           sub = "Levels",
           xlab = "percent") 
lbar

The plot has five panels, one per emotion. Participants were assigned to one of the four condition groups named in the left axis labels; the number of people in each group is given in the right axis labels. Each stacked bar spans 100% and is partitioned by the percentage of participants who chose each level, with red on the left indicating lower levels and blue on the right indicating higher levels. Judging from the first and last panels, people in \(CONDITION\) 1 report the lowest \(Anger\) and high \(Hope\), while the other three conditions have similar distributions. The second panel from the top indicates that \(CONDITION\) 1 users report the least \(Despair\) and \(CONDITION\) 2 users the most, and the fourth panel shows that \(CONDITION\) 3 users are more likely to report \(Fear\).
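The percentages that each stacked bar is partitioned into can be tabulated with base R. This sketch uses a hypothetical data frame `toy` and assumes the ratings are coded 1 (lowest) to 6 (highest), which is an assumption about the coding rather than something stated in the data:

```r
# Hypothetical Likert responses, assumed to be coded 1 (lowest) to 6 (highest)
toy <- data.frame(
  CONDITION = c("1", "1", "1", "1", "2", "2", "2", "2"),
  Anger     = c(1, 1, 2, 5, 3, 4, 5, 6)
)

# Treat the rating as an ordered factor so that unused levels are kept
toy$Anger <- factor(toy$Anger, levels = 1:6, ordered = TRUE)

# Row percentages: share of each rating level within each condition
counts   <- table(toy$CONDITION, toy$Anger)
percents <- round(100 * prop.table(counts, margin = 1), 1)
print(percents)

# Share of responses in the lower half of the scale (levels 1-3) per condition
low_share <- rowSums(percents[, 1:3])
print(low_share)
```

Using `margin = 1` normalizes within each condition, so groups of different sizes remain comparable, which is the same reason the chart draws every bar 100% wide.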

3.2 Plots of Continuous Variables

3.2.1 Ridgeline Plot

In this second part, we analyzed the numerical data of each participant in each \(CONDITION\) group.

All \(CONDITION\) groups appear to have very similar distributions with some minor differences: \(CONDITION\) 1 has the higher median for \(Symb\_Threat\_SCL\) and \(Real\_Threat\_SCL\). Overall, though, the condition groups cover the same range of data and have similar quartiles and medians.

data <- RA %>%
  dplyr::select(Simon_Dual_identity_SCL:Real_Threat_SCL, CONDITION) %>%
  gather(key = "emotion", value = "value", Simon_Dual_identity_SCL:Real_Threat_SCL)

rp1 <- ggplot(data,
       aes(x = value, y = emotion, fill = CONDITION)) +
  geom_density_ridges(scale = 1, alpha = 0.5)  +
  labs(x = "VALUE", y = "") +
  ggtitle("")

rp1

The ridgeline plot suggests that the distributions of \(Symb\_Threat\_SCL\), \(Stereotype\_SCL\) and \(Real\_Threat\_SCL\) are right-skewed, with most values concentrated in the low portion of the 1-to-6 scale. In contrast, the distributions of \(Simon\_Dual\_identity\_SCL\) and \(Angst\_SCL\) are bell-shaped, concentrated in the center (VALUE close to 4) and tapering off on both sides.
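The visual impression of skewness can be checked numerically. The sketch below defines a hypothetical helper, `skewness()`, using the moment-based coefficient (third central moment divided by the cubed standard deviation), and applies it to made-up scores; it is not part of the original analysis:

```r
# Sample skewness: third central moment divided by the cubed standard deviation
# (positive values indicate a longer right tail)
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

# Hypothetical scores on a 1-to-6 scale
right_skewed <- c(1, 1, 1, 2, 2, 3, 6)   # mass at the low end, long right tail
symmetric    <- c(2, 3, 4, 4, 4, 5, 6)   # centred near 4

skew_right <- skewness(right_skewed)
skew_sym   <- skewness(symmetric)

print(skew_right)
print(skew_sym)
```

A clearly positive coefficient would support the right-skew reading, while a value near zero is consistent with a roughly symmetric, bell-shaped distribution.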

3.2.2 Parallel Coordinate Plot

# scale = "uniminmax" rescales each variable to [0, 1] (the default is "std")
pcp1 <- ggparcoord(RA,
  columns = c(5:16, 25:27),
  alphaLines = .5,
  scale = "uniminmax",
  splineFactor = 10,
  groupColumn = 1
  )+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))


RA_subset<-dplyr::select(RA, CONDITION, 5:16, 25:27)

#condition 4
RS_4<-within(RA_subset, condition<-if_else(CONDITION=="4", "Condition 4", "Other"))  

pcp2 <-ggparcoord(RS_4[order(RS_4$condition, decreasing = F), ],
  columns = c(2:16),
  groupColumn = "condition",
  alphaLines = 0.8,
  title = "Parallel Coordinate Plot showing trends for CONDITION 4 users",
  scale = "uniminmax"
  ) +
  scale_color_manual(values = c("maroon", "gray"))+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

#condition 1
RS_1<-within(RA_subset, condition<-if_else(CONDITION=="1", "Condition 1", "Other"))  

pcp3 <-ggparcoord(RS_1[order(RS_1$condition, decreasing = F), ],
  columns = c(2:16),
  groupColumn = "condition",
  alphaLines = 0.8,
  title = "Parallel Coordinate Plot showing trends for CONDITION 1 users",
  scale = "uniminmax"
  ) +
  scale_color_manual(values = c("maroon", "gray"))+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

#condition 2
RS_2<-within(RA_subset, condition<-if_else(CONDITION=="2", "Condition 2", "Other"))  

pcp4 <-ggparcoord(RS_2[order(RS_2$condition, decreasing = F), ],
  columns = c(2:16),
  groupColumn = "condition",
  alphaLines = 0.8,
  title = "Parallel Coordinate Plot showing trends for CONDITION 2 users",
  scale = "uniminmax"
  ) +
  scale_color_manual(values = c("maroon", "gray"))+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

#condition 3
RS_3<-within(RA_subset, condition<-if_else(CONDITION=="3", "Condition 3", "Other"))  

pcp5 <-ggparcoord(RS_3[order(RS_3$condition, decreasing = F), ],
  columns = c(2:16),
  groupColumn = "condition",
  alphaLines = 0.8,
  title = "Parallel Coordinate Plot showing trends for CONDITION 3 users",
  scale = "uniminmax"
  ) +
  scale_color_manual(values = c("maroon", "gray"))+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))



grid.arrange(pcp3, pcp4)

grid.arrange(pcp5, pcp2)

We used parallel coordinate plots to look for general trends across the 15 numerical variables in each of the four conditions, with each line representing one user. In each plot, one condition group is highlighted in maroon and the remaining users are drawn in gray. From those plots, we found:

- The four plots demonstrate similar patterns, indicating little difference between the 4 \(CONDITION\) groups.
- Users tend to have opposing \(ITT\) and \(CIIM\_SCL\) values: those who are high in \(ITT\) are low in \(CIIM\_SCL\), and vice versa.
- \(Essentialism\_SCL\) and \(Resource\_allocation\_SCL\) values are positively related: users who are high in \(Essentialism\_SCL\) are usually high in \(Resource\_allocation\_SCL\).
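The plots above use `scale = "uniminmax"`, which rescales each variable so that its minimum is 0 and its maximum is 1. What that transformation does to the data can be sketched in base R; `rescale01()` is a hypothetical helper and `toy` a made-up data frame:

```r
# Rescale a numeric vector to [0, 1]: (x - min) / (max - min)
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# Hypothetical columns on different scales
toy <- data.frame(a = c(1, 3, 5), b = c(10, 40, 20))

scaled <- as.data.frame(lapply(toy, rescale01))
print(scaled)
```

Because every axis then spans the same 0-to-1 range, crossing lines between two adjacent axes point to a negative relationship and roughly parallel lines to a positive one, regardless of the original units.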

3.2.3 Scatterplot Matrix

ggpairs(RA[, 6:12])

Finally, I explored the relationships among some of the numerical dependent variables, picked at random from the 15 available. All of them are strongly correlated, with correlation coefficients greater than 0.5 in absolute value.
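The correlation coefficients shown in the scatterplot matrix can also be screened numerically. A minimal sketch in base R with a hypothetical data frame `toy`, flagging every pair of variables whose absolute Pearson correlation exceeds 0.5:

```r
# Hypothetical scores (not the study data)
toy <- data.frame(
  x = c(1, 2, 3, 4, 5, 6),
  y = c(2, 1, 4, 3, 6, 5),    # tends to rise with x
  z = c(6, 5, 4, 3, 2, 1)     # falls as x rises
)

# Pairwise Pearson correlations, using complete pairs only
r <- cor(toy, use = "pairwise.complete.obs")

# Variable pairs whose absolute correlation exceeds 0.5 (each pair counted once)
strong <- abs(r) > 0.5 & upper.tri(r)

print(round(r, 2))
print(which(strong, arr.ind = TRUE))
```

Restricting the check to the upper triangle avoids counting each pair twice and skips the diagonal, where every variable correlates perfectly with itself.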

RA exercise report

Weijia Bao

2019/6/14