Introduction

This code-through extends my previous code-through, which explained how to create a Sankey diagram for educational research using a generated sample data set (Roach, 2022). Instead of Sankey diagrams, this code-through explores how to create and interpret correlation plots in educational research, following a similar outline and data set to bolster clarity and comprehension.

Specifically, this code-through will cover:

What educational research is and how correlation plots are valuable in educational research.
Examples of sample data collected in research exploring student leadership types present and their relationship with aspects of science-like talk during small group activities.
How to create correlation plots visualizing the relationships between student leadership type and the amount of disagreement, self-generated claims, and correct claims occurring in their conversations.
How to interpret information from these correlation plots.

Background Information

Educational Research

As previously discussed, educational research studies the factors that influence educational outcomes and learning processes. These studies can be qualitative, quantitative, or a combination of the two (Roach, 2022). Since 2020 I have worked for Washington State University’s Department of Teaching and Learning on a National Science Foundation-funded project exploring deliberative argumentation in a large-lecture undergraduate biology course. Now that the data collection stage has been completed, we are shifting to analyzing the data. One focus of analysis is how student leadership style in small groups influences student discussion throughout in-class activities. Specifically, we are examining how leadership style impacts the amount of deliberative argumentation present, as determined by the percentage of disagreement between group members and the voicing of self-generated claims. Similarly, we also want to know if a specific leadership style is related to students generating scientifically accurate and/or plausible hypotheses.

Correlation and Correlation Plots

A correlation is a statistical measure used to depict relationships between variables (Jaadi, 2019). Correlations can be extremely useful if they are interpreted correctly. However, it is important to remember that correlations are determined when using sample data, which does not always accurately represent the population from which that data was collected. Therefore any apparent correlation in a sample needs to be accompanied by a significance test in order to assess the likelihood that this correlation is not due to chance alone. Even if a correlation is found to be statistically significant at a designated confidence level, this is not an indication of causality. Just because two variables have a statistically significant relationship does not mean it can be inferred that one causes the other (Jaadi, 2019).
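As a minimal sketch of pairing a correlation with a significance test, base R's cor.test() reports both the coefficient and a p-value. The two vectors below are simulated purely for illustration (they are not the study data), and the 0.05 cutoff is an assumed significance level.

```r
# Simulated illustration only: 25 hypothetical observations of two related variables
set.seed(42)
x <- rnorm(25)
y <- 0.5 * x + rnorm(25)

# cor.test() returns the sample correlation and the p-value for H0: correlation = 0
result <- cor.test(x, y, method = "pearson")

r.value <- unname(result$estimate)  # sample correlation coefficient
p.value <- result$p.value           # p-value of the significance test
significant <- p.value < 0.05       # assumed significance level of 0.05

round(c(r = r.value, p = p.value), 3)
```

Even when the p-value falls below the chosen cutoff, the result describes an association in the sample, not a causal effect.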

A correlation’s sign, form, and size all imply different things about the relationship between two variables (Jaadi, 2019):

Correlation Signs:

Positive: both variables move in the same direction, meaning they either simultaneously increase or decrease.
Negative: the variables move inversely of each other, meaning as one increases the other decreases.
Neutral: the variables do not appear to have a relationship to one another.

Correlation Forms:

Linear: the variables change at a constant rate and graph as a straight line (y = mx + b).
Non-Linear: the variables do not change at a constant rate and graph as a curved pattern.
Monotonic: the variables consistently change in the same direction but not necessarily at a constant rate, graphing as a curve that keeps rising (or keeps falling) without reversing.
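To make the difference between linear and monotonic relationships concrete, here is a small sketch using made-up values. It compares the Pearson coefficient (which measures linear association) with the rank-based Spearman coefficient (which measures monotonic association); the vectors x and y are illustrative assumptions, not study data.

```r
# Illustrative data: y always rises with x, but at an accelerating (non-constant) rate
x <- 1:20
y <- exp(x / 4)

pearson  <- cor(x, y, method = "pearson")   # measures linear association
spearman <- cor(x, y, method = "spearman")  # measures monotonic (rank-based) association

# Spearman equals 1 because the ranks agree perfectly; Pearson is below 1
# because the points do not fall on a straight line
c(pearson = pearson, spearman = spearman)
```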

Correlation Size:

+/- 0.90-1.00: Very high correlation
+/- 0.70-0.90: High correlation
+/- 0.50-0.70: Moderate correlation
+/- 0.30-0.50: Low correlation
+/- 0.00-0.30: Negligible correlation (neutral)
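The size bands above can be turned into a small lookup. The function below is a hypothetical helper written for this code-through (it is not part of any package), and it assumes each band includes its lower bound, so a coefficient of exactly 0.50 is labeled moderate.

```r
# Hypothetical helper: label correlation coefficients using the size bands above.
# Assumption: each band includes its lower bound (e.g. 0.50 counts as "Moderate").
describe.correlation <- function(r) {
  stopifnot(is.numeric(r), all(abs(r) <= 1))
  size <- cut(abs(r),
              breaks = c(0, 0.30, 0.50, 0.70, 0.90, 1),
              labels = c("Negligible", "Low", "Moderate", "High", "Very high"),
              right = FALSE, include.lowest = TRUE)
  direction <- ifelse(r > 0, "positive", ifelse(r < 0, "negative", "neutral"))
  data.frame(r = r, direction = direction, size = as.character(size),
             stringsAsFactors = FALSE)
}

describe.correlation(c(0.49, 0.60, -0.95, 0))
```

Labels like these are only a reading aid; the cutoffs are conventions rather than hard rules.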

Relevance

Correlation plots are matrices comprised of multiple correlation subplots for each variable of interest. They allow us to see all of the desired correlations at once, and can be customized to include the relevant data we wish to know including correlation sign, form, size, and significance.

Given that in educational research there are often many variables of interest, the ability to easily view all relationships simultaneously is extremely helpful. In turn, the information they present can be used to help answer a research question, such as:

“What is the relationship between leadership type and the percentage of disagreement, claims, and correct claims?”

Now that we understand a bit more about what a correlation plot is, what information is presented, and how to interpret that information, let’s learn how to create a correlation plot in R.

Steps for Evaluating Student Leadership Type by Using Correlation Matrices:

Complete the following steps to create a correlation plot:

Import/create a sample data frame containing student data
Check the data variable distributions for skew
Prepare the data for correlation
Create a single correlation plot
Create a correlation matrix
Further customize a correlation matrix
Interpret a correlation matrix

Step 1: Adding Data

Disclaimer: The sample data provided below is inspired by my work with Washington State University’s Department of Teaching and Learning. The sample data provided is not based on actual student data and was merely created to use as an example.

In this sample data there are three leadership styles, two of which were previously seen in the Sankey diagram code-through (Roach, 2022). These leadership styles include:

Tyrant: A confident, domineering leader who assumes power without consulting other group members
Partnership: When two or more group members equally share leadership
Reluctant: An unsure leader that arises out of group necessity after consulting with other group members

Let’s start by loading the necessary packages and importing the sample data.

library(readxl) # import data from Excel files
library(dplyr)  # data wrangling
library(pander) # easy viewing of data
library(magick) # viewing images from URL
library(psych)  # creating/customizing correlation plots

sample.data <- read_excel("sample.data.xlsx") # put document name in the  " "
(head(sample.data, 3))  # preview the top 3 rows of data

As you can see we have data about the group number, the leadership style present in each group, the total amount of lines of talk the group had, the frequencies and percentages of talk that were disagreements, the frequencies and percentages of talk that were self-generated claims, and the frequencies and percentages of talk that were scientifically accurate claims.
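If you do not have the Excel file, a stand-in data frame with the same kinds of columns can be simulated. This is only a sketch: every value is random, the column names group, freq.disagree, freq.claim, and freq.correct are assumptions (the other names match those used later in this code-through), and results from it will not match the correlations reported below.

```r
# Simulated stand-in for sample.data: random values, NOT the data used in this code-through
set.seed(2022)
n.groups <- 75

sim.data <- data.frame(
  group            = seq_len(n.groups),  # assumed name for the group number column
  leadership.style = rep(c("tyrant", "partnership", "reluctant"), each = 25),
  group.talk       = round(runif(n.groups, min = 300, max = 500))  # total lines of talk
)

# Raw frequencies (assumed column names), drawn relative to each group's total talk
sim.data$freq.disagree <- rbinom(n.groups, size = sim.data$group.talk, prob = 0.015)
sim.data$freq.claim    <- rbinom(n.groups, size = sim.data$group.talk, prob = 0.015)
sim.data$freq.correct  <- rbinom(n.groups, size = sim.data$freq.claim, prob = 0.6)

# Percentages of total talk
sim.data$percent.disagree <- 100 * sim.data$freq.disagree / sim.data$group.talk
sim.data$percent.claim    <- 100 * sim.data$freq.claim    / sim.data$group.talk
sim.data$percent.correct  <- 100 * sim.data$freq.correct  / sim.data$group.talk

head(sim.data, 3)
```

To follow the remaining steps with simulated values, you could assign sim.data to sample.data.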

These raw frequencies were converted into percentages to normalize the data based on the total number of lines of talk each group had. The reasoning behind this is that even if two groups had 5 lines of disagreement, this doesn’t tell us much unless we put it in terms relative to the length of a group’s conversation. A group with 5 lines of disagreement out of 200 lines total (2.5%) means something different than a group with 5 lines of disagreement out of 400 lines total (1.25%). Normalizing the data in this way therefore allows for fairer, apples-to-apples comparisons.
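The arithmetic behind this normalization can be sketched in a couple of lines. The data frame and the freq.disagree column name below are hypothetical and exist only to reproduce the two groups from the example.

```r
# Two hypothetical groups with the same number of disagreement lines
example.groups <- data.frame(
  group.talk    = c(200, 400),  # total lines of talk
  freq.disagree = c(5, 5)       # lines of disagreement (assumed column name)
)

# Percentage of each group's talk that was disagreement
example.groups$percent.disagree <-
  100 * example.groups$freq.disagree / example.groups$group.talk

example.groups$percent.disagree  # 2.50 and 1.25, matching the example above
```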

Step 2: Look at Data Distributions

The next important step is to look at the distribution of our data. As we learned in this course, the distribution of our data can impact how we go about making a correlation plot (CPP 529). If the data is normally distributed, such as in a bell curve, then no additional correction is needed. However, if the data distribution is highly skewed in either direction, then a correction, often in the form of logging the data, is needed to accurately assess the relationship between two variables (CPP 529). Below is an example of what left-skewed or right-skewed data looks like in comparison to normally (symmetrically) distributed data:

Figure: left-skewed, symmetric (normal), and right-skewed distributions. Image courtesy of CalcWorkshop.

An easy way to assess whether a variable’s distribution needs correction via logging is to create a histogram of both the raw and logged versions. If the logged version looks similar to the raw version, or more skewed than the raw version, this is a sign that the data does not need to be corrected for skew (CPP 529).
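Alongside the visual check, skew can also be summarized as a number. The sketch below defines a hypothetical skewness() helper using the standard moment-based formula and applies it to simulated right-skewed values; neither the helper nor the values come from the course materials.

```r
# Hypothetical helper: moment-based sample skewness
# (about 0 = symmetric, > 0 = right-skewed, < 0 = left-skewed)
skewness <- function(x) {
  x <- x[!is.na(x)]
  deviations <- x - mean(x)
  mean(deviations^3) / mean(deviations^2)^1.5
}

# Simulated right-skewed values, for illustration only
set.seed(1)
skewed.values <- rexp(200)

# Compare skewness before and after a log(x + 1) correction
c(raw = skewness(skewed.values), logged = skewness(log(skewed.values + 1)))
```

If the logged value is closer to zero than the raw value, logging reduced the skew; if it is no closer, or further away, the raw measure is probably the better choice.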

Let’s look at the distributions of our sample.data and determine which, if any, need correction.

# group.talk
par( mfrow=c(1,2) ) # allows for side-by-side histograms

# create a histogram of raw measures
hist( sample.data$group.talk, # data we want a histogram of
      breaks=25,              # how many breaks in the histogram
      col="goldenrod2",       # color of histogram
      border="white",         # border color
      yaxt="n",               # y-axis type, "n" suppresses plotting
      xlab="Lines of Talk",   # x-axis label
      ylab="",                # y-axis label
      xlim = range(300,500),  # specify x-axis range
      main="Total Group Talk Lines" ) # histogram title
# create a histogram of logged measures
hist( log(sample.data$group.talk + 1), # log the data used above
      breaks=25,
      col="goldenrod2", 
      border="white",
      yaxt="n",
      xlab="Lines of Talk (logged)", # adjust the label accordingly
      ylab="", 
      xlim = range(5.75, 6.25),
      main="Total Group Talk Lines" )

# ADAPT THIS CODE FOR EACH VARIABLE OF INTEREST BY CHANGING THE SUBSET OF DATA SELECTED, HISTOGRAM TITLE, AND LABELS. CHANGING THE COLOR HELPS DIFFERENTIATE THE HISTOGRAMS.

# percent.disagree
par( mfrow=c(1,2) ) # allows for side-by-side histograms

# create a histogram of raw measures
hist( sample.data$percent.disagree, breaks=25, col="hotpink", 
      border="white", yaxt="n", xlab=" Percent Lines of Talk", ylab="",
      xlim = range(0.5, 3.0), main="Percent Disagreement" )
# create a histogram of logged measures
hist( log(sample.data$percent.disagree + 1), breaks=25, col="hotpink",
      border="white", yaxt="n", xlab=" Percent Lines of Talk (logged)",
      ylab="", xlim= range(0.4, 1.4), main="Percent Disagreement" )

# percent.claim
par( mfrow=c(1,2) ) # allows for side-by-side histograms

# create a histogram of raw measures
hist( sample.data$percent.claim, breaks=25, col="yellowgreen",
      border="white", yaxt="n", xlab=" Percent Lines of Talk", ylab="",
      xlim = range(0, 3),main="Percent Claims" )
# create a histogram of logged measures
hist( log(sample.data$percent.claim + 1), breaks=25, col="yellowgreen",
      border="white", yaxt="n", xlab=" Percent Lines of Talk (logged)",
      ylab="", xlim = range(0, 1.4), main="Percent Claims" )

# percent.correct
par( mfrow=c(1,2) ) # allows for side-by-side histograms

# create a histogram of raw measures
hist( sample.data$percent.correct, breaks=25, col="darkgreen",
      border="white", yaxt="n", xlab=" Percent Lines of Talk", ylab="",
      xlim = range(0,2), main="Percent Correct Claims" )
# create a histogram of logged measures
hist( log(sample.data$percent.correct + 1), breaks=25, col="darkgreen",
      border="white", yaxt="n", xlab=" Percent Lines of Talk (logged)",
      ylab="", xlim = range(0,1), main="Percent Correct Claims" )

As we can see above, all of our variable distributions are relatively normal, and logging them either doesn’t impact the distribution (it remains relatively normal), or it exacerbates the skew. Therefore, going forward with our creation of a correlation plot we will be using the raw measures for all variables.
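Since the same pair of histograms is repeated for every variable, the pattern could be wrapped in a function. The sketch below is one possible way to do that; compare.distributions() is a hypothetical helper, the axis limits are left to R's defaults, and the example call uses simulated values rather than sample.data.

```r
# Hypothetical helper: draw raw and logged histograms of one variable side by side
compare.distributions <- function(x, title, xlab, color = "goldenrod2") {
  old.par <- par(mfrow = c(1, 2))  # side-by-side panels
  on.exit(par(old.par))            # restore the previous layout afterwards
  raw <- hist(x, breaks = 25, col = color, border = "white",
              yaxt = "n", xlab = xlab, ylab = "", main = title)
  logged <- hist(log(x + 1), breaks = 25, col = color, border = "white",
                 yaxt = "n", xlab = paste(xlab, "(logged)"), ylab = "", main = title)
  invisible(list(raw = raw, logged = logged))
}

# Example call on simulated values standing in for sample.data$group.talk
set.seed(1)
fake.talk <- round(rnorm(75, mean = 400, sd = 30))
plots <- compare.distributions(fake.talk, "Total Group Talk Lines", "Lines of Talk")
```

With the real data, the call would look like compare.distributions(sample.data$percent.claim, "Percent Claims", "Percent Lines of Talk", "yellowgreen").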

Step 3: Preparing the Data for Correlation

Before creating a correlation matrix, one important thing to consider is the overarching focus of our analysis. For example, do we want to understand the overall relationship between group total talk and group disagreement for all groups? If so, we would use all the data available for our numeric variables.

However, what if we want to look at this relationship for a specific subset, such as only for groups with tyrannical leaders? While correlation plots can provide a lot of information and can be personalized to enhance data visualization, they can only work with numeric data. Categorical data, such as type of leadership, doesn’t work. As a result, we must create a separate correlation plot for each of our leadership types by subsetting the data based on leadership type.

Therefore, we have to do a bit of data preparation since we already identified that we need to log-correct any skewed data and have learned that we can’t use categorical variables.

While in this instance we don’t need to log-correct any of the variables, I will include an example of how it would be done for the percent.disagree variable within the sample.data data frame.

# log-correcting variables identified to have skew
# create a new vector with the logged data for each variable
logged.disagree <- log(sample.data$percent.disagree + 1) 

# add log-corrected variable(s) back into sample.data
sample.data$logged.disagree <- logged.disagree 

head(sample.data, 1) # preview the data set to confirm the new vector was added

Adding the log-corrected variable back into our original data frame as a separate variable improves clarity and makes tracing the data back much easier.

Now our next step is to subset and create four new data frames that we can use to create our correlation plots. These four data frames will include:

corr.data which will have all the numeric data for all groups, meaning it will not have the group number or leadership style variables.
tyrant.data which will have all the numeric data for groups with the tyrannical leadership style.
partnership.data which will have all the numeric data for groups with the partnership leadership style.
reluctant.data which will have all the numeric data for groups with the reluctant leadership style.

# new data frame of variables of interest for ALL groups
corr.data <- sample.data %>% 
               select(group.talk, percent.disagree, percent.claim,
                      percent.correct) # use select to retain only the numeric variables

# tyrant leadership data frame
tyrant.data <- sample.data %>% 
                 filter(leadership.style == "tyrant") %>% # filter by leadership type
                 select(group.talk, percent.disagree, percent.claim,
                        percent.correct)

# partnership leadership data frame
partnership.data <- sample.data %>% 
                      filter(leadership.style == "partnership") %>%
                      select(group.talk, percent.disagree, percent.claim,
                             percent.correct)

# reluctant leadership data frame
reluctant.data <- sample.data %>% 
                    filter(leadership.style == "reluctant") %>%
                    select(group.talk, percent.disagree, percent.claim,
                           percent.correct)

Great! Now we have created four different data frames for the different correlation plots we want to create.
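As an alternative sketch, base R's split() can build all of the per-style data frames in one step and return them as a named list. The toy.data frame below is simulated (random values, 4 groups per style) so that the example runs on its own; with the real data you would split sample.data instead.

```r
# Small simulated stand-in for sample.data (random values, not study data)
set.seed(10)
toy.data <- data.frame(
  leadership.style = rep(c("tyrant", "partnership", "reluctant"), each = 4),
  group.talk       = round(runif(12, 300, 500)),
  percent.disagree = runif(12, 0.5, 3),
  percent.claim    = runif(12, 0, 3),
  percent.correct  = runif(12, 0, 2)
)

numeric.vars <- c("group.talk", "percent.disagree", "percent.claim", "percent.correct")

# split() returns a named list with one numeric-only data frame per leadership style
style.data <- split(toy.data[, numeric.vars], toy.data$leadership.style)

names(style.data)         # one element per leadership style
sapply(style.data, nrow)  # number of groups in each subset
```

Individual subsets are then available as style.data$tyrant, style.data$partnership, and style.data$reluctant.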

Step 4: Plotting Correlation Between Two Variables in R

There are many ways we can plot correlation in R. The simplest way is to create a scatter plot of two variables.

# set.seed() is carried over from the R CODER sample code; it only matters for
# randomly generated data and has no effect on the plot below
set.seed(1)

# creating the plot
plot(sample.data$group.talk,    # first variable
     sample.data$percent.claim, # second variable
     pch = 19,                  # point symbol
     col = "goldenrod2",        # point color
     xlab = "Total Group Talk", # x-axis label
     ylab= "Group Claims %")    # y-axis label

# create a regression line
abline(lm(sample.data$percent.claim ~ sample.data$group.talk), # regression equation
       col = "red", # regression line color
       lwd = 3)     # regression line width

# Pearson correlation pasted in correlation plot
text(paste("Correlation:", # paste the correlation value in the scatter plot
     # round the correlation to 2 decimal points     
     round(cor(sample.data$group.talk, sample.data$percent.claim), 2)),
     x = 360,    # x coordinate for pasted correlation value
     y = 2.0,    # y coordinate for pasted correlation value
     col= "red") # color of pasted correlation value

Sample code courtesy of R CODER

However, creating an individual scatter plot is only helpful when there are only two variables to compare. In our case there are more than that. Therefore, we are going to want to create a larger correlation matrix, often called a correlogram, in order to determine the relationships between all of our variables (R Coder, n.d.).
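Before plotting, it can help to look at the same information numerically. As a minimal sketch, base R's cor() returns every pairwise coefficient when it is given a data frame of numeric columns. The toy.corr.data frame below is simulated for illustration, so its coefficients are not the ones reported in this code-through.

```r
# Simulated numeric data standing in for corr.data (random values, not study data)
set.seed(3)
toy.corr.data <- data.frame(
  group.talk       = round(runif(75, 300, 500)),
  percent.disagree = runif(75, 0.5, 3),
  percent.claim    = runif(75, 0, 3),
  percent.correct  = runif(75, 0, 2)
)

# Pairwise Pearson correlations, rounded to 2 decimal places
corr.matrix <- round(cor(toy.corr.data, method = "pearson"), 2)
corr.matrix
```

The matrix is symmetric with 1s on the diagonal, which is why correlation plots usually show different information above and below the diagonal.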

Step 5: Plotting Correlation Matrices in R

Arguably the easiest way to create a correlation matrix is using the pairs() function, which creates a correlation plot from a data frame (R Coder, n.d.). This function is particularly handy because it can quickly create scatter plots for each pairing of the variables in the data frame, while presenting them all at once. However, what makes a correlation matrix particularly useful is the ability to customize it. That is why the arguments within the pairs() function are so handy. Below are examples of a plain correlation matrix using our corr.data and a customized correlation matrix using that same data.

# plain correlation matrix
pairs(corr.data)

# customized correlation matrix
pairs( corr.data,                               #data frame of variables
       labels= c("Total Talk", "% Disagreement",# variable labels
                 "% Claims", "% Correct Claims"),
       pch= 21,                                 # point symbol
       bg= "deeppink",                          # symbol background color
       col= "deeppink",                         # symbol border color
       main= "All Leadership Types",            # plot title
       row1attop = TRUE,   # If FALSE, changes direction of diagonal
       gap = 1,                                 # Distance between subplots
       cex.labels = NULL,                       # Size of label text
       font.labels = 2)                         # Font style of label text

Sample code courtesy of R CODER
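The pairs() function also accepts custom panel functions. The sketch below, loosely adapted from the example in the pairs() help page, prints each pair's Pearson coefficient in the upper triangle. The panel.cor() function and the toy data frame are illustrative assumptions, not part of the R CODER sample code.

```r
# Panel function (adapted from the pairs() help page example):
# prints the Pearson correlation of each pair of variables
panel.cor <- function(x, y, digits = 2, ...) {
  old.par <- par(usr = c(0, 1, 0, 1))  # temporarily use 0-1 coordinates for the text
  on.exit(par(old.par))                # restore the panel's coordinates afterwards
  r <- cor(x, y)
  text(0.5, 0.5, format(round(r, digits), nsmall = digits), cex = 1.5)
}

# Simulated data standing in for corr.data (random values, not study data)
set.seed(5)
toy <- data.frame(
  group.talk       = round(runif(75, 300, 500)),
  percent.disagree = runif(75, 0.5, 3),
  percent.claim    = runif(75, 0, 3),
  percent.correct  = runif(75, 0, 2)
)

pairs(toy,
      labels = c("Total Talk", "% Disagreement", "% Claims", "% Correct Claims"),
      pch = 21, bg = "deeppink", col = "deeppink",
      upper.panel = panel.cor,  # coefficients above the diagonal
      main = "All Leadership Types (simulated)")
```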

Step 6: Further Customization of Correlation Plots

For additional customization, the pairs.panels() function of the psych package is a great option to customize a correlation matrix beyond what the pairs() function is capable of. With this function additional information can be adjusted within, or added to, the correlation plot including (RDocumentation, n.d.):

Either linear or LOESS regression lines with or without confidence intervals and specified alpha level
Density plots and histograms with color and break adjustments
Correlation ellipses indicating the densest region of points which can help visualize correlation strength and direction
Correlation method parameter
Report correlation values and change their text size
Allow for weight specification of the correlations
Statistical significance indications

The pairs.panels() function was adapted from the panel.cor(), panel.cor.scale(), panel.hist(), and ellipse() functions. It is particularly useful for plotting fewer than 10 variables, and is excellent for giving an overview of the data (RDocumentation, n.d.).

Let’s try further customizing our corr.data correlation plot.

pairs.panels(corr.data,          # data frame used
             labels= c("Total Talk", "% Disagreement", # variable labels
                       "% Claims", "% Correct Claims"),
             smooth = TRUE,      # TRUE draws a smooth LOESS line
             scale = FALSE,      # TRUE scales corr. text by corr. size
             density = TRUE,     # TRUE includes density plots/histograms
             ellipses = TRUE,    # TRUE draws correlation ellipses
             method = "pearson", # corr. method (or "spearman", "kendall")
             pch = 21,           # point symbol
             col= "yellowgreen", # point border color
             bg= "yellowgreen",  # point background color
             lm= FALSE,          # TRUE plots a linear fit instead of LOESS
             digits= 2,          # number of significant digits to include
             cor = TRUE,         # TRUE reports correlations
             jiggle = FALSE,     # TRUE jitters points before plotting 
             factor = 2,         # jitter factor
             hist.col = "yellowgreen",# histogram color
             show.points = TRUE, # shows data points
             rug = TRUE,         # TRUE draws rug under histogram
             breaks = 10,        # specify histogram breaks
             smoother = FALSE,   # TRUE smooth.scatter the data points
             stars = TRUE,       # TRUE marks statistically significant correlations
             ci= FALSE)           # Model fit confidence interval

Sample code courtesy of R CODER and RDocumentation
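The stars in the plot summarize significance tests on each pair of variables. To see the underlying p-values, one option is to run base R's cor.test() on every pair, as sketched below. The cor.pvalues() helper and the toy data are hypothetical, and the 0.05 cutoff is an assumed significance level.

```r
# Hypothetical helper: p-value for every pair of columns, using base R's cor.test()
cor.pvalues <- function(data, method = "pearson") {
  vars <- names(data)
  p <- matrix(NA_real_, length(vars), length(vars), dimnames = list(vars, vars))
  for (i in seq_along(vars)) {
    for (j in seq_along(vars)) {
      if (i != j) {
        p[i, j] <- cor.test(data[[i]], data[[j]], method = method)$p.value
      }
    }
  }
  p  # the diagonal stays NA: a variable is not tested against itself
}

# Simulated data with one built-in relationship (random values, not study data)
set.seed(8)
toy <- data.frame(percent.disagree = runif(75, 0.5, 3))
toy$percent.claim   <- 0.5 * toy$percent.disagree + rnorm(75, sd = 0.5)
toy$percent.correct <- runif(75, 0, 2)

p.matrix <- cor.pvalues(toy)
round(p.matrix, 3)
p.matrix < 0.05  # TRUE where a pair is significant at the assumed 0.05 level
```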

Sweet! Now we know how to create and customize our correlation plots. Let’s go ahead and create and customize our other three correlation plots for each of the leadership styles.

# tyrant correlation plot
pairs.panels(tyrant.data,        # data frame used
             labels= c("Total Talk", "% Disagreement", # variable labels
                       "% Claims", "% Correct Claims"),
             smooth = TRUE,      # TRUE draws a smooth LOESS line
             density = TRUE,     # TRUE includes density plots/histograms
             ellipses = TRUE,    # TRUE draws correlation ellipses
             method = "pearson", # corr. method (or "spearman", "kendall")
             pch = 21,           # point symbol
             col= "darkturquoise", # point border color
             bg= "darkturquoise",  # point background color
             digits= 2,          # number of significant digits to include
             cor = TRUE,         # TRUE reports correlations
             hist.col = "darkturquoise",# histogram color
             show.points = TRUE, # shows data points
             breaks = 10,        # specify histogram breaks
             stars = TRUE)       # TRUE marks statistically significant correlations

# partnership correlation plot
pairs.panels(partnership.data,   # data frame used
             labels= c("Total Talk", "% Disagreement", # variable labels
                       "% Claims", "% Correct Claims"),
             smooth = TRUE,      # TRUE draws a smooth LOESS line
             density = TRUE,     # TRUE includes density plots/histograms
             ellipses = TRUE,    # TRUE draws correlation ellipses
             method = "pearson", # corr. method (or "spearman", "kendall")
             pch = 21,           # point symbol
             col= "orchid1",     # point border color
             bg= "orchid1",      # point background color
             digits= 2,          # number of significant digits to include
             cor = TRUE,         # TRUE reports correlations
             hist.col ="orchid1",# histogram color
             show.points = TRUE, # shows data points
             breaks = 10,        # specify histogram breaks
             stars = TRUE)       # TRUE marks statistically significant correlations

# reluctant correlation plot
pairs.panels(reluctant.data,     # data frame used
             labels= c("Total Talk", "% Disagreement", # variable labels
                       "% Claims", "% Correct Claims"),
             smooth = TRUE,      # TRUE draws a smooth LOESS line
             density = TRUE,     # TRUE includes density plots/histograms
             ellipses = TRUE,    # TRUE draws correlation ellipses
             method = "pearson", # corr. method (or "spearman", "kendall")
             pch = 21,           # point symbol
             col= "darkorange",  # point border color
             bg= "darkorange",   # point background color
             digits= 2,          # number of significant digits to include
             cor = TRUE,         # TRUE reports correlations
             hist.col = "darkorange",# histogram color
             show.points = TRUE, # shows data points
             breaks = 10,        # specify histogram breaks
             stars = TRUE)       # TRUE marks statistically significant correlations

We have finally finished creating and customizing all of our correlation plots. However, we still need to interpret them in order to begin to answer our research questions.
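Because the three leadership plots differ only in their data and color, they could also be drawn in a loop. The following is a sketch under a few assumptions: toy.data is simulated (random values, not study data), the style.colors vector is a hypothetical lookup, and the plotting step runs only if the psych package is installed.

```r
# Simulated stand-in for sample.data (random values, not study data)
set.seed(7)
toy.data <- data.frame(
  leadership.style = rep(c("tyrant", "partnership", "reluctant"), each = 25),
  group.talk       = round(runif(75, 300, 500)),
  percent.disagree = runif(75, 0.5, 3),
  percent.claim    = runif(75, 0, 3),
  percent.correct  = runif(75, 0, 2)
)
numeric.vars <- c("group.talk", "percent.disagree", "percent.claim", "percent.correct")

# Hypothetical lookup: one plot color per leadership style
style.colors <- c(tyrant = "darkturquoise", partnership = "orchid1",
                  reluctant = "darkorange")

if (requireNamespace("psych", quietly = TRUE)) {
  for (style in names(style.colors)) {
    # keep only the numeric variables for the current leadership style
    style.data <- toy.data[toy.data$leadership.style == style, numeric.vars]
    psych::pairs.panels(style.data,
                        labels = c("Total Talk", "% Disagreement",
                                   "% Claims", "% Correct Claims"),
                        method = "pearson",
                        pch = 21,
                        col = style.colors[[style]],
                        bg = style.colors[[style]],
                        hist.col = style.colors[[style]],
                        breaks = 10,
                        stars = TRUE,
                        main = paste(style, "leadership (simulated)"))
  }
}
```

Writing the call once keeps the three plots consistent if an argument needs to change later.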

Step 7: Interpreting Correlation Plots

As mentioned previously, our research question for this code-through was:

“What is the relationship between leadership type and the percentage of disagreement, claims, and correct claims?”

Looking at our four different correlation plots we can determine the following:

When looking across all leadership styles using corr.data we see that there are positive, monotonic relationships of roughly moderate size between percent disagreement, claims, and correct claims. Specifically, percent disagreement and percent claims have a correlation coefficient of 0.49 (just under the 0.50 cutoff for a moderate correlation), percent disagreement and percent correct claims have a coefficient of 0.6, and percent claims and percent correct claims have a coefficient of 0.52. Therefore, we can infer that as percent disagreement increases percent claims tends to increase as well, and in turn, as percent claims increases so does the percent of correct claims. Logically this makes sense. Typically, increased amounts of disagreement represent increased levels of “science-like talk”, including the questioning, evaluation, and discussion of people’s ideas. As a result, with increased discussion of ideas, it is logical that more claims would be presented for evaluation. Since these ideas are being evaluated as a group, the hope is that the probability of generating a correct claim would be higher.
However, when looking at the correlation plots of each specific leadership style, we see that there are no statistically significant relationships between percent disagreement, percent claims, and percent correct claims. Does this inherently mean that there are no meaningful relationships between leadership styles and the variables of interest? Not necessarily. As stated previously, correlation plots only reveal the relationships between variables in a sample, which is not always indicative of a population (Jaadi, 2019). While there may be meaningful relationships between these variables in the real world (i.e., among all students in the class), our sample may not be a good representation of that. Additionally, when it comes to significance tests, another important factor is the sample size. The smaller the sample size, the less likely it is that it accurately represents the population being drawn from. As a result, the statistical power of our test is reduced (Jaadi, 2019).

If we look at all of our sample.data we can see that there were a total of 75 groups, with 25 groups in each type of leadership style:

sample.data %>% count(leadership.style) %>% pander()

leadership.style	n
partnership	25
reluctant	25
tyrant	25

According to a common rule of thumb based on the central limit theorem, a sample size of about 30 is generally considered large enough for the sampling distribution to be approximately normal, unless the population itself is normally distributed, in which case smaller samples can suffice (LaMorte, 2016). Since we are unable to show that the population our sample data was drawn from is normally distributed across all variables of interest, the only correlation plot that met this guideline, and therefore plausibly had sufficient power to detect meaningful relationships, was the one using corr.data (75 groups). Therefore, the only thing we can actually conclude with regard to our research question is that we do not have a sufficiently large sample size for each leadership type (25 groups each) to detect significant relationships between our variables of interest.
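One way to see how sample size limits what a significance test can detect is to compute the smallest Pearson coefficient that would reach significance for a given number of groups. The sketch below uses the standard relationship between r and the t statistic with n - 2 degrees of freedom; the critical.r() helper and the 0.05 significance level are assumptions and do not come from the sources cited here.

```r
# Smallest |r| that reaches two-sided significance for a Pearson correlation,
# derived from the t-test with n - 2 degrees of freedom
critical.r <- function(n, alpha = 0.05) {
  df <- n - 2
  t.crit <- qt(1 - alpha / 2, df)
  t.crit / sqrt(t.crit^2 + df)
}

# 25 groups (one leadership style) versus 75 groups (all styles combined)
round(critical.r(c(25, 75)), 2)  # about 0.40 and 0.23
```

With 25 groups, a correlation needs to be roughly 0.40 or larger in absolute value to be flagged as significant, while with all 75 groups a correlation of roughly 0.23 is enough. Smaller true relationships can therefore go undetected in the per-style plots.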

While the sample data used in this code-through was randomly generated in Excel, the problem of lacking statistical power is one that my team and I face frequently when conducting educational research. Our sample size is frequently reduced from the desired level for two reasons: microphones do not work properly or are muffled by clothing, and study volunteers work with non-volunteers, which eliminates their transcripts from consideration due to Institutional Review Board (IRB) limitations on research using human subjects. This inherently limits how we can interpret the analyses of our data. While this can be incredibly frustrating, it is also a common problem in educational research, particularly in talk-data analysis.

However, the silver lining of it all is that even if we lack sufficient statistical power to detect relationships between leadership style and our variables of interest, we often have sufficient power to analyze general trends across the whole sample. This can help reinforce our logic of how students interact with one another in small group activities. Additionally, though our quantitative analysis abilities are limited for leadership type, we can still conduct qualitative case studies of each leadership type and capture trends we personally notice and the context in which they occur.

Further Resources

Learn more about the research project I am a part of, various ways to create correlation plots in R, and the pairs.panels() function by visiting the following:

National Science Foundation Research Grant Award Abstract # 1822490 nsf.gov/awardsearch
Correlation Plot in R. R CODER
RDocumentation for pairs.panels. RDocumentation

Works Cited

This code-through references and cites the following sources:

Correlation plot in R. (n.d.). R CODER.
Descriptive Analysis of Community Change. (n.d.). CPP 529 Course Website.
LaMorte, Wayne. (2016). The Role of Probability - Central Limit Theorem. Boston University, School of Public Health.
Jaadi, Zakaria. (2019). Everything you need to know about interpreting correlations. Towards Data Science.
pairs.panels: SPLOM, histograms and correlations for a data matrix. (n.d.). RDocumentation.
Roach, Dana. (2022). Introduction to Creating Sankey Diagrams for Educational Research in R. RPubs.
What’s the difference between Qualitative (Categorical) Data and Quantitative (Numerical) Data? (2020, September 20). CalcWorkshop.

Introduction to Creating Correlation Plots for Educational Research in R

Dana Roach

30 April 2022