Welcome to this R demo session! Here, I will demonstrate how to use R to conduct profile analysis.
Below is a hypothetical data set appropriate for using profile analysis as an alternative to repeated-measures ANOVA.
We’re exploring a hypothetical scenario: How much do Shakira, Donald Trump, and Dr. Phil value power, money, appearance, and intelligence? It’s important to note that the data set we’re examining, as well as the research question posed, are entirely fictional and created for illustrative purposes.
To answer this question, we have created five clones of each of these three people. Each clone has a chip implanted in their brain to measure their attitude toward power, money, appearance, and intelligence. Each clone was also placed in a simulation that resembled their real life, and their interest in power, money, appearance, and intelligence was recorded as a value from 1 to 10.
power = c(3,4,4,2,2,8,9,9,8,10,6,7,5,6,6)
money = c(5,7,6,6,5,9,9,9,10,9,8,9,8,9,9)
appearance = c(7,8,8,7,8,2,2,3,2,1,6,7,7,8,7)
intelligence = c(8,7,8,7,7,1,2,2,2,1,8,8,7,9,8)
# For each person, we have five clones
names = rep(c("Shakira", "Donald Trump", "Dr. Phil"), c(5, 5, 5))
profile_data <- data.frame(names, power, money, appearance, intelligence)
Let’s visualize the data using a profile plot, which displays the mean scores by each person (IV) and by each attribute (DV) we are interested in.
library(ggplot2)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
profile_data_long <- profile_data %>%
  pivot_longer(cols = c(power, money, appearance, intelligence), # Specify the columns to be pivoted into long format
               names_to = "Variable", # Rename the new column holding the original column names as "Variable"
               values_to = "Value") # Rename the new column holding the values from the original columns as "Value"
ggplot(data = profile_data_long, aes(x = as.factor(Variable),
                                     y = Value, group = names)) +
  geom_point(aes(color = names), position = position_jitter(width = .1), alpha = .08, size = 2) + # add individual data points
  stat_summary(fun = mean,
               geom = "point",
               aes(color = names)) + # add the mean as a point
  stat_summary(fun = mean,
               geom = "line",
               aes(color = names)) + # add the line connecting each group's means
  stat_summary(fun.data = mean_cl_boot, # bootstrapped CI; requires the Hmisc package to be installed
               geom = "errorbar",
               width = 0.3,
               aes(color = names)) + # add error bars
  labs(x = "Attribute",
       y = "Mean Scores") + # rename x- and y-axis
  scale_color_brewer(palette = "Set1")
From the graph, we can easily see that the groups being compared seem to have distinct profiles. For example, money is the most highly valued attribute for the Donald Trump clones and for the Dr. Phil clones, power is rated highest by the Donald Trump clones, and the Shakira clones score highest on appearance (again, this is a totally hypothetical data set).
From the graph, you can also see that the lines tend to cross each other. Even with the naked eye, we can tell that the profiles are not parallel, do not sit at the same level, and are not flat. In addition, within each group, the four attributes do not seem to be valued equally.
What the graph suggests is that the three groups do not share the same level of interest in power, money, appearance, and intelligence, at least not in the same pattern.
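To back up the visual impression with numbers, here is a minimal base-R sketch that tabulates the mean of each attribute within each group. The object names `dat` and `group_means` are my own choices for illustration; the data are retyped from the vectors above so that the block runs on its own.

```r
# Rebuild the hypothetical data so this block is self-contained
dat <- data.frame(
  names        = rep(c("Shakira", "Donald Trump", "Dr. Phil"), c(5, 5, 5)),
  power        = c(3, 4, 4, 2, 2, 8, 9, 9, 8, 10, 6, 7, 5, 6, 6),
  money        = c(5, 7, 6, 6, 5, 9, 9, 9, 10, 9, 8, 9, 8, 9, 9),
  appearance   = c(7, 8, 8, 7, 8, 2, 2, 3, 2, 1, 6, 7, 7, 8, 7),
  intelligence = c(8, 7, 8, 7, 7, 1, 2, 2, 2, 1, 8, 8, 7, 9, 8)
)

# Mean of each attribute within each group of clones
group_means <- aggregate(cbind(power, money, appearance, intelligence) ~ names,
                         data = dat, FUN = mean)
group_means
```

Each row of `group_means` is one profile: these are the values that the lines in the profile plot connect.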
Now, you might say that we cannot just use a graph to judge whether the profiles are similar or not. We need statistics to do that. You are right, this is where profile analysis comes into play.
Using the pbg() function within the profileR package, we can easily perform the profile analysis, testing for parallelism, coincident profiles, and flatness.
#install.packages("profileR") # please install the package if you have not done so.
library(profileR)
## Loading required package: RColorBrewer
## Loading required package: reshape
##
## Attaching package: 'reshape'
## The following object is masked from 'package:lubridate':
##
## stamp
## The following object is masked from 'package:dplyr':
##
## rename
## The following objects are masked from 'package:tidyr':
##
## expand, smiths
## Loading required package: lavaan
## This is lavaan 0.6-19
## lavaan is FREE software! Please report any bugs.
# Create a dataset without names as this will be used for the pbg() function.
profile_NoName <- profile_data %>%
  select(-names)
model = pbg(profile_NoName, profile_data$names, original.names = T, profile.plot = T) # specifying profile.plot = T displays the profile plot
The profile plot generated by the pbg() function, while informative, is not ideally presented: the legend occupies too much space, which detracts from the overall clarity of the visualization. A plot produced with ggplot2, in contrast, offers a more refined appearance, making it a better option for graphical data presentation.
Using summary(model) gives us more detailed results for the profile analysis.
summary(model)
## Call:
## pbg(data = profile_NoName, group = profile_data$names, original.names = T,
## profile.plot = T)
##
## Hypothesis Tests:
## $`Ho: Profiles are parallel`
## Multivariate.Test Statistic Approx.F num.df den.df p.value
## 1 Wilks 0.008867296 32.06503 6 20 3.025510e-09
## 2 Pillai 1.521533407 11.66007 6 22 6.987849e-06
## 3 Hotelling-Lawley 51.958568149 77.93785 6 18 6.786168e-12
## 4 Roy 50.780651335 186.19572 3 11 1.044885e-09
##
## $`Ho: Profiles have equal levels`
## Df Sum Sq Mean Sq F value Pr(>F)
## group 2 10.68 5.338 26.14 4.23e-05 ***
## Residuals 12 2.45 0.204
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## $`Ho: Profiles are flat`
## F df1 df2 p-value
## 1 54.87772 3 10 1.626441e-06
Parallelism is the main test of interest in a profile analysis by group because the test of parallelism examines whether each segment of a profile is identical. A segment refers to the difference in the values of the same variables across multiple time points or the difference between multiple variables in a single time point. Therefore, the segment is simply the slope of the line between the means of two adjacent variables. Parallelism is assessed using a one-way MANOVA. If the null hypothesis of parallelism is rejected, there is a significant interaction between group membership (groups of clones) and the variables (the four attributes) or group membership and the time points (e.g., if a test is repeatedly administered). In other words, the amount of increase or decrease between successive measurements of the variable is different for at least one of the groups.
In this example, all four test statistics for the test of parallelism (Wilks, Pillai, Hotelling-Lawley, and Roy) have p-values below the .05 level. We can therefore reject the null hypothesis and conclude that the profiles are not parallel. In other words, the three groups of clones do not show the same pattern of values across power, money, appearance, and intelligence.
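To make the idea of a segment concrete, the sketch below computes the three adjacent-attribute differences with a contrast matrix and then averages them within each group. This is only an illustration in base R; the names `dat`, `C`, `segments`, and `segment_means` are my own, and the data are retyped from above so the block runs on its own.

```r
# Rebuild the hypothetical data so this block is self-contained
dat <- data.frame(
  names        = rep(c("Shakira", "Donald Trump", "Dr. Phil"), c(5, 5, 5)),
  power        = c(3, 4, 4, 2, 2, 8, 9, 9, 8, 10, 6, 7, 5, 6, 6),
  money        = c(5, 7, 6, 6, 5, 9, 9, 9, 10, 9, 8, 9, 8, 9, 9),
  appearance   = c(7, 8, 8, 7, 8, 2, 2, 3, 2, 1, 6, 7, 7, 8, 7),
  intelligence = c(8, 7, 8, 7, 7, 1, 2, 2, 2, 1, 8, 8, 7, 9, 8)
)
X <- as.matrix(dat[, c("power", "money", "appearance", "intelligence")])

# Each row of C takes the difference between two adjacent attributes
C <- rbind(PM = c(1, -1,  0,  0),   # power - money
           MA = c(0,  1, -1,  0),   # money - appearance
           AI = c(0,  0,  1, -1))   # appearance - intelligence

segments <- X %*% t(C)              # 15 x 3 matrix of difference scores

# Mean segment (slope) per group: rows are groups, columns are segments
segment_means <- apply(segments, 2, function(col) tapply(col, dat$names, mean))
segment_means
```

If the profiles were parallel, the rows of `segment_means` would be roughly equal. Here they clearly are not: for example, the mean money-minus-appearance segment is 7.2 for the Donald Trump clones but -1.8 for the Shakira clones.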
If the profiles are parallel, one typically tests equality of the levels, which examines whether the profiles coincide (i.e., there are no group differences). This test is used for determining whether at least one group scored higher than other groups, on average, across all the variables or time points. To evaluate this, the scores are averaged across all time points or variables for each case, and these averages are then compared across groups. Since all of the time points or variables are collapsed into a single mean, the procedure becomes univariate: a one-way ANOVA testing the between-groups main effect. The test simply compares the between-group variation with the within-group variation, the two components of the total sum of squares. Based on this test, if the group levels are significantly different from one another, then the null hypothesis of equal levels is rejected. That is, at least one group scores significantly higher or lower than the other groups based on the average of the p variables.
From the results, the p-value for the one-way ANOVA is below the .05 significance level, suggesting that the overall level (the mean score averaged across the four attributes) differs across the three groups of clones.
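The between-group versus within-group comparison can be written out by hand. The following is a minimal sketch in base R, not the code that pbg() runs internally; the names `dat`, `level`, `ss_between`, and `ss_within` are my own, and the data are retyped from above so the block runs on its own.

```r
# Rebuild the hypothetical data so this block is self-contained
dat <- data.frame(
  names        = rep(c("Shakira", "Donald Trump", "Dr. Phil"), c(5, 5, 5)),
  power        = c(3, 4, 4, 2, 2, 8, 9, 9, 8, 10, 6, 7, 5, 6, 6),
  money        = c(5, 7, 6, 6, 5, 9, 9, 9, 10, 9, 8, 9, 8, 9, 9),
  appearance   = c(7, 8, 8, 7, 8, 2, 2, 3, 2, 1, 6, 7, 7, 8, 7),
  intelligence = c(8, 7, 8, 7, 7, 1, 2, 2, 2, 1, 8, 8, 7, 9, 8)
)
grp <- factor(dat$names)

# Level of each clone: the mean of its four attribute scores
level <- rowMeans(dat[, c("power", "money", "appearance", "intelligence")])

grp_mean <- tapply(level, grp, mean)     # mean level per group
n_g      <- tapply(level, grp, length)   # clones per group

# Partition the total sum of squares
ss_between <- sum(n_g * (grp_mean - mean(level))^2)
ss_within  <- sum((level - ave(level, grp))^2)

df_between <- nlevels(grp) - 1
df_within  <- length(level) - nlevels(grp)

F_value <- (ss_between / df_between) / (ss_within / df_within)
p_value <- pf(F_value, df_between, df_within, lower.tail = FALSE)
c(SS_between = ss_between, SS_within = ss_within, F = F_value, p = p_value)
```

The sums of squares and the F value should agree with the `Ho: Profiles have equal levels` table in the summary(model) output above.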
Note that if the variables are not measured on a comparable scale, it’s important to standardize them to z-scores before advancing with the analysis.
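As a minimal sketch of that standardization step, the base-R function scale() converts each column to z-scores. In our example all four attributes already share the same 1-10 scale, so this step is not needed here and is shown purely for illustration; the names `dat`, `attrs`, and `z` are my own.

```r
# Rebuild the hypothetical data so this block is self-contained
dat <- data.frame(
  names        = rep(c("Shakira", "Donald Trump", "Dr. Phil"), c(5, 5, 5)),
  power        = c(3, 4, 4, 2, 2, 8, 9, 9, 8, 10, 6, 7, 5, 6, 6),
  money        = c(5, 7, 6, 6, 5, 9, 9, 9, 10, 9, 8, 9, 8, 9, 9),
  appearance   = c(7, 8, 8, 7, 8, 2, 2, 3, 2, 1, 6, 7, 7, 8, 7),
  intelligence = c(8, 7, 8, 7, 7, 1, 2, 2, 2, 1, 8, 8, 7, 9, 8)
)
attrs <- c("power", "money", "appearance", "intelligence")

# Standardize each attribute across all 15 clones (mean 0, SD 1)
z <- dat
z[attrs] <- as.data.frame(scale(dat[attrs]))

round(colMeans(z[attrs]), 10)   # all 0 after standardization
apply(z[attrs], 2, sd)          # all 1 after standardization
```

One thing to keep in mind: because every standardized variable has an overall mean of 0, the profile averaged over all cases is flat by construction, so a flatness test on variables standardized this way is no longer informative.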
Flatness is a measure of the extent to which the profiles are flat within any group (i.e., there are no differences in the average values of the variables or the average value of a single variable measured across multiple time points), given that the profiles are parallel. The null hypothesis of flatness is that the segments are 0. This is the analog to the profile analysis for one sample except for multiple groups or repeated measurements.
Note that this question is typically relevant only if the profiles are parallel. If the profiles are not parallel, then at least one of them is necessarily not flat. Although it is conceivable that nonflat profiles from two or more groups could cancel each other out to produce, on average, a flat profile, this result is often not of research interest.
In our example, the null hypothesis for this test of flatness is that, pooled over the Shakira, Donald Trump, and Dr. Phil clones, the four attributes are valued equally (i.e., all three segments are 0). Because the p-value for the test is far below .05, the null hypothesis is rejected.
The pbg() function is quite useful for those with a
grasp of profile analysis, as it automates the process of testing for
parallelism, coincident profiles, and flatness. However, to deepen your
understanding, it’s beneficial to know what occurs statistically during
these tests. For this reason, I will guide you through the profile
analysis process using fundamental R code, explaining each step to
ensure clarity and enhance your comprehension of the underlying
methodology.
To perform the parallelism test from scratch, we first need to compute the difference scores between adjacent attributes, because the test of parallelism is essentially just a one-way MANOVA on these difference scores.
attach(profile_data)
## The following objects are masked _by_ .GlobalEnv:
##
## appearance, intelligence, money, names, power
# Using attach(profile_data) streamlines the process by allowing us to reference variables directly, omitting the need to specify the dataset with each variable call.
PM = power - money
MA = money - appearance
AI = appearance - intelligence
diff = data.frame(names, PM, MA, AI)
diff
## names PM MA AI
## 1 Shakira -2 -2 -1
## 2 Shakira -3 -1 1
## 3 Shakira -2 -2 0
## 4 Shakira -4 -1 0
## 5 Shakira -3 -3 1
## 6 Donald Trump -1 7 1
## 7 Donald Trump 0 7 0
## 8 Donald Trump 0 6 1
## 9 Donald Trump -2 8 0
## 10 Donald Trump 1 8 0
## 11 Dr. Phil -2 2 -2
## 12 Dr. Phil -2 2 -1
## 13 Dr. Phil -3 1 0
## 14 Dr. Phil -3 1 -1
## 15 Dr. Phil -3 2 -1
detach(profile_data)
# One-way MANOVA on DVs' differences
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
options(scipen=999) # control the penalty for displaying numbers in scientific notation (R will favor regular fixed notation when printing numbers)
model = manova(cbind(PM, MA, AI) ~ factor(names))
summary(model, test="Wilks")
## Df Wilks approx F num Df den Df Pr(>F)
## factor(names) 2 0.0088673 32.065 6 20 0.000000003026 ***
## Residuals 12
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(model, test="Pillai")
## Df Pillai approx F num Df den Df Pr(>F)
## factor(names) 2 1.5215 11.66 6 22 0.000006988 ***
## Residuals 12
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(model, test="Hotelling")
## Df Hotelling-Lawley approx F num Df den Df Pr(>F)
## factor(names) 2 51.959 77.938 6 18 0.000000000006786 ***
## Residuals 12
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(model, test="Roy")
## Df Roy approx F num Df den Df Pr(>F)
## factor(names) 2 50.781 186.2 3 11 0.000000001045 ***
## Residuals 12
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
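To see where the Wilks statistic comes from, the sketch below computes it directly as the ratio of the determinant of the within-group SSCP matrix to the determinant of the total SSCP matrix of the difference scores. This is a base-R illustration rather than what manova() does internally; the names `dat`, `D`, `W`, and `T_mat` are my own, and the data are retyped from above so the block runs on its own.

```r
# Rebuild the hypothetical data so this block is self-contained
dat <- data.frame(
  names        = rep(c("Shakira", "Donald Trump", "Dr. Phil"), c(5, 5, 5)),
  power        = c(3, 4, 4, 2, 2, 8, 9, 9, 8, 10, 6, 7, 5, 6, 6),
  money        = c(5, 7, 6, 6, 5, 9, 9, 9, 10, 9, 8, 9, 8, 9, 9),
  appearance   = c(7, 8, 8, 7, 8, 2, 2, 3, 2, 1, 6, 7, 7, 8, 7),
  intelligence = c(8, 7, 8, 7, 7, 1, 2, 2, 2, 1, 8, 8, 7, 9, 8)
)
grp <- factor(dat$names)

# Difference scores between adjacent attributes
D <- cbind(PM = dat$power - dat$money,
           MA = dat$money - dat$appearance,
           AI = dat$appearance - dat$intelligence)

# Within-group SSCP: center each column at its own group mean
D_within <- apply(D, 2, function(col) col - ave(col, grp))
W <- crossprod(D_within)

# Total SSCP: center each column at its overall mean
T_mat <- crossprod(sweep(D, 2, colMeans(D)))

# Wilks' lambda: share of generalized variance NOT explained by group
wilks <- det(W) / det(T_mat)
wilks
```

The value should match the Wilks statistic in the output above (about 0.0089). A lambda this close to 0 means that almost all of the variation in the segments lies between the groups.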
From our lecture, we understand that to execute the test of coincidental profiles independently, we should carry out a one-way ANOVA. This involves using the average scores across the four focal attributes for each individual as the DV, and the categorization variable—the three clone groups—as the IV.
profile_data <- profile_data %>%
  rowwise() %>%
  # Create a new column 'mean' to store the row-wise mean
  mutate(mean = mean(c_across(c(power, money, appearance, intelligence)), na.rm = TRUE))
model <- aov(mean ~ names, data = profile_data)
summary(model)
## Df Sum Sq Mean Sq F value Pr(>F)
## names 2 10.68 5.338 26.14 0.0000423 ***
## Residuals 12 2.45 0.204
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
When we test the flatness of the profiles, we are interested in knowing whether adjacent DVs differ from each other. We can restate this question as whether the mean difference score for each pair of adjacent DVs equals zero.
Again, this test is usually irrelevant if the test of parallelism is significant (showing an interaction between IV groups and DVs). However, we are still performing the test for demonstration purposes.
We use Hotelling's T² test to test flatness, here applied separately to each group of clones. The null hypothesis for this test is that the mean vector of the difference scores (PM, MA, AI) is equal to the vector c(0, 0, 0), which represents no difference between adjacent dependent variables (i.e., the profile is flat).
# install.packages("ICSNP")
library(ICSNP)
## Loading required package: mvtnorm
## Loading required package: ICS
Shakira = HotellingsT2(diff[1:5, 2:4], mu=c(0, 0, 0))
Shakira
##
## Hotelling's one sample T2-test
##
## data: diff[1:5, 2:4]
## T.2 = 28.525, df1 = 3, df2 = 2, p-value = 0.03406
## alternative hypothesis: true location is not equal to c(0,0,0)
Trump = HotellingsT2(diff[6:10, 2:4], mu=c(0, 0, 0))
Trump
##
## Hotelling's one sample T2-test
##
## data: diff[6:10, 2:4]
## T.2 = 187.67, df1 = 3, df2 = 2, p-value = 0.005305
## alternative hypothesis: true location is not equal to c(0,0,0)
Phil = HotellingsT2(diff[11:15, 2:4], mu=c(0, 0, 0))
Phil
##
## Hotelling's one sample T2-test
##
## data: diff[11:15, 2:4]
## T.2 = 81.833, df1 = 3, df2 = 2, p-value = 0.0121
## alternative hypothesis: true location is not equal to c(0,0,0)
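From the results, all three tests are significant at the .05 level, so the null hypothesis is rejected in each case: the Shakira, Donald Trump, and Dr. Phil clones do not have flat profiles. In other words, within each group, the four attributes are not valued equally. Note that these are three separate one-sample tests, one per group, so their statistics differ from the single flatness test reported by pbg(), which pools the three groups.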
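For comparison with the single flatness result that pbg() reported earlier (F = 54.88 on 3 and 10 degrees of freedom), here is a minimal sketch of a pooled version of the test. It assumes the standard multi-group form: Hotelling's T² is computed from the overall mean of the difference scores and the pooled within-group covariance matrix, then converted to an F statistic. I have not inspected the pbg() source, so treat this as an illustration; the names `dat`, `D`, `S_pooled`, and `d_bar` are my own.

```r
# Rebuild the hypothetical data so this block is self-contained
dat <- data.frame(
  names        = rep(c("Shakira", "Donald Trump", "Dr. Phil"), c(5, 5, 5)),
  power        = c(3, 4, 4, 2, 2, 8, 9, 9, 8, 10, 6, 7, 5, 6, 6),
  money        = c(5, 7, 6, 6, 5, 9, 9, 9, 10, 9, 8, 9, 8, 9, 9),
  appearance   = c(7, 8, 8, 7, 8, 2, 2, 3, 2, 1, 6, 7, 7, 8, 7),
  intelligence = c(8, 7, 8, 7, 7, 1, 2, 2, 2, 1, 8, 8, 7, 9, 8)
)
grp <- factor(dat$names)
D <- cbind(PM = dat$power - dat$money,
           MA = dat$money - dat$appearance,
           AI = dat$appearance - dat$intelligence)

N <- nrow(D)       # total number of clones
g <- nlevels(grp)  # number of groups
q <- ncol(D)       # number of segments (attributes - 1)

# Pooled within-group covariance matrix of the difference scores
D_within <- apply(D, 2, function(col) col - ave(col, grp))
S_pooled <- crossprod(D_within) / (N - g)

# Overall mean segment vector; flatness says this is c(0, 0, 0)
d_bar <- colMeans(D)

# Hotelling's T^2 and its F transformation
T2      <- N * drop(crossprod(d_bar, solve(S_pooled, d_bar)))
df2     <- N - g - q + 1
F_value <- T2 * df2 / ((N - g) * q)
p_value <- pf(F_value, q, df2, lower.tail = FALSE)
c(F = F_value, df1 = q, df2 = df2, p = p_value)
```

With these data, the F value and degrees of freedom should agree with the `Ho: Profiles are flat` line of the summary(model) output.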
Closing remarks:
You may decide whether to conduct a post-hoc analysis depending on the research question and/or the MANOVA and ANOVA results obtained from the profile analysis to identify the differences. For more details on post-hoc analysis, please refer back to the relevant sections in previous lectures.