For teachers, psychologists, researchers, and friends and family members, people are highly complex and not easily characterized by a single characteristic or personality trait. In the social sciences broadly, and in psychology particularly, there is a statistical method that can be used to describe how people, in their individual particularities, are similar on the basis of some set of measures, and through which they can be grouped in meaningful, distinctive ways. This approach, which has its provenance in developmental research (Bergman & El-Khouri, 1997; Magnusson & Cairns, 1996), though it is now widely used in educational, social, personality, and other domains of psychology, is an example of a general mixture model (Harring & Hodis, 2016; Pastor, Barron, Miller, & Davis, 2007).
In this tutorial, we aim to describe one of the most commonly used applications of the general mixture model, and one especially relevant to psychologists: latent profile analysis (LPA), which applies to cases in which all of the variables from which (relatively) homogeneous groups are identified within a (relatively) heterogeneous sample are continuous. After describing the method and some examples of its use, we provide a tutorial for carrying out LPA in the context of a freely available, open-source statistical software package we created for R (R Core Team, 2019), tidyLPA. Finally, we offer some informed recommendations for researchers aiming to use LPA in their applied work, and conclude with reflections on the role of statistical software, especially software that is freely available, open-source, and highly performant, in the psychological sciences.
The goal of LPA is to estimate the parameters for a number of distributions (typically multivariate) from a single dataset. Such an approach is model-based, and some descriptions in the literature refer to it as model-based clustering (Hennig, Meila, Murtagh, & Rocci, 2015; Scrucca, Fop, Murphy, & Raftery, 2017). Thus, one distinction between LPA and other, similar cluster-analytic approaches is that LPA is model-based: instead of using algorithms to group together cases, LPA seeks to estimate parameters (in terms of variances and covariances, and how they are the same or different across profiles) that best characterize the different distributions. This approach then seeks to assign to each observation a probability that the observation is a sample from the population associated with each profile (or mixture component).
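To make this concrete, the general mixture model underlying LPA can be written as a weighted sum of component densities. The following is a standard statement of the finite Gaussian mixture density, in conventional notation (the symbols below are generic, not drawn from a particular source cited above):

$$f(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \, \phi(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k), \qquad \sum_{k=1}^{K} \pi_k = 1, \quad \pi_k \geq 0$$

Here, $K$ is the number of profiles, $\pi_k$ is the mixing proportion for profile $k$, and $\phi(\cdot \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$ is the multivariate normal density with mean vector $\boldsymbol{\mu}_k$ and covariance matrix $\boldsymbol{\Sigma}_k$. The parameterizations described below differ in the constraints placed on the $\boldsymbol{\mu}_k$ and $\boldsymbol{\Sigma}_k$ across profiles; the probability that an observation belongs to a given profile then follows from Bayes' theorem.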
Because LPA is model-based, a number of different model parameterizations can be estimated. These models differ in terms of whether, and how, parameters are estimated across the profiles. These parameters are the means for the different profiles, which, in this approach, are always estimated freely across the profiles; the variances for the variables used to create the profiles, which can be estimated freely or can be estimated to be the same, or equal, across profiles; and the covariances of the variables used to create the profiles, which can be freely estimated, estimated to be equal, or fixed to be zero.
We wish to note that a challenge facing the analyst using LPA is the different terminology used to describe these parameters and the distinct model parameterizations that can be estimated. As one example, Scrucca et al. (2017) refer to these parameterizations not in terms of whether and how parameters are estimated, but rather in terms of the geometric properties of the distributions that result from particular parameterizations. Muthen and Muthen (1997-2017) and others (Pastor et al., 2007) commonly refer to local independence to mean that the covariances are fixed to zero (also described as the specification of the covariance matrix as "diagonal," because only the diagonal components, or the variances, are estimated).
In general, as more parameters are estimated (i.e., those that are fixed to zero are instead estimated as being equal across profiles, or those estimated as being equal across profiles are freely estimated across them), the model becomes more complex; the model may fit better, but it may also be overfit, meaning that the profiles identified may be challenging to replicate with another, separate data set. Even still, flexibility in terms of which models can be estimated also has affordances. For example, consider the model with varying means, equal variances, and covariances fixed to 0. A researcher might choose this specification if she wants to model the variables used to create the profiles as independent. This model is very simple, as no covariances are estimated and the variances are estimated to be the same across profiles. As we estimate more parameters (and decrease the degrees of freedom), we are more likely to fit the data well, but less likely to be able to replicate the model with a second set of data; in other words, more parameters may mean a loss of external validity. More information on the model parameterizations is discussed in the context of the software tool tidyLPA that we have developed.
A note of caution is warranted about LPA in the context of its potential. Bauer (2007) notes that many samples of data can be usefully broken down into profiles, and that the addition of profiles will likely be suggested for reasons other than the sample coming from more than one distribution (i.e., due to non-normality in the variables measured). Bauer also cautions that profiles should not be reified: profiles do not necessarily exist outside of the analysis, and they should be interpreted more as useful interpretative devices. These cautions suggest that, in general, parsimony, interpretability, and a general sense that the profiles are not necessarily real, but rather helpful analytic tools, should be priorities for both the analyst and the reader of studies using this approach.
SPSS is a common tool for carrying out cluster analyses (but not LPA). While somewhat straightforward to use, particularly through SPSS's graphical user interface (GUI), this approach presents some challenges. The GUI makes it difficult, even for the most able analyst, to document every step with syntax, and so reproducing the entire analysis efficiently can be a challenge, both for the analyst exploring various solutions and for the reviewer looking to replicate this work. Additionally, SPSS is commercial software (and is expensive), and so analysts without the software cannot carry out this analysis.
Another common tool is MPlus (Muthen & Muthen, 1997-2017). MPlus is a commercial tool that provides functionality for many latent variable (and multi-level) models. We will speak more to MPlus in the section on tidyLPA, as our software provides an interface to both it and an open-source tool.
In R, a number of tools can be used to carry out LPA. OpenMx can be used for this purpose (and to specify almost any model that can be specified within a latent variable modeling approach). However, while OpenMx is very flexible, it can also be challenging to use. Other tools in R allow for estimating Gaussian mixture models, or models of multivariate Gaussian (or normal) distributions; in this framework, the term "mixture component" has a meaning similar to that of a profile. While more constrained than the latent variable modeling framework, the estimation approach is often similar or the same: the expectation-maximization (EM) algorithm is used to (aim to) obtain the maximum likelihood estimates for the parameters being estimated. As in the latent variable modeling framework, different models can be specified.
In addition to following the same general approach, using tools that are designed for Gaussian mixture modeling has other benefits, some efficiency-related (see Rmixmod, which uses compiled C++ code) and others in terms of ease of use (e.g., the plot methods built in to Rmixmod, mclust, and other tools). However, such tools also have some drawbacks: it can be difficult to translate between the model specifications, which are often described in terms of the geometric properties of the multivariate distributions being estimated (e.g., "spherical, equal volume"), rather than in terms of whether and how the means, variances, and covariances are estimated. They also may use different default settings (than those encountered in MPlus) for the expectation-maximization algorithm, which can make comparing results across tools challenging.
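As a brief illustration of this translation issue, the sketch below fits a Gaussian mixture directly with mclust, using the built-in iris measurements as stand-in data; mclust's geometric model name "EEI" corresponds, in the terminology used in this tutorial, to equal variances with covariances fixed to zero:

library(mclust)

# Fit a three-component Gaussian mixture to four continuous variables.
# "EEI" is diagonal with equal volume and shape: equal variances across
# components, covariances fixed to zero.
fit <- Mclust(iris[, 1:4], G = 3, modelNames = "EEI")
summary(fit) # model name, log-likelihood, BIC, and component sizes
head(fit$z)  # posterior probabilities of component membership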
Because of the limitations of other tools, we set out to develop a tool that a) provided sensible defaults and was easy to use, but offered the option to access and modify all of the inputs to the model (i.e., low barrier, high ceiling); b) interfaced with existing tools, translating between what those tools are capable of and what researchers and analysts carrying out person-oriented analyses would like to specify; c) made it easy to carry out fully reproducible analyses; and d) was well-documented.
This package focuses on models that are commonly specified as part of LPA. Because MPlus is so widely used, it can be helpful to compare output from other software to MPlus. The functions in tidyLPA that use mclust have been benchmarked to MPlus for a series of simple models (with small datasets and small numbers of profiles). This R Markdown output contains information on how mclust and MPlus compare. The R Markdown file used to generate the output is also available here and, as long as you have purchased MPlus (and installed MplusAutomation), can be used to replicate all of the results for the benchmark. Note that most of the output is identical, though there are some differences in the hundredths decimal place for some values. Because of differences in settings for the EM algorithm, and particularly for the start values (random starts for MPlus and starting values from hierarchical clustering for mclust), differences may be expected for more complex data and models.
One way that tidyLPA is designed to be easy to use is that it assumes a "tidy" data structure (Wickham, 2014). This means that it emphasizes the use of a data frame as both the primary input and output of the package's functions. Because data is passed to and returned (in amended form, i.e., with the latent profile probabilities and classes appended) from the function, it is easy to create plots or use the results in subsequent analyses. Another noteworthy feature of tidyLPA is that it provides the same functionality through two different tools: one that is open-source and available through R, the mclust package (Scrucca et al., 2017), and one that is available through the commercial software MPlus (Muthen & Muthen, 1997-2017). Moreover, as both tools use the same maximum likelihood estimation procedure, they are benchmarked to produce the same output. Also, note that on the website for tidyLPA (see here) we have described, for the six models that can be estimated, what is estimated in terms of the variances and covariances, the common names for the models (e.g., class-varying unrestricted), and the covariance matrix associated with each parameterization.
You can install tidyLPA from CRAN with:
install.packages("tidyLPA")
You can also install the development version of tidyLPA from GitHub with:
install.packages("devtools")
devtools::install_github("data-edu/tidyLPA")
Here is a brief example using the built-in pisaUSA15 data set and variables for broad interest, enjoyment, and self-efficacy. Note that we first type the name of the data frame, followed by the unquoted names of the variables used to create the profiles. We also specify the number of profiles and the model. See ?estimate_profiles for more details.

In these examples, we pass the results of one function to the next by piping (using the %>% operator, loaded from the dplyr package). We pass the data to a function that selects relevant variables, and then to estimate_profiles():
library(tidyLPA)
library(dplyr)
pisaUSA15[1:100, ] %>%
select(broad_interest, enjoyment, self_efficacy) %>%
single_imputation() %>%
estimate_profiles(3)
## tidyLPA analysis using mclust:
##
## Model Classes AIC BIC Entropy prob_min prob_max n_min n_max BLRT_p
## 1 3 635.55 672.03 0.80 0.86 0.94 0.03 0.64 0.01
We can use MPlus simply by changing the package argument for estimate_profiles():
pisaUSA15[1:100, ] %>%
select(broad_interest, enjoyment, self_efficacy) %>%
single_imputation() %>%
estimate_profiles(3, package = "MplusAutomation")
A simple summary of the analysis, including fit statistics and posterior probability ranges, is printed to the console. The resulting object can be further passed down a pipeline to other functions, such as plot(), compare_solutions(), get_data(), and get_fit(). This is the "tidy" part, in that the function can be embedded in a tidy analysis pipeline.
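For instance, fit statistics can be pulled out of an estimated object with get_fit(). In this brief sketch, m3 is simply an illustrative name for the stored result:

# Store the estimates from the example above, then extract fit statistics
m3 <- pisaUSA15[1:100, ] %>%
  select(broad_interest, enjoyment, self_efficacy) %>%
  single_imputation() %>%
  estimate_profiles(3)
get_fit(m3) # returns fit statistics (e.g., LogLik, AIC, BIC, entropy) as a data frame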
If you have MPlus installed, you can call the version of this function that uses MPlus in the same way, by adding the argument package = "MplusAutomation".
We can plot the profiles by piping the output to plot_profiles().
pisaUSA15[1:100, ] %>%
select(broad_interest, enjoyment, self_efficacy) %>%
single_imputation() %>%
scale() %>%
estimate_profiles(3) %>%
plot_profiles()
In addition to the number of profiles (specified with the n_profiles argument), the model can be specified in terms of whether and how the variable variances and covariances are estimated. The models are specified by passing arguments to the variances and covariances arguments. The possible values for these arguments are:

variances: "equal" and "varying"
covariances: "zero", "equal", and "varying"

If no values are specified for these, then the variances are constrained to be equal across classes, and covariances are fixed to 0 (conditional independence of the indicators).
These arguments allow for four models to be specified:

Equal variances and covariances fixed to 0 (Model 1)
Varying variances and covariances fixed to 0 (Model 2)
Equal variances and equal covariances (Model 3)
Varying variances and varying covariances (Model 6)
Two additional models (Models 4 and 5) can be fit using MPlus. More information on the models can be found in the vignette.
Here is an example of specifying a model with varying variances and covariances (Model 6; not run here):
pisaUSA15[1:100, ] %>%
select(broad_interest, enjoyment, self_efficacy) %>%
single_imputation() %>%
estimate_profiles(3,
variances = "varying",
covariances = "varying")
The function compare_solutions() compares the fit of several estimated models, with varying numbers of profiles and model specifications:
pisaUSA15[1:100, ] %>%
select(broad_interest, enjoyment, self_efficacy) %>%
single_imputation() %>%
estimate_profiles(1:3,
variances = c("equal", "varying"),
covariances = c("zero", "varying")) %>%
compare_solutions(statistics = c("AIC", "BIC"))
A few helper functions are available to make it easier to work with the output of an analysis. get_data() returns the data:
m <- pisaUSA15[1:100, ] %>%
select(broad_interest, enjoyment, self_efficacy) %>%
single_imputation() %>%
estimate_profiles(3:4)
get_data(m)
We note that get_data() returns data in wide format when applied to an object of class tidyProfile (one element of a tidyLPA object), or when applied to a tidyLPA object of length one. get_data() returns long format when applied to a tidyLPA object containing multiple tidyProfile analyses (because then the wide format does not make sense).
To transform data in the wide format into the long format, the gather() function from the tidyr package can be used, e.g.:
get_data(m) %>%
tidyr::gather(Class_prob, Probability, contains("CPROB"))
## # A tibble: 700 x 9
## model_number classes_number broad_interest enjoyment self_efficacy Class
## <dbl> <int> <dbl> <dbl> <dbl> <dbl>
## 1 1 3 3.8 4 1 1
## 2 1 3 3 3 2.75 3
## 3 1 3 1.8 2.8 3.38 3
## 4 1 3 1.4 1 2.75 2
## 5 1 3 1.8 2.2 2 3
## 6 1 3 1.6 1.6 1.88 3
## 7 1 3 3 3.8 2.25 1
## 8 1 3 2.6 2.2 2 3
## 9 1 3 1 2.8 2.62 3
## 10 1 3 2.2 2 1.75 3
## # … with 690 more rows, and 3 more variables: Class_prob <int>,
## # Probability <dbl>, id <int>
There are a number of analytic choices that need to be made when carrying out person-oriented analyses. Because such approaches are often more subjective (in practice) than other approaches (Linnenbrink-Garcia & Wormington, 2017), there is no one rule for determining the solution obtained. The solution is arrived at on the basis of multiple decisions, such as the number of profiles selected, which specific options are used for the cluster analysis (e.g., the distance metric used to calculate the similarity of the observations as part of Ward's hierarchical clustering), or which parameters are estimated, and how, as part of LPA.
Given the subjectivity involved, it is important that researchers be transparent and work as part of a team to obtain clustering solutions. Transparency about the design and analytic choices is important so that readers can appropriately interpret the report. Researchers can enhance transparency and reproducibility by sharing detailed descriptions of methodology and documenting it through syntax (and, if possible, data) shared with others. Working as part of a team can serve as a check on several of the choices researchers make, such as over-fitting or under-fitting the model to the data. Each decision depends on multiple factors and on balancing tensions. We discuss each of the key decisions in such an analysis below.
The data can be transformed by centering or scaling. Typically, this is done after the profiles are created, so that differences between profiles can be more easily explored. It can also be done prior to the analysis, which can be helpful for obtaining solutions when the variables are on very different scales.
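As a minimal sketch of transforming the data prior to analysis, base R's scale() can be used with the pisaUSA15 variables from the examples above; the center and scale arguments control whether the variables are centered, standardized, or both:

pisaUSA15[1:100, ] %>%
  select(broad_interest, enjoyment, self_efficacy) %>%
  single_imputation() %>%
  scale(center = TRUE, scale = FALSE) %>% # centering only; set scale = TRUE to also standardize
  estimate_profiles(3)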
In the case of choosing the number of profiles (and the specification of the model / profile solution), multiple criteria, such as the BIC or the proportion of variance explained, are recommended for decision-making; interpretability in light of theory, parsimony, and evidence from cross-validation should also be considered.
This section highlights three general recommendations. First, scholars taking a person-oriented approach should emphasize reproducibility in carrying out analyses, in part because of the exploratory and interpretative nature of person-oriented approaches. To this end, presenting multiple models, as in Pastor et al. (2007), should be encouraged, rather than presenting a single solution. In addition, having multiple analysts review the solutions found is encouraged. As part of this recommendation, we suggest that researchers consider how their analyses can be reproduced by analysts outside of the research team: sharing code and data is an important part of this work. Relatedly, researchers should consider how profiles replicate across samples.
A second general recommendation concerns incorporating time more flexibly into analyses when data are collected across multiple time points. I-states as objects analysis (ISOA) groups all time points together and does not make distinctions among them. Other approaches perform the analysis separately at different time points (Corpus & Wormington, 2014). Still others integrate time as part of the profiles, as in growth mixture modeling (groups of growth patterns) or within-person growth modeling (individual growth patterns). Research to date has yet to consider additional challenges in applying person-centered approaches to longitudinal data. For instance, Schmidt et al. (2018) used an Experience Sampling Method (ESM) approach to collect data and a person-centered approach to generate profiles of students in science class. This work does not account for student-level effects; in other words, it did not model the shared variance of multiple observations of the same student.
Third, best practices in within-person or longitudinal research call for modeling the nesting structure. However, researchers have yet to successfully incorporate this practice into person-centered approaches. One way to approach this is to use cross-classified mixed-effects models, as in Strati, Schmidt, and Maier (2017) and Rosenberg (2018). In such approaches, dependencies in terms of, for example, individuals responding to (ESM) surveys at the same time, and repeated responses being associated with the same individuals, can both be modeled, although the effects of these two sources of dependency are not nested as in common uses of multi-level models, but rather are cross-classified. West, Welch, and Galecki (2014) provide a description of the use of multi-level models with cross-classified data and of tools (including those freely available through R) that can be used to estimate them.
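To illustrate what a cross-classified specification looks like, the sketch below uses the lme4 package; the data frame esm_data and its columns (engagement, student_id, signal_id) are hypothetical placeholders rather than data from the studies cited above:

library(lme4)

# Random intercepts for students and for signal occasions are crossed, not
# nested: each student responds at many signals, and each signal is answered
# by many students. esm_data, engagement, student_id, and signal_id are
# hypothetical names used only for illustration.
m_cc <- lmer(engagement ~ 1 + (1 | student_id) + (1 | signal_id), data = esm_data)
summary(m_cc)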
Person-oriented analysis is a way to consider how psychological constructs are experienced (and can be analyzed) together and at once. Though described in contrast to a variable-centered approach, scholars have pointed out how person-oriented approaches are complementary to variable-centered analyses (Marsh, Ludtke, Trautwein, & Morin, 2009). A person-oriented approach can help us to consider multiple variables together and at once and in a more dynamic way, reflected in the methodological approaches of cluster analysis and LPA, which identify profiles of individuals' responses.
This manuscript provided an outline of how to get started with person-oriented analyses in an informed way. We provided a general overview of the methodology and described tools to carry out such an analysis, emphasizing freely available, open-source options that we have developed. Because of the inherently exploratory nature of person-oriented analysis, carrying out the analysis in a trustworthy and open way is particularly important. In this way, the interpretative aspect of settling on a solution shares features with both quantitative and qualitative research: the systematic nature of quantitative research methods (focused upon agreed-upon criteria such as likelihood-ratio tests) and of qualitative research methods (focused upon the trustworthiness of both the analysis and the analyst) are important to consider when carrying out person-oriented analysis. Lastly, we made some general recommendations for future directions, and also highlighted some situations for which person-oriented approaches may not be the best choice, along with cautions raised in past research regarding how such approaches are used.
In conclusion, as use of person-oriented approaches expands, new questions and opportunities for carrying out research in a more holistic, dynamic way will be presented. Analyzing constructs together and at once is appealing to researchers, particularly those carrying out research in fields such as education, for which communicating findings to stakeholders in a way that has the chance to impact practice is important. Our aim was not to suggest that such an approach is always the goal or should always be carried out, but rather to describe how researchers may get started in an informed way as they seek to understand how individuals interact, behave, and learn in ways that embrace the complexity of these experiences.
Bergman, L. R., & Magnusson, D. (1997). A person-oriented approach in research on developmental psychopathology. Development and Psychopathology, 9(2), 291-319.
Bergman, L. R., & Trost, K. (2006). The person-oriented versus the variable-oriented approach: Are they complementary, opposites, or exploring different worlds? Merrill-Palmer Quarterly, 52(3), 601-632.
Collins, L. M., & Lanza, S. T. (2010). Latent class and latent transition analysis: With applications in the social, behavioral, and health sciences (Vol. 718). John Wiley & Sons.
Corpus, J. H., & Wormington, S. V. (2014). Profiles of intrinsic and extrinsic motivations in elementary school: A longitudinal analysis. The Journal of Experimental Education, 82(4), 480-501.
Harring, J. R., & Hodis, F. A. (2016). Mixture modeling: Applications in educational psychology. Educational Psychologist, 51(3-4), 354-367.
Hayenga, A. O., & Corpus, J. H. (2010). Profiles of intrinsic and extrinsic motivations: A person-centered approach to motivation and achievement in middle school. Motivation and Emotion, 34(4), 371-383.
Hennig, C., Meila, M., Murtagh, F., & Rocci, R. (Eds.). (2015). Handbook of cluster analysis. CRC Press.
Linnenbrink-Garcia, L., & Wormington, S. V. (2017). Key challenges and potential solutions for studying the complexity of motivation in schooling: An integrative, dynamic person-oriented perspective. British Journal of Educational Psychology monograph series II. Psychological aspects of education: Current trends—the role of competence beliefs in teaching and learning. Chichester, UK: Wiley.
Magnusson, D., & Cairns, R. B. (1996). Developmental science: Toward a unified framework. Cambridge, England: Cambridge University Press.
Marsh, H. W., Lüdtke, O., Trautwein, U., & Morin, A. J. (2009). Classical latent profile analysis of academic self-concept dimensions: Synergy of person- and variable-centered approaches to theoretical models of self-concept. Structural Equation Modeling, 16(2), 191-225.
McLachlan, G. J. (2011). Commentary on Steinley and Brusco (2011): Recommendations and cautions.
Muthen, L. K., & Muthen, B. O. (1997-2017). Mplus user's guide. Los Angeles, CA: Muthén & Muthén.
Rosenberg, J. M., Schmidt, J. A., Beymer, P. N., & Steingut, R. R. (2017). prcr: Person-centered analysis. R package version 0.1.5. https://CRAN.R-project.org/package=prcr
Rosenberg, J. M., Schmidt, J. A., Beymer, P. N., & Steingut, R. R. (2018). Interface to mclust to easily carry out latent profile analysis [Statistical software for R]. https://github.com/jrosen48/tidyLPA
Scrucca, L., Fop, M., Murphy, T. B., & Raftery, A. E. (2017). mclust 5: Clustering, classification and density estimation using Gaussian finite mixture models. The R Journal, 8(1), 205-233.
Steinley, D., & Brusco, M. J. (2011). Evaluating mixture modeling for clustering: Recommendations and cautions. Psychological Methods, 16(1), 63.
Steinley, D., & Brusco, M. J. (2011). K-means clustering and mixture model clustering: Reply to McLachlan (2011) and Vermunt (2011).
Trevors, G. J., Kendeou, P., Bråten, I., & Braasch, J. L. (2017). Adolescents' epistemic profiles in the service of knowledge revision. Contemporary Educational Psychology, 49, 107-120.
Vermunt, J. K. (2011). K-means may perform as well as mixture model clustering but may also be much worse: Comment on Steinley and Brusco (2011).
West, B. T., Welch, K. B., & Galecki, A. T. (2014). Linear mixed models: A practical guide using statistical software. Chapman and Hall/CRC.
Wickham, H. (2014). Tidy data. Journal of Statistical Software, 59(10), 1-23.