Initial Interpretations of a Predictive Model
Badge
As a reminder, to earn a badge for each lab, you are required to respond to a set of prompts for two parts:
In Part I, you will reflect on your understanding of key concepts and begin to think about potential next steps for your own study.
In Part II, you will take preliminary steps toward interpreting a supervised machine learning model.
Part I: Reflect and Plan
Part A:
- What were some key differences between the regression (inferential) and supervised machine learning (predictive) models?
Regression is a way to make inferences about a population from a sample, whereas supervised machine learning is a way to make precise/accurate predictions about unknown data based on prior data.
The interpretability of the outputs seems more straightforward with the well-established regression techniques, but maybe after this training I will say something different! :)
- Describe how a supervised machine learning approach could be useful given your research interests/a select research question.
- Josh already knows that I want to train the machine to identify certain key words and phrases that coaches give their students during practices (in robotics competitions) as evidence of “need-supportive coaching” (based on examples from self-determination theory research); then, I want to test how well the machine codes future data from coaches, since I will be collecting video/audio data to capture these words and phrases over three years. (A rough sketch of what this workflow could look like in R appears at the end of Part A.)
- Describe how a regression modeling approach (or an extension of a regression approach, such as an SEM or multi-level model) could be useful given your research interests/a select research question.
- Since I will be examining the differences between an intervention group and a comparison group, I am looking at how kids are nested in teams nested in intervention groups. So, this is MLM for the comparative analysis (although I am not sure I have enough units of measurement at Level 2). I am trying to tease out how professional development for coaches improves the feedback they give students during practice. That is, I will teach coaches the kind of language that is most need-supportive and then compare the feedback that coaches in the intervention group give with the feedback from coaches in the comparison group. For example, if I train a machine to recognize “autonomy supportive” feedback (e.g., “feel free to work with a friend”) or “competence supportive” feedback (e.g., “I see you have a really efficient code for your robot’s movements on the Lego task.”), I would hypothesize that need-supportive feedback from coaches in the intervention group will be more prevalent than that from the comparison group (assuming the trained model is accurate). Eventually, we will use the prevalence of need-supportive feedback as a predictor of students’ perceptions of need support (for competence, autonomy, and relatedness/belongingness), which will be modeled as a mediator/moderator of youth creativity in a team context (creative robots).
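To make the supervised piece of this plan concrete, here is a minimal sketch (in R, using tidymodels and textrecipes) of how hand-coded coach utterances could be used to train a text classifier and then check how well it codes held-out utterances. The data frame `coach_feedback` and its columns `utterance` (the text) and `code` (the human-assigned label, e.g., “autonomy_support” vs. “other”) are hypothetical placeholders, not real project data.

```r
# A minimal sketch, assuming a hand-coded data frame of coach utterances.
library(tidymodels)
library(textrecipes)

set.seed(123)
feedback_split <- initial_split(coach_feedback, strata = code)
feedback_train <- training(feedback_split)
feedback_test  <- testing(feedback_split)

# Turn raw utterances into token counts / tf-idf features
feedback_rec <- recipe(code ~ utterance, data = feedback_train) |>
  step_tokenize(utterance) |>
  step_tokenfilter(utterance, max_tokens = 500) |>
  step_tfidf(utterance)

# A regularized logistic regression is a common first classifier for text
feedback_wf <- workflow() |>
  add_recipe(feedback_rec) |>
  add_model(logistic_reg(penalty = 0.01, mixture = 1) |>
              set_engine("glmnet"))

feedback_fit <- fit(feedback_wf, data = feedback_train)

# How well does the model code utterances it has not seen?
predict(feedback_fit, feedback_test) |>
  bind_cols(feedback_test) |>
  metrics(truth = code, estimate = .pred_class)
```

In practice, the human-coded transcripts from the first wave of data collection would serve as training data, and the held-out metrics would indicate whether the model can be trusted to code the transcripts collected in later years.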
Part B: Use institutional library access (e.g., NCSU Library), Google Scholar, or a search engine to locate a research article, presentation, or resource that applies supervised machine learning to an educational context aligned with your research interests.
- Provide an APA citation for your selected study.
- King, R. B., Wang, Y., Fu, L., et al. (2024). Identifying the top predictors of student well-being across cultures using machine learning and conventional statistics. Scientific Reports, 14, 8376. https://doi.org/10.1038/s41598-024-55461-3
I was going to use this, but it was not an application: Farina, M., Lavazza, A., Sartori, G., et al. (2024). Machine learning in human creativity: Status and perspectives. AI & Society. https://doi.org/10.1007/s00146-023-01836-5
- What research questions were the authors of this study trying to address using supervised machine learning or a similar method?
The authors examined key factors predicting adolescent students’ subjective well-being, indexed by life satisfaction, positive affect, and negative affect, using data from the PISA administration. They compared traditional vs. ML approaches to the analysis. King et al. (2024) wrote this as their aims: “(1) identify the most important predictors of students’ subjective well-being using machine learning approaches (prediction) and (2) explore how these predictors contributed to explaining variance in students’ subjective well-being using conventional statistics (explanation).”
- What were the results of these analyses?
Quoting directly from their abstract: “Among the multiple predictors examined, school belonging and sense of meaning emerged as the common predictors of the various well-being dimensions. Different well-being dimensions also had distinct predictors. Life satisfaction was best predicted by a sense of meaning, school belonging, parental support, fear of failure, and GDP per capita. Positive affect was most strongly predicted by resilience, sense of meaning, school belonging, parental support, and GDP per capita. Negative affect was most strongly predicted by fear of failure, gender, being bullied, school belonging, and sense of meaning. There was a remarkable level of cross-cultural similarity in terms of the top predictors of well-being across the globe.”
- Lastly, what value might a Computational Grounded Theory analysis have in the context of their analysis?
In the context of the study I just examined, it appears that the authors did not use CGT because they did not collect open-ended responses to items that would have allowed for the inductive development of a list of indicators of subjective well-being. Instead, they collected Likert-type scale data from theoretically based scale items. They could have restructured the items to allow written responses from test takers, used machine learning to detect patterns in the qualitative text, compared those results to the traditional set of factors and their item-specific indicators for subjective well-being, and used humans to conduct an interpretive analysis that could help refine the model. Then, they could confirm the new model’s validity on new data sets.
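As an illustration of the pattern-detection step in such a pipeline, here is a hypothetical sketch in R (tidytext plus topicmodels) of surfacing candidate themes in open-ended responses with an unsupervised topic model before any human interpretive analysis. The data frame `responses` (columns `id` and `text`) is a placeholder, not data from the study.

```r
# A hypothetical sketch of the "pattern detection" step described above.
library(tidytext)
library(topicmodels)
library(dplyr)

# Tokenize written responses, drop stop words, and count words per response
word_counts <- responses |>
  unnest_tokens(word, text) |>
  anti_join(stop_words, by = "word") |>
  count(id, word)

# Cast to a document-term matrix and fit a small LDA topic model;
# the number of topics (k) would be refined through human inspection,
# as computational grounded theory recommends
dtm <- cast_dtm(word_counts, id, word, n)
wellbeing_lda <- LDA(dtm, k = 5, control = list(seed = 123))

# Top terms per topic, handed off for human interpretive analysis
tidy(wellbeing_lda, matrix = "beta") |>
  group_by(topic) |>
  slice_max(beta, n = 10)
```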
To be honest, I’m a bit fuzzy on how ML was used in their study because I don’t understand how they could use it with the type of data they reported was available in PISA. Here is what they wrote: “Step 1: machine learning. To address the first research objective of identifying the most important predictors of students’ subjective well-being, we used a machine learning algorithm to model the three elements of subjective well-being. The scikit-learn package was used to perform five tree-based ensemble machine learning algorithms to identify the top predictors of well-being. We used different algorithms including gradient boosted decision tree (GBDT), adaptive boosting (AdaBoost), ExtraTrees (ET), RandomForest (RF), and light gradient-boost machine (LightGBM). We compared the predictive accuracy of these five algorithms and selected the best among them. Mean Square Error (MSE) was used to determine the prediction accuracy of the model. Mean Absolute Error (MAE) was used to evaluate the differences between the prediction and true value. Lower MSE and MAE values indicate a higher rate of model accuracy. The coefficient of determination (R2) explains the amount of variance in well-being accounted for by the predictors.”
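For intuition about what that step involves, here is a rough R analogue of fitting and comparing tree-based ensembles on held-out data (the authors used Python’s scikit-learn, and their actual variables differ). The data frame `pisa_data` and the outcome `life_satisfaction` are hypothetical placeholders, and RMSE stands in for the paper’s MSE.

```r
# A minimal sketch, assuming a numeric outcome and a set of predictors.
library(tidymodels)

set.seed(2024)
pisa_split <- initial_split(pisa_data)
pisa_train <- training(pisa_split)
pisa_test  <- testing(pisa_split)

# Two tree-based ensembles, analogous to the paper's RF and GBDT models
rf_fit <- rand_forest(mode = "regression") |>
  set_engine("ranger") |>
  fit(life_satisfaction ~ ., data = pisa_train)

gb_fit <- boost_tree(mode = "regression") |>
  set_engine("xgboost") |>
  fit(life_satisfaction ~ ., data = pisa_train)

# RMSE/MAE play the role of the paper's MSE/MAE; rsq is R-squared
wellbeing_metrics <- metric_set(rmse, mae, rsq)

bind_rows(
  random_forest = predict(rf_fit, pisa_test) |>
    bind_cols(pisa_test) |>
    wellbeing_metrics(truth = life_satisfaction, estimate = .pred),
  gradient_boosting = predict(gb_fit, pisa_test) |>
    bind_cols(pisa_test) |>
    wellbeing_metrics(truth = life_satisfaction, estimate = .pred),
  .id = "model"
)
```

The model with the lower error (and higher R-squared) on the held-out set would be selected, mirroring the comparison the authors describe.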
Part II: Interpret our Predictive Model
Here, we are going to interpret our predictive model in a preliminary way. Later, you will have the opportunity to dive deeper into metrics for interpreting supervised machine learning models. For now, just interpret the output of the supervised machine learning model from the case study on its face. What does this model seem to tell us? How useful is this predictive model? You may find the readings helpful for this task, particularly the Baker et al. (2023) paper.
What I am taking away from the case study is that the supervised machine learning model is useful for identifying key/top predictors and optimizing prediction accuracy. This is essentially different in purpose from explaining the role of particular predictors/indicators. The utility of machine learning is that it does a great job of accounting for the influences of messy or complex variables that may not always be operationally defined clearly. It allows the researcher to recognize their importance while still acknowledging that they may also be measured with similar, and thus collinear, variables.
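One simple way to put this kind of face-value interpretation into practice is to pull variable importance scores from a fitted tree-based model. The sketch below reuses the hypothetical `pisa_train` and `life_satisfaction` placeholders from the earlier sketch and the vip package; the resulting rankings speak to prediction, not to causal explanation.

```r
# A hypothetical sketch of reading a fitted tree-based model "on its face"
# by listing the predictors it leans on most.
library(tidymodels)
library(vip)

# Refit the random forest with importance scores turned on
rf_fit <- rand_forest(mode = "regression") |>
  set_engine("ranger", importance = "impurity") |>
  fit(life_satisfaction ~ ., data = pisa_train)

# Top predictors by importance; collinear predictors can share or split
# importance, so rankings should be interpreted cautiously
rf_fit |>
  extract_fit_engine() |>
  vip(num_features = 10)
```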
Knit and Publish
Complete the following steps to knit and publish your work:
First, change the name of the author: in the YAML header at the very top of this document to your name. The YAML header controls the style and feel of the knitted document but doesn’t actually display in the final output.
Next, click the Knit button in the toolbar above to “knit” your R Markdown document to an HTML file that will be saved in your R Project folder. You should see a formatted webpage appear in your Viewer tab in the lower-right pane or in a new browser window. Let us know if you run into any issues with knitting.
Finally, publish your webpage on Posit Cloud by clicking the “Publish” button located in the Viewer Pane after you knit your document. See screenshot below.
Receive Your Badge
To receive credit for this assignment and earn your first ML badge, share the link to your published webpage under the next incomplete badge artifact column on the LASER Scholar Information and Documents spreadsheet: https://go.ncsu.edu/laser-sheet.
Once your instructor has checked your link, you will be provided a physical version of the badge below!