Overview of the Study
Getting started: Get the data in and clean it up
- Set up R environment
- Fix up the formats
- Shorten actual values
- Data Clean-up
- Number of valid & complete observations (“cleandata” dataset)
Descriptive Statistics
Calculate scores from instruments
- Quick: Run an ANOVA against the profID field
- Report participants’ details
  - Experimenting with apaTables
  - PRACTICING WITH NICE_TABLE
Limit Dataset to Desired Participant Attributes
- Number of observations in subset to be analyzed
K-means clustering per Spanierman
- Consistency checks - Cronbach’s alpha
- Look for outliers
Examine sample for model assumptions
- Histograms on the sample variables
Analysis -
- Descriptive Statistics NEW
  - Messiness below
- Correlation matrix with values rounded to 3 decimals
- Basic scatter plots
  - RANDOM
- Basic linear regression on ACCOUNT
  - Basic linear regression without WTGUILT
  - Examining the residuals plots for original attempted model
- Identify Outliers
- Correlations using Kendall’s tau
Goodness of Fit: Examine R-squared
Non-parametric tests considered
- One sample t-test
  - Spearman’s rho
Throwing shit at the wall
- Quantile Regression
- Nonparametric regression (fancy stuff)
- GLMULTI package
- Nadaray Watson kernel function
My understanding so far
Linear regression of the white racial affects on SELFEFF
Linear regression of the white racial affects on ACCOUNT
Linear regression of ACCOUNT with just SELFEFF and WTFEAR
Variable Selection
- Sequential Estimation
Stepwise Regression
Partial Mediation Model
Robust Regression
ANOVA
- ACCOUNT and sexuality
- ACCOUNT and age group
- ACCOUNT and region
- ACCOUNT and SES (class)
- ACCOUNT and counseling experience (years practicing)
Bootstrap
- Run the LM against the boostrapped data
- Robust ??
Other types of regressions I’ve tried to run
- Generalized Linear Model
- Two-Stage Least-Squares Regression using ivreg
Statistical Mediation Analysis (invalid? apparently underpowered)
- Rough conclusions based on this output
Path Analysis (invalid for this study?? underpowered?)
- New using SEMinR
- This is the confirmed model??
  - Maybe this is the actual relationship?

Last knit on March 17, 2024 at 10:00

This is an R Markdown Notebook. It has captured my learning process on understanding data analysis and so is very rough. It contains my code, with results of executing the code beneath each chunk. Pieces of this will be converted into my Results chapter.

For my Dissertation Committee: Use the TOC links above and skip ahead to the Descriptive Statistics section, there is a lot of housekeeping code before then.

Some questions are highlighted in yellow…

3/17/24 Arcenis consult - GOALS

Discuss his recommendations for method using lavaan
My goal for today: Can I get a version of the actual diagram (scroll down a tad) but with coefficients? (Not using the 4-variable regression because that’s not capturing the mediation, it’s got all 4 variables as directly effecting the DV)
Discuss my Kendall’s tau calculation
Confirm that the “variable selection” idea from Dr. Patrick does not allow me to test the mediation effect
Look at my existing lavaan code - perhaps start here on what I believe may be a valid model, but cannot use because of underpowered?
Look at my attempts via mediation package? Which I also discovered is underpowered
If needed: Go back to my attempts at “manual” mediation analysis (“Partial Mediation Model”) per original document shared - this may need to be the way that I need to go in my dissertation since others in my field and probably my committee will object to using path analysis given underpowered study, unless I can cite the heck out of it to show that it’s legit even at my 62 observations.

Overview of the Study

This study has proposed a model that captures the relationships between three white racial affects (emotions) and antiracist self-efficacy, on the dependent variable antiracist accountability. Each of these variables has an instrument associated with it with Likert-style items. The mean of each scale is calculated in the script below, which are the IVs and DV (WTGUILT, WTFEAR, WTANGER, SELFEFF, ACCOUNT).

My full dataset is about 140 complete/valid observations, however the population of interest is counselors and that subset is 62 observations.

library(DiagrammeR)

#| label: apafg-conceptualmodel
#| apa-cap: Conceptual Model (DiagrammeR package)
#| apa-note: Conceptual model of hypothesis exploring the effect of white racial affects (white guilt, white fear, white anger) and mediator variable antiracist self-efficacy on antiracist accountability.

proposedpath <- create_graph() %>%
  add_node(label = "White Anger") %>%
  add_node(label = "White Guilt") %>%
  add_node(label = "White Fear") %>%
  add_node(label = "Antiracist Self-Efficacy") %>%
  add_node(label = "Antiracist Accountability") %>%  
  add_edge(from = 1, to = 4) %>%
  add_edge(from = 2, to = 4) %>%  
  add_edge(from = 3, to = 4) %>%  
  add_edge(from = 4, to = 5) %>%  
  set_node_attrs(
    node_attr = "shape",
    values = "square"
  ) %>%
  render_graph(layout = "tree")

proposedpath

#install.packages("psych")
library(diagram)

## Loading required package: shape

library(psych)

#| label: apafg-conceptualmodel-old
#| apa-cap: Conceptual Model
#| apa-note: Conceptual model of hypothesis exploring the effect of white racial affects (white guilt, white fear, white anger) and mediator variable antiracist self-efficacy on antiracist accountability. 

#first, show the primitives
xlim=c(-2,10)
ylim=c(0,10)
plot(NA,xlim=xlim,ylim=ylim,main="",axes=FALSE,xlab="",ylab="")
wg <- dia.rect(1,9,labels="white guilt",xlim=xlim,ylim=ylim)
wf <- dia.rect(1,5,"white fear",xlim=xlim,ylim=ylim)
wa <- dia.rect(1,1,labels="white anger",xlim=xlim,ylim=ylim)
se <- dia.rect(4,5,"antiracist\nself-efficacy",xlim=xlim,ylim=ylim)

aa <- dia.rect(9,5,"antiracist\naccountability",xlim=xlim,ylim=ylim) 

dia.arrow(from=wg$right,to=se$top,labels="")
dia.arrow(from=wf$right,to=se$left,labels="")
dia.arrow(wa$right,to=se$bottom,labels="")
#dia.arrow(from=se,to=aa,labels="")
dia.arrow(se,aa)

Getting started: Get the data in and clean it up

A variety of clean-up tasks need to be done on the dataset before analysis:

Set up the R environment with packages.
Read in the CSV file from SurveyMonkey.
The SurveyMonkey CSV was created through: Save As -> Export File -> All responses data, with options of CSV, Original View (no rules applied), Condensed, Actual Answer Text. This creates a zip file and the dataset of “Lisa Wenninger - Dissertation Research.csv” was extracted and placed in a subdirectory called “data” off the directory that this Rmd script is saved in.

Set up R environment

Set variables to be used to filter the dataset

In the R script, these options are configured to allow for different outputs of the analysis. Warning: This is brittle, all options not fully built out in code below. Can currently do all responses, counseling professionals only, or therapists only.

# Set all the below options to TRUE if limiting the analysis only to the study criteria
# If analyzing other professions, set counselingprofessionalsonly to FALSE and then configure them individually below
# counselingprofessionalsonly is counselors and counselor educators
counselingprofessionalsonly = TRUE
include.woman = TRUE
include.white = TRUE
include.alternatives = FALSE
# Note that some respondents used the open-text fields to indicate alternative answers for race and gender identity. These can either be excluded as a whole (meaning, only those who answered definitively as "white" and as "woman" in the radio button question on race and gender), or they can be conditionally included, based on code in the Data Cleanup section below. 

# reset all variables - default is to include all professions

# IMPORTANT: The counselingprofessionalsonly ABOVE flag overrides the individual settings (probably bad design)
include.counselors = TRUE
include.counseloreducators = TRUE
include.therapists = FALSE
include.socialworkers = FALSE
include.psychologists = FALSE 
# The 3 obs from psychologists are explicitly excluded below regardless of what is answered here

Load libraries

#if get kableExtra error when knitting pdf, install via devtools
library(kableExtra)
library(vtable)
library(tibble)
library(readr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following object is masked from 'package:kableExtra':
## 
##     group_rows

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyverse)

## Warning: package 'ggplot2' was built under R version 4.3.3

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ purrr     1.0.2
## ✔ ggplot2   3.5.0     ✔ stringr   1.5.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ ggplot2::%+%()      masks psych::%+%()
## ✖ ggplot2::alpha()    masks psych::alpha()
## ✖ dplyr::filter()     masks stats::filter()
## ✖ dplyr::group_rows() masks kableExtra::group_rows()
## ✖ dplyr::lag()        masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggplot2)
library(ggstats)
library(psych)
library(labelled)
library(rstatix)

## 
## Attaching package: 'rstatix'
## 
## The following object is masked from 'package:stats':
## 
##     filter

library(performance)
library(report)
library(rempsyc) # this can output "report" objects (from report package) and it gives me the apa tables I need

## Suggested APA citation: Thériault, R. (2023). rempsyc: Convenience functions for psychology. 
## Journal of Open Source Software, 8(87), 5466. https://doi.org/10.21105/joss.05466

# https://cran.r-project.org/web/packages/rempsyc/vignettes/table.html
# https://remi-theriault.com/tutorials/


library(conflicted)
library(flextable)

## Warning: package 'flextable' was built under R version 4.3.3

library(ftExtra)

## Warning: package 'ftExtra' was built under R version 4.3.3

## Registered S3 method overwritten by 'ftExtra':
##   method                  from     
##   as_flextable.data.frame flextable

library(officer)

## Warning: package 'officer' was built under R version 4.3.3

library(knitr)
conflicts_prefer(dplyr::filter, .quiet = TRUE)
conflicts_prefer(flextable::separate_header, .quiet = TRUE)


# Note: sjPlot may be a package to look into - data vis for stats in social sciences
# http://www.strengejacke.de/sjPlot/


# for power analysis - monte carlo simulations for lavaan
# abandoned, could not figure it out
#library(simsem) 

library(seminr) #also abandoned but it has potential? sits on top of lavaan

## Warning: package 'seminr' was built under R version 4.3.3

Read in CSV file

The read_csv command is also set to identify default datatypes per column.

# reset memory
cleandata <- NULL
surveymonkeyraw <- NULL

# The export from SM should be: All Responses Data, CSV, Condensed, Original View (no rules applied), Actual Answer Text. There is no need for the PageOrder.csv file

# Below command throws a parsing issue warning but it's about row 2, a second header row output by SurveyMonkey, which gets discarded

# read the SurveyMonkey export and set datatypes; read_csv creates a tibble (not a dataframe)
surveymonkeyraw <- read_csv("data\\Lisa Wenninger - Dissertation Research.csv", col_types = cols(.default = "c", 'Start Date' = "D", 'End Date' = "D", '...11' = "c", '...14' = "c", '...17' = "c"))

## New names:
## • `` -> `...11`
## • `` -> `...14`
## • `` -> `...17`
## • `` -> `...23`
## • `` -> `...24`
## • `` -> `...25`
## • `` -> `...26`
## • `` -> `...27`
## • `` -> `...28`
## • `` -> `...29`
## • `` -> `...30`
## • `` -> `...31`
## • `` -> `...32`
## • `` -> `...33`
## • `` -> `...34`
## • `` -> `...35`

## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
##   dat <- vroom(...)
##   problems(dat)

Fix up the formats

Remove the unnecessary columns added by SurveyMonkey, and rename columns to usable shortnames for purpose of these scripts.

The raw CSV file output from SurveyMonkey used the prompts from each survey item as the column header. These prompts are full sentence questions, and are quite unwieldy to use in programmatic analysis. Therefore, the columns will be renamed.

# First remove extraneous columns - these were all blank
cleandata <- select(surveymonkeyraw, -c('IP Address', 'Email Address', 'Collector ID', 'First Name', 'Last Name', 'Custom Data 1'))

#remove row 2 which is SM descriptor, not data
# This puts NA in the row
cleandata <- cleandata[-c(1), ]

# grab original SurveyMonkey column names as labels
originalsm.labels <- colnames(cleandata)

# renaming cols from long descriptions to simple var names by column index
cleandata <- cleandata %>% 
        rename(gender = 4, genderdescrip = 5, trans = 6, race = 7, raceother = 8, agecat = 9, sexuality = 10, sexualityother = 11, relationstatus = 12, region = 13, education = 14, class = 15, profID = 30, yrsprac = 31)

# rename the multiple-choice religion columns
# not currently doing any analysis on this data, maybe circle back later to output
# summary statistics on it? very low priority
cleandata <- cleandata %>% 
        rename(religprotestant = 16, religcatholic= 17, religchristian = 18, religjudaism = 19, religislam = 20, religbuddhism = 21, relighinduism = 22, relignativeamerican = 23, religpagan = 24, religwiccan = 25, religinterdenom = 26, relignotreligious = 27, religdidnotdisclose = 28, religotherspecified = 29 )


# Rename cols for each instrument
# Cols 32-47 are the 16 PCRW questions
cleandata <- cleandata %>% 
        rename(PCRWQ01 = 32, PCRWQ02 = 33, PCRWQ03 = 34, PCRWQ04 = 35, PCRWQ05 = 36, PCRWQ06 = 37, PCRWQ07 = 38, PCRWQ08 = 39, PCRWQ09 = 40, PCRWQ10 = 41, PCRWQ11 = 42, PCRWQ12 = 43, PCRWQ13 = 44, PCRWQ14 = 45, PCRWQ15 = 46, PCRWQ16 = 47)

# Cols 48-59 are the 12 WPAS questions
cleandata <- cleandata %>% 
        rename(WPASQ01 = 48, WPASQ02 = 49, WPASQ03 = 50, WPASQ04 = 51, WPASQ05 = 52, WPASQ06 = 53, WPASQ07 = 54, WPASQ08 = 55, WPASQ09 = 56, WPASQ10 = 57, WPASQ11 = 58, WPASQ12 = 59)

# Col 60 is attention check for switch of axes on A-RES after the 12 WPAS questions
cleandata <- cleandata %>% 
        rename(attentioncheck = 60)

# Cols 61-64 are the A-RES questions
cleandata <- cleandata %>% 
        rename(ARESQ01 = 61, ARESQ02 = 62, ARESQ03 = 63, ARESQ04 = 64)

# convert datatypes for instrument items from character to factor

# 6-level Likert-type values in PCRW and WPAS scales
likert_levels_6 <- c(
  "Strongly Agree",
  "Moderately Agree",
  "Slightly Agree",
  "Slightly Disagree",
  "Moderately Disagree",
  "Strongly Disagree"
)

# 4-level Likert-type values in A-RES scale
likert_levels_4 <- c(
  "Agree strongly",
  "Agree",
  "Disagree",
  "Disagree strongly"
)

# Apply 6 level factor to all items in both PCRW and WPAS
cleandata <- cleandata %>%
    mutate(across(PCRWQ01:WPASQ12, ~ factor(.x, levels = likert_levels_6)))

# Apply 4 level factor to items in ARES
cleandata <- cleandata %>%
    mutate(across(ARESQ01:ARESQ04, ~ factor(.x, levels = likert_levels_4)))

# assign the SM labels
# This has to happen after the mutate factor step which would otherwise
# wipe out the labels
cleandata <- cleandata %>%
  set_variable_labels(.labels = originalsm.labels )


vt(cleandata)

cleandata
Name	Class	Label	Values
Respondent ID	character	Respondent ID
Start Date	Date	Start Date
End Date	Date	End Date
gender	character	Gender: How do you primarily identify?
genderdescrip	character	…11
trans	character	Do you identify as trans? (optional)
race	character	Which race/ethnicity best describes you? (Please choose only one.)
raceother	character	…14
agecat	character	Which category below includes your age?
sexuality	character	What best describes how you identify your sexual or romantic interests or affiliation/orientation?
sexualityother	character	…17
relationstatus	character	Which of the following best describes your current relationship status?
region	character	In which region of the United States do you primarily live? (Or, where did you live for the longest time period, if you are not currently a U.S. resident?)
education	character	What is the highest level of school you have completed or highest degree you have received?
class	character	How would you identify your current social class?
religprotestant	character	Do you identify with any of the following religions? (Please select all that apply.)
religcatholic	character	…23
religchristian	character	…24
religjudaism	character	…25
religislam	character	…26
religbuddhism	character	…27
relighinduism	character	…28
relignativeamerican	character	…29
religpagan	character	…30
religwiccan	character	…31
religinterdenom	character	…32
relignotreligious	character	…33
religdidnotdisclose	character	…34
religotherspecified	character	…35
profID	character	What is your primary professional identity in the mental health field? If you are pre-licensed, please answer according to the field you are studying in or what license you will qualify for when training is done.
yrsprac	character	How many years have you been practicing mental health counseling? Include your pre-licensed period of training in your total.
PCRWQ01	factor	When I hear about acts of racial violence, I become angry or depressed.	‘Strongly Agree’ ‘Moderately Agree’ ‘Slightly Agree’ ‘Slightly Disagree’ ‘Moderately Disagree’ and 1 more
PCRWQ02	factor	I feel safe in most neighborhoods, regardless of the racial composition.	‘Strongly Agree’ ‘Moderately Agree’ ‘Slightly Agree’ ‘Slightly Disagree’ ‘Moderately Disagree’ and 1 more
PCRWQ03	factor	I feel helpless about not being able to eliminate racism.	‘Strongly Agree’ ‘Moderately Agree’ ‘Slightly Agree’ ‘Slightly Disagree’ ‘Moderately Disagree’ and 1 more
PCRWQ04	factor	Sometimes I feel guilty about being White.	‘Strongly Agree’ ‘Moderately Agree’ ‘Slightly Agree’ ‘Slightly Disagree’ ‘Moderately Disagree’ and 1 more
PCRWQ05	factor	I have very few friends of other races.	‘Strongly Agree’ ‘Moderately Agree’ ‘Slightly Agree’ ‘Slightly Disagree’ ‘Moderately Disagree’ and 1 more
PCRWQ06	factor	I become sad when I think about racial injustice.	‘Strongly Agree’ ‘Moderately Agree’ ‘Slightly Agree’ ‘Slightly Disagree’ ‘Moderately Disagree’ and 1 more
PCRWQ07	factor	Being White makes me feel personally responsible for racism.	‘Strongly Agree’ ‘Moderately Agree’ ‘Slightly Agree’ ‘Slightly Disagree’ ‘Moderately Disagree’ and 1 more
PCRWQ08	factor	I never feel ashamed about being White.	‘Strongly Agree’ ‘Moderately Agree’ ‘Slightly Agree’ ‘Slightly Disagree’ ‘Moderately Disagree’ and 1 more
PCRWQ09	factor	I am fearful that racial minority populations are rapidly increasing in the U.S., and my group will no longer be the numerical majority.	‘Strongly Agree’ ‘Moderately Agree’ ‘Slightly Agree’ ‘Slightly Disagree’ ‘Moderately Disagree’ and 1 more
PCRWQ10	factor	I am angry that racism exists.	‘Strongly Agree’ ‘Moderately Agree’ ‘Slightly Agree’ ‘Slightly Disagree’ ‘Moderately Disagree’ and 1 more
PCRWQ11	factor	I am distrustful of people of other races.	‘Strongly Agree’ ‘Moderately Agree’ ‘Slightly Agree’ ‘Slightly Disagree’ ‘Moderately Disagree’ and 1 more
PCRWQ12	factor	I feel good about being White.	‘Strongly Agree’ ‘Moderately Agree’ ‘Slightly Agree’ ‘Slightly Disagree’ ‘Moderately Disagree’ and 1 more
PCRWQ13	factor	I often find myself fearful of people of other races.	‘Strongly Agree’ ‘Moderately Agree’ ‘Slightly Agree’ ‘Slightly Disagree’ ‘Moderately Disagree’ and 1 more
PCRWQ14	factor	Racism is dehumanizing to people of all races, including Whites.	‘Strongly Agree’ ‘Moderately Agree’ ‘Slightly Agree’ ‘Slightly Disagree’ ‘Moderately Disagree’ and 1 more
PCRWQ15	factor	I am afraid that I abuse my power and privilege as a White person.	‘Strongly Agree’ ‘Moderately Agree’ ‘Slightly Agree’ ‘Slightly Disagree’ ‘Moderately Disagree’ and 1 more
PCRWQ16	factor	It disturbs me when people express racist views.	‘Strongly Agree’ ‘Moderately Agree’ ‘Slightly Agree’ ‘Slightly Disagree’ ‘Moderately Disagree’ and 1 more
WPASQ01	factor	I plan to work to change our unfair social structure that promotes racism.	‘Strongly Agree’ ‘Moderately Agree’ ‘Slightly Agree’ ‘Slightly Disagree’ ‘Moderately Disagree’ and 1 more
WPASQ02	factor	I take action against racism with people I know.	‘Strongly Agree’ ‘Moderately Agree’ ‘Slightly Agree’ ‘Slightly Disagree’ ‘Moderately Disagree’ and 1 more
WPASQ03	factor	I accept responsibility to change racism.	‘Strongly Agree’ ‘Moderately Agree’ ‘Slightly Agree’ ‘Slightly Disagree’ ‘Moderately Disagree’ and 1 more
WPASQ04	factor	I have not done anything about racism.	‘Strongly Agree’ ‘Moderately Agree’ ‘Slightly Agree’ ‘Slightly Disagree’ ‘Moderately Disagree’ and 1 more
WPASQ05	factor	I look forward to creating a more racially equitable society.	‘Strongly Agree’ ‘Moderately Agree’ ‘Slightly Agree’ ‘Slightly Disagree’ ‘Moderately Disagree’ and 1 more
WPASQ06	factor	I intend to work toward dismantling racism.	‘Strongly Agree’ ‘Moderately Agree’ ‘Slightly Agree’ ‘Slightly Disagree’ ‘Moderately Disagree’ and 1 more
WPASQ07	factor	I don’t care to explore how I supposedly have unearned benefits from being White.	‘Strongly Agree’ ‘Moderately Agree’ ‘Slightly Agree’ ‘Slightly Disagree’ ‘Moderately Disagree’ and 1 more
WPASQ08	factor	I am curious about how to communicate effectively to break down racism.	‘Strongly Agree’ ‘Moderately Agree’ ‘Slightly Agree’ ‘Slightly Disagree’ ‘Moderately Disagree’ and 1 more
WPASQ09	factor	I’m glad to explore my racism.	‘Strongly Agree’ ‘Moderately Agree’ ‘Slightly Agree’ ‘Slightly Disagree’ ‘Moderately Disagree’ and 1 more
WPASQ10	factor	I want to begin the process of eliminating racism.	‘Strongly Agree’ ‘Moderately Agree’ ‘Slightly Agree’ ‘Slightly Disagree’ ‘Moderately Disagree’ and 1 more
WPASQ11	factor	I take action to dismantle racism.	‘Strongly Agree’ ‘Moderately Agree’ ‘Slightly Agree’ ‘Slightly Disagree’ ‘Moderately Disagree’ and 1 more
WPASQ12	factor	I am eager to find out more about letting go of racism.	‘Strongly Agree’ ‘Moderately Agree’ ‘Slightly Agree’ ‘Slightly Disagree’ ‘Moderately Disagree’ and 1 more
attentioncheck	character	Please respond to this item by marking the option “Agree strongly.”
ARESQ01	factor	People like you can influence the way racism affects others.	‘Agree strongly’ ‘Agree’ ‘Disagree’ ‘Disagree strongly’
ARESQ02	factor	You consider yourself well qualified to participate in movements related to racism.	‘Agree strongly’ ‘Agree’ ‘Disagree’ ‘Disagree strongly’
ARESQ03	factor	You have a pretty good understanding of the important issues facing our country around racism.	‘Agree strongly’ ‘Agree’ ‘Disagree’ ‘Disagree strongly’
ARESQ04	factor	People like you don’t have any say about racism.	‘Agree strongly’ ‘Agree’ ‘Disagree’ ‘Disagree strongly’

Make the dataset a tad easier to parse - this is intended as temporary code, eval=false later. It also means that index-level column references WILL NOT WORK below this point.

cleandata <- cleandata %>%
  select(-starts_with('relig'))

cleandata <- cleandata %>%
  select(-contains('date'))

Shorten actual values

The options offered to respondents in the SurveyMonkey form were descriptive and often long (such as “White / European American”). This step replaces the long version with shorter versions (such as “white”) for improved readability in charts/graphs output.

library(dplyr)

# recode values in rows

cleandata <- cleandata %>% 
  mutate(gender = recode(gender, 'Woman or female' = 'woman',
        'Non-binary or gender nonconforming' = 'nonbinary',
        'Man or male' = 'man',
        'Prefer to self-describe\\, below' = 'self-describe'))

cleandata <- cleandata %>% 
  mutate(relationstatus = recode(relationstatus, "Cohabitating with a significant other or in a domestic partnership" = "Domestic Partner",
        "Single, never married, currently unpartnered" = "Single",
        "Dating one or more partners (not cohabitating)" = "Dating",
        "Skip/Prefer not to say" = "Did not disclose"))

cleandata <- cleandata %>% 
  mutate(region = recode(region, "South (AL, AK, DC, DE, FL, GA, KY, LA, MD, MS, NC, OK, SC, TN, TX, VA, WV)" = "South",
        "West (AK, AZ, CA, CO, HI, ID, MT, NV, NM, OR, UT, WA, WY)" = "West",
        "Midwest (IL, IN, IA, KS, MI, MN, MO, NE, ND, OH, SD, WI)" = "Midwest",
        "Northeast (CT, ME, MA, NH, NJ, NY, PA, RI, VT)" = "Northeast",
#        "U.S. territory, U.S. military base, etc (American Samoa, Guam, PR, US-VI, etc.)" = "US territory"
        ))

cleandata <- cleandata %>% 
  mutate(profID = recode(profID, "Counselor (including Clinical Counselor, Career Counselor, Rehab Counselor, School Counselor, Mental Health Counselor, etc)" = "counselor", 
        "Counselor Educator, teaching in a counselor education program at either the master's or doctoral level" = "counselor educator", 
        "Therapist (including Marriage and Family Therapist, Mental Health Therapist, etc)" = "therapist",
        "Social Worker (including Clinical Social Worker, Master Social Worker, etc)" = "social worker",
        "Psychologist (including Clinical Psychologist)" = "psychologist",
        "Psychiatrist (including MD, DO, etc)" = "psychiatrist",
        "other allied mental health professional (including RN, PA, etc)" = "other",
        "none of these, I am not a mental health professional" = 'none' ))

Data Clean-up

The raw dataset needs to be examined to be sure that invalid and incomplete responses are excluded from the analysis, and to confirm that the resulting subset of data meets parametric assumptions (Flamez et al., 2017) before analysis is conducted.

Exclude incomplete and invalid responses

Remove any incomplete rows, where the participant exited the study before completing it, or entered a disqualifying response on the gender or race questions.

Remove responses that failed the attention check that preceded the last instrument. The reason for the attention check was that the last instrument’s scale is flipped compared to the previous 2 instruments (“strongly agree” on the left of the screen instead of on the right).

attncheckfails = 0
incompletes = 0
numrawobs = 0
observations = 0

#preserve the full dataset in case it's needed
fulldata <- cleandata

#continue with operations on cleandata as the baseline of actual data to be used for analysis

# 1) Cull incompletes
# Check for final question in the whole survey being blank
# Later analysis: Where did people drop off? 
incompletes <- cleandata %>% 
  filter(
    (is.na(ARESQ04) | ARESQ04 == "")
  )

cleandata <- cleandata %>% 
  filter(
    (!is.na(ARESQ04) | ARESQ04 != "")
  )

# 2) Failed attention check 
attncheckfails <- cleandata %>% 
  filter(
    (attentioncheck != 'Agree strongly')
  )

cleandata <- cleandata %>% 
  filter(
    (attentioncheck == 'Agree strongly')
  )

numrawobs = nrow(fulldata)
numcleanobs = nrow(cleandata)

Examine alternative fill-in answers for inclusion/exclusion

The “gender” and “race” questions allowed participants to self-identify in alternative ways. Some of these answers may indicate a qualifying response, some will indicate disqualifying. These questions were included in order to make the research inclusive and supportive of identities.

The code below was developed after manually inspecting these fields and identifying values that can be included or should be excluded. If more observations are gathered, then those subsequent records would need to also be manually inspected.

For a strict interpretation of respondent answers, use the flag at top of the notebook to turn include.alternatives to FALSE.

# A value was only entered in the raceother column if the respondent chose Altnerative for the 
# Race question
altrace = NULL

#This entire section will be skipped if include.alternatives flag set at top is false
if (include.alternatives == TRUE) {
  
# First display the data in these alternative fields

altrace  <- cleandata %>% 
  select(raceother) %>%
    filter((!is.na(raceother) | raceother != "")
    )

# Based on manual inspection, exclude the "White & Asian" answer, mixed or biracial does not meet study criteria   
cleandata <- cleandata %>% 
   filter((!str_detect(raceother, 'Asian')) | is.na(raceother))

cleandata <- cleandata %>% 
  mutate(race = recode(race, "Alternate or self-identified answer:" = "Other White"))

numcleanobs = nrow(cleandata)
}

### BELOW NOT WORKING
### New to fix the sexuality fields
cleandata <- cleandata %>% 
  mutate(
    sexualty = case_when(
      sexualityother == "Asexual/panromantic" ~ "Asexual",
      sexualityother == "Demi-pansexual" ~ "Pansexual"
    )
  )

Values from the fill-in “other race” field in the currently selected data subset are:

The respondents who indicated “Jewish” as part of their alternative answer are included in the study, as long as that is the sole explanatory answer.

The respondent who indicated “Asian” as part of their alternative answer is excluded.

For gender identity, the SurveyMonkey survey was set up to disqualify all answers except for “woman or female.”

None of the respondents who answered “woman or female” also answered “Yes” to the question asking if they identify as trans. One respondent answered “Non-binary or gender nonconforming” to the gender question, then answered “Yes” to trans, however they did not complete the instruments due to the “non-binary” answer for gender. The survey was set up to allow a respondent to answer “Yes” for the trans question, but they would have had to self-identify as “woman or female” in order to proceed with the survey instruments.

Remove psychologists

# Save the records
psychobs <- cleandata %>% 
  filter(
    (profID == 'psychologist')
  )

cleandata <- cleandata %>% 
  filter(
    (profID != 'psychologist')
  )

numcleanobs = nrow(cleandata)
numpsychobs = nrow(psychobs)

The full dataset included 3 observations from respondents who identify primarily as psychologists.

Psychologists have different training and hold a different professional identity, and were not targeted as part of this study. In addition,3 is a very small number and does not allow for meaningful analysis or comparisons to other groups. Therefore observations from respondents identifying as psychologists as primary professional identity will be removed.

Number of valid & complete observations (“cleandata” dataset)

The original dataset from SurveyMonkey had 149 rows (not including 2 header rows).

14 respondents entered disqualifying identity responses or quit before completing the survey and their partial entries are excluded.

3 observations failed the attention check and are excluded.

3 respondents identified as psychologists and are excluded.

The dataset is now at 129 rows.

Descriptive Statistics

Run some statistics on the data that remains in order to check assumptions in advance of parametric analysis.

df_gender <- cleandata %>%
  group_by(gender) %>%
  summarise(counts = n())
df_gender

# ggplot(df_gender, aes(x=gender, y=counts)) +
#   geom_bar(fill = "#0073C2FF", stat="identity") +
#   theme_minimal()

Note: The raw dataset did include several respondents who self-identified as non-binary, however that was a disqualifying response (along with self-identifying as male). Those partial responses were excluded.

By contrast, one respondent left the yes/no question asking if trans blank, and gave a description of their identity. Since they self-identified as female in the main gender question, their response is included.

If on the gender or race question, they selected “Other (describe)” then their response was not automatically excluded by SurveyMonkey, but instead was manually examined to determine if the self-identification description was sufficient to meet the study criteria or not. This needs to be done outside this script (unless there’s a way to create a prompt with a yes/no answer during the knit process?). For now, such answers are being excluded further down the script; this section is only highlighting that they exist in the pared-down dataset.

df_trans <- cleandata %>%
  group_by(trans) %>%
  summarise(counts = n())
df_trans

ggplot(df_trans, aes(x=trans, y=counts)) +
  geom_bar(fill = "#0073C2FF", stat="identity") +
  theme_minimal()

df_race <- cleandata %>%
  group_by(race) %>%
  summarise(counts = n())
df_race

ggplot(df_race, aes(x=race, y=counts)) +
  geom_bar(fill = "#0073C2FF", stat="identity") +
  theme_minimal()

df_age <- cleandata %>%
  group_by(agecat) %>%
  summarise(counts = n())
df_age

ggplot(df_age, aes(x=agecat, y=counts)) +
  geom_bar(fill = "#0073C2FF", stat="identity") +
  theme_minimal()

df_profID <- cleandata %>%
  group_by(profID) %>%
  summarise(counts = n())
df_profID

ggplot(df_profID, aes(x=profID, y=counts)) +
  geom_bar(fill = "#0073C2FF", stat="identity") +
  theme_minimal()

df_yrsprac <- cleandata %>%
  group_by(yrsprac) %>%
  summarise(counts = n())
df_yrsprac

ggplot(df_yrsprac, aes(x=yrsprac, y=counts)) +
  geom_bar(fill = "#0073C2FF", stat="identity") +
  theme_minimal()

df_region <- cleandata %>%
  group_by(region) %>%
  summarise(counts = n())
df_region

ggplot(df_region, aes(x=region, y=counts)) +
  geom_bar(fill = "#0073C2FF", stat="identity") +
  theme_minimal()

df_relationstatus <- cleandata %>%
  group_by(relationstatus) %>%
  summarise(counts = n())
df_relationstatus

ggplot(df_relationstatus, aes(x=relationstatus, y=counts)) +
  geom_bar(fill = "#0073C2FF", stat="identity") +
  theme_minimal()

Calculate scores from instruments

(this is the section that got moved to earlier in the script, had to rename coredata to cleandata)

Now we will score each respondent’s answers on the three instruments.

Note: The SELFEFF A-RES Self-Efficacy score for antiracist action is scaled such that LOWER numbers mean higher levels of self-efficacy. From the original paper: “Items were answered using a 4-point, fully-anchored, Likert-type scale ranging from Agree strongly (1) to Disagree strongly (4). Items were rescored so a higher number would indicate a more positively-valanced response.” Because of this, the items were similarly rescored here.

#### Recode the Likert responses ####
## from https://towardsdatascience.com/working-with-survey-data-clean-and-visualize-likert-scale-questions-in-r-6a78e3b9c7b2

## USED FOR PCRW & WPAS

likert_recode <- function(x) {
  as.numeric(case_when(
    x == "Strongly Disagree" ~ 1,
    x == "Moderately Disagree" ~ 2,
    x == "Slightly Disagree" ~ 3,
    x == "Slightly Agree" ~ 4,
    x == "Moderately Agree" ~ 5,
    x == "Strongly Agree" ~ 6,    
  ))
}

# Negative Recode 
likert_recode_negative <- function(x) {
  as.numeric(case_when(
    x == "Strongly Disagree" ~ 6,
    x == "Moderately Disagree" ~ 5,
    x == "Slightly Disagree" ~ 4,
    x == "Slightly Agree" ~ 3,
    x == "Moderately Agree" ~ 2,
    x == "Strongly Agree" ~ 1, 
  ))
}

#######
# Used for A-RES
# See note above from original published paper saying rescored
likert_recode_4 <- function(x) {
  as.numeric(case_when(
    x == "Agree strongly" ~ 4,
    x == "Agree" ~ 3,
    x == "Disagree" ~ 2,
    x == "Disagree strongly" ~ 1,
  ))
}

# Negative Recode 
likert_recode_negative_4 <- function(x) {
  as.numeric(case_when(
    x == "Agree strongly" ~ 1,
    x == "Agree" ~ 2,
    x == "Disagree" ~ 3,
    x == "Disagree strongly" ~ 4,
  ))
}


## PCRW
# This is the Respondent ID col1, plus the PCRW questions 01-16 columns
#pcrw_subset <- cleandata[ , c(1,PCRWQ01:PCRWQ16)]
pcrw_subset <- cleandata %>%
  select("Respondent ID", PCRWQ01:PCRWQ16)

# Questions 2, 8, and 12 are reverse-scored
# "Higher scores should indicate greater levels of costs of racism. There is no total score, so please use only subscale scores."

# Apply the recoding
pcrw_positive <- pcrw_subset %>%
  select(-'Respondent ID', -PCRWQ02, -PCRWQ08, -PCRWQ12) %>%
  mutate_all(likert_recode)

pcrw_negative <- pcrw_subset %>%
  select(PCRWQ02, PCRWQ08, PCRWQ12) %>%
  mutate_all(likert_recode_negative)

#Recombine data - but this just merges them without putting them in the correct order by Q#
pcrw_subset_recoded <- cbind(pcrw_positive, pcrw_negative)
#pcrw_subset_recoded  %>% select(order(colnames(pcrw_subset_recoded )))

pcrw_subset_recoded <- pcrw_subset_recoded %>%
  select(sort(tidyselect::peek_vars()))


# pcrw_wtanger_items <- pcrw_subset_recoded[ , c(1,3,6,10,14,16)]
# pcrw_wtguilt_items <- pcrw_subset_recoded[ , c(4,7,8,12,15)]
# pcrw_wtfear_items <- pcrw_subset_recoded[ , c(2,5,9,11,13)]


#### Calculate Totals ####
# Subscale Totals
#cleandata$pcrwsub1total <- rowSums(pcrw_subset_recoded) #Create column of Subscale 1 total per person

# White Empathy aka WTANGER in my study
#cleandata$WTANGER <- rowSums(pcrw_subset_recoded[ , c(1,3,6,10,14,16)], na.rm=TRUE) 
#cleandata$WTANGER <- rowMeans(pcrw_subset_recoded[ , c(1,3,6,10,14,16)], na.rm=TRUE)
cleandata$WTANGER <- rowMeans(pcrw_subset_recoded[c("PCRWQ01", "PCRWQ03", "PCRWQ06", "PCRWQ10", "PCRWQ14", "PCRWQ16")], na.rm=TRUE)


#Create column of PCRW Subscale 1 total per respondent

# White Guilt aka WTGUILT in my study
#cleandata$WTGUILT <- rowSums(pcrw_subset_recoded[ , c(4,7,8,12,15)], na.rm=TRUE) 
#cleandata$WTGUILT <- rowMeans(pcrw_subset_recoded[ , c(4,7,8,12,15)], na.rm=TRUE) 
cleandata$WTGUILT <- rowMeans(pcrw_subset_recoded[c("PCRWQ04", "PCRWQ07", "PCRWQ08", "PCRWQ12", "PCRWQ15")], na.rm=TRUE)
#Create column of PCRW Subscale 2 total per respondent



# White Fear aka WTFEAR in my study
#cleandata$WTFAR <- rowSums(pcrw_subset_recoded[ , c(2,5,9,11,13)], na.rm=TRUE) 
#cleandata$WTFEAR <- rowMeans(pcrw_subset_recoded[ , c(2,5,9,11,13)], na.rm=TRUE) #Create column of PCRW 
cleandata$WTFEAR <- rowMeans(pcrw_subset_recoded[c("PCRWQ02", "PCRWQ05", "PCRWQ09", "PCRWQ11", "PCRWQ13")], na.rm=TRUE)



# don't know if I need this?
#pcrw_subset_recoded <- cbind(id = pcrw_subset$'Respondent ID', pcrw_subset_recoded) #Recombine data with Respondent IDs

## A-RES: COMPETENCE 
# This is the Respondent ID col1, plus the A-RES questions 01-02 columns
# Only the Competence subscale is being used in this study

#aresc_subset <- cleandata[ , c(1,61:62)]
aresc_subset <- cleandata %>%
  select('Respondent ID', ARESQ01:ARESQ02) 


# This instrument is scaled opposite to the others

# Apply the recoding
aresc_subset <- aresc_subset %>%
  select(-'Respondent ID') %>%
  mutate_all(likert_recode_4)
# no negative-scored items in the Competence subscale



## A-RES: IMPACT 
# This is the Respondent ID col1, plus the A-RES Impact questions 03-04 columns
#aresi_subset <- cleandata[ , c(1,63:64)]
aresi_subset <- cleandata %>%
  select('Respondent ID', ARESQ03:ARESQ04) 

# This instrument is scaled opposite to the others
# Question 4 is reverse-scored

# Apply the recoding
aresi_positive <- aresi_subset %>%
  select(-'Respondent ID', -ARESQ04) %>%
  mutate_all(likert_recode_4)

aresi_negative <- aresi_subset %>%
  select(ARESQ04) %>%
  mutate_all(likert_recode_negative_4)

#Recombine data
aresi_subset_recoded <- cbind(aresi_positive, aresi_negative)
###NOT USING THIS 'IMPACT' VARIABLE - NOT RECORDING IT INTO THE OUTPUT

head(aresi_subset_recoded)

#### Calculate Totals ####
# Subscale Totals
#cleandata$SELFEFF <- rowSums(aresc_subset) 
cleandata$SELFEFF <- rowMeans(aresc_subset) 
#Create column of subscale COMPETENCE total per person

##############-----------------------

## WPAS 
# Dependent (Outcome) Variable
# This is the Respondent ID col1, plus the WPAS questions 01-12 columns
#wpas_subset <- cleandata[ , c(1,48:59)]
wpas_subset <- cleandata %>%
  select('Respondent ID', WPASQ01:WPASQ12) 


# This study uses Subscale 1 only, Confronting White Privilege, renamed to Confronting Racism

# Questions 4 and 7 are reverse-scored
# "Higher scores correspond with higher levels of acknowledgement of [racism]."

# Apply the recoding
wpas_positive <- wpas_subset %>%
  select(-'Respondent ID', -WPASQ04, -WPASQ07) %>%
  mutate_all(likert_recode)

wpas_negative <- wpas_subset %>%
  select(WPASQ04, WPASQ07) %>%
  mutate_all(likert_recode_negative)

#Recombine data
wpas_subset_recoded <- cbind(wpas_positive, wpas_negative)

#### Calculate Totals ####
# Subscale Totals
#cleandata$ACCOUNT <- rowSums(wpas_subset_recoded) 
cleandata$ACCOUNT <- rowMeans(wpas_subset_recoded) 
#Create column of Subscale 1 total per person

# Set labels on the data
cleandata <- cleandata %>%
  set_variable_labels(WTANGER = "PCRW White Anger", WTGUILT = "PCRW White Guilt", WTFEAR = "PCRW White Fear", SELFEFF = "A-RES Antiracist Self Efficacy", ACCOUNT = "WPAS Antiracist Accountability")

Quick: Run an ANOVA against the profID field

This may be useful for a future study on the same dataset, looking at differences between the different professional identities.

#do the analysis of variance
aov.prof = aov(ACCOUNT~profID,data = cleandata)  
#show the summary table
summary(aov.prof)

##              Df Sum Sq Mean Sq F value Pr(>F)
## profID        3   0.61  0.2029   0.603  0.614
## Residuals   125  42.07  0.3366

print(model.tables(aov.prof,"means"),digits=3)

## Tables of means
## Grand mean
##          
## 5.393411 
## 
##  profID 
##     counselor counselor educator social worker therapist
##          5.45               5.48          5.24      5.35
## rep     55.00              11.00         10.00     53.00

    #report the means and the number of subjects/cell
boxplot(ACCOUNT~profID,data = cleandata)

    #graphical summary appears in graphics window

Report participants’ details

# Using package report


# gender is col 4, omitting it as all are women
# spell_n is about output numbers as words vs digits
paste(
  report_participants(data = subset(cleandata[ c(7,9,10,12, 13, 14, 15)], age = agecat, group = region, spell_n = TRUE),
  "were recruited in the study by means of torture and coercion.")
)

## Warning: In subset.data.frame(cleandata[c(7, 9, 10, 12, 13, 14, 15)], age = agecat, 
##     group = region, spell_n = TRUE) :
##  extra arguments 'age', 'group', 'spell_n' will be disregarded

## [1] "129 participants (Education: Bachelor degree, 7.75%; Doctorate degree, 14.73%; Master's degree, 76.74%; Skip/Prefer not to say, 0.78%)"

# <!--
# 
# The dataset was comprised of `r report_participants(cleandata, age = agecat, group = region, spell_n = TRUE)`. 
# 
# -->

report_sample(cleandata, group_by = "profID", select = c("WTANGER", "WTGUILT", "WTFEAR", "SELFEFF", "ACCOUNT"))

library(apaTables)
table1 <- apa.cor.table(cleandata, 
                        table.number=1)

print(table1)

## 
## 
## Table 1 
## 
## Descriptive Statistics and Correlations
##  
## 
##   Variable   N   M    SD   1           2           3            4         
##   1. WTANGER 129 5.32 0.52                                                
##                                                                           
##                                                                           
##                                                                           
##   2. WTGUILT 129 3.81 1.07 .21*                                           
##                            [.04, .37]                                     
##                            p = .017                                       
##                                                                           
##   3. WTFEAR  129 2.07 0.66 .04         .19*                               
##                            [-.13, .21] [.02, .35]                         
##                            p = .660    p = .031                           
##                                                                           
##   4. SELFEFF 129 3.33 0.46 .11         .01         -.37**                 
##                            [-.07, .28] [-.16, .18] [-.51, -.21]           
##                            p = .223    p = .896    p < .001               
##                                                                           
##   5. ACCOUNT 129 5.39 0.58 .30**       .26**       -.33**       .53**     
##                            [.13, .45]  [.09, .42]  [-.47, -.16] [.39, .64]
##                            p < .001    p = .003    p < .001     p < .001  
##                                                                           
## 
## Note. N = number of cases. M = mean. SD = standard deviation.
## Values in square brackets indicate the 95% confidence interval.
##  * indicates p < .05. ** indicates p < .01.
##

# apa.save(filename = "table1.doc",
#          table1)

#apa.knit.table.for.pdf(table1)

Experimenting with apaTables

Here is a correlations table

cleandata %>%
  select(WTANGER, WTGUILT, WTFEAR, SELFEFF, ACCOUNT) %>%
  apa.cor.table()

## 
## 
## Table 0 
## 
## Descriptive Statistics and Correlations
##  
## 
##   Variable   N   M    SD   1           2           3            4         
##   1. WTANGER 129 5.32 0.52                                                
##                                                                           
##                                                                           
##                                                                           
##   2. WTGUILT 129 3.81 1.07 .21*                                           
##                            [.04, .37]                                     
##                            p = .017                                       
##                                                                           
##   3. WTFEAR  129 2.07 0.66 .04         .19*                               
##                            [-.13, .21] [.02, .35]                         
##                            p = .660    p = .031                           
##                                                                           
##   4. SELFEFF 129 3.33 0.46 .11         .01         -.37**                 
##                            [-.07, .28] [-.16, .18] [-.51, -.21]           
##                            p = .223    p = .896    p < .001               
##                                                                           
##   5. ACCOUNT 129 5.39 0.58 .30**       .26**       -.33**       .53**     
##                            [.13, .45]  [.09, .42]  [-.47, -.16] [.39, .64]
##                            p < .001    p = .003    p < .001     p < .001  
##                                                                           
## 
## Note. N = number of cases. M = mean. SD = standard deviation.
## Values in square brackets indicate the 95% confidence interval.
##  * indicates p < .05. ** indicates p < .01.
##

PRACTICING WITH NICE_TABLE

# code from https://github.com/rempsyc/rempsyc/issues/9 

conflicts_prefer(dplyr::lag)

## [conflicted] Will prefer dplyr::lag over any other package.

cleandata %>% 
  pivot_longer(cols = c(agecat, sexuality, relationstatus, education, class, profID, yrsprac), names_to = "Characteristics", values_to = "Category") %>% 
  count(Characteristics, Category, name = "Frequency") %>% 
  arrange(factor(Characteristics, levels = c("Age", "Sexual Identity", "Relationship Status", "Highest Education Completed", "SES", "Professional Identity", "Years Practicing"))) %>% 
  mutate(Percentage = prop.table(Frequency) * 100,
         Category = as.character(Category),
         Characteristics = ifelse(Characteristics == lag(
           Characteristics, default = ""), "", Characteristics)) %>%
 # nice_table(title = c("Table 1", "Sample demographic characteristics (N = 877)"))  %>%
  as_flextable() %>%
  save_as_image(path = "sampledemographics2.png")

## Warning: ftExtra:::as_flextable.data.frame is deprecated and will be removed in
## the future release. Consider using flextalbe's implementation by running
## `.S3method("as_flextable", "data.frame", flextable:::as_flextable.data.frame)`

## [1] "sampledemographics2.png"

Limit Dataset to Desired Participant Attributes

Next we will choose which responses to filter out. Choices coded here support the study criteria: self identified white, self identified woman, self-identified counselors. These can be reconfigured using the flags at the top of the script and the analysis re-run accordingly. (Possible to do later: Set up parameterized reporting)

## LIMIT THE DATASET

# This data will be reduced to a subset called coredata for this configured analysis
coredata <- cleandata

# Next, remove disqualifying responses
# Use flags at top of the script to turn these on or off
# 1) Not self-identifying as woman 

if(include.woman == TRUE) {
  coredata <- coredata %>% 
    filter(
      (gender == 'woman')
    )
}
print(nrow(coredata))

## [1] 129

# 2) Not self-identifying solely as white 
if(include.white == TRUE && include.alternatives == FALSE) {
  coredata <- coredata %>% 
    filter(
      (race == 'White / European-American')
    )
} else if(include.white == TRUE && include.alternatives == TRUE) {
  coredata <- coredata %>% 
    filter(
      (race == 'White / European-American' | race == 'Other White')
    )
}



# 3) Not identifying as counselor and/or counselor educator
if(counselingprofessionalsonly == TRUE) {
  coredata <- coredata %>%
    filter(
      (profID == 'counselor' | profID == 'counselor educator')
    )
} else {
# pull in only the ones flagged at top of script  
   if(include.counselors != TRUE) {
    coredata <- coredata %>%
       filter(
         (profID != 'counselor')
       )
   }   
   if(include.counseloreducators != TRUE) {
    coredata <- coredata %>%
       filter(
         (profID != 'counselor educator')
       )
   }    
  if(include.therapists != TRUE) {
    coredata <- coredata %>%
       filter(
         (profID != 'therapist')
       )
   }

   if(include.psychologists != TRUE) {
     coredata <- coredata %>%
       filter(
         (profID != 'psychologist')
       )
   }
   if(include.socialworkers != TRUE) {
     coredata <- coredata %>%
       filter(
         (profID != 'social worker')
       )
   }
}
  
observations = nrow(coredata)
print(observations)

## [1] 64

Number of observations in subset to be analyzed

For the remainder of this output, the number of observations included in the analysis is: 64

This current dataset includes:

only self-identified women = TRUE
only self-identified white = TRUE
only counselors and counselor educators = TRUE OR
- counselors = TRUE
- counselor educators = TRUE
- therapists = FALSE
- social workers = FALSE
- psychologists = FALSE

df_trans2 <- coredata %>%
  group_by(trans) %>%
  summarise(counts = n())
df_trans2

ggplot(df_trans2, aes(x=trans, y=counts)) +
  geom_bar(fill = "#0073C2FF", stat="identity") +
  theme_minimal()

df_race2 <- coredata %>%
  group_by(race) %>%
  summarise(counts = n())
df_race2

ggplot(df_race2, aes(x=race, y=counts)) +
  geom_bar(fill = "#0073C2FF", stat="identity") +
  theme_minimal()

df_age2 <- coredata %>%
  group_by(agecat) %>%
  summarise(counts = n())
df_age2

ggplot(df_age2, aes(x=agecat, y=counts)) +
  geom_bar(fill = "#0073C2FF", stat="identity") +
  theme_minimal()

df_yrsprac2 <- coredata %>%
  group_by(yrsprac) %>%
  summarise(counts = n())
df_yrsprac2

ggplot(df_yrsprac2, aes(x=yrsprac, y=counts)) +
  geom_bar(fill = "#0073C2FF", stat="identity") +
  theme_minimal()

df_region2 <- coredata %>%
  group_by(region) %>%
  summarise(counts = n())
df_region2

ggplot(df_region2, aes(x=region, y=counts)) +
  geom_bar(fill = "#0073C2FF", stat="identity") +
  theme_minimal()

df_relationstatus2 <- coredata %>%
  group_by(relationstatus) %>%
  summarise(counts = n())
df_relationstatus2

ggplot(df_relationstatus2, aes(x=relationstatus, y=counts)) +
  geom_bar(fill = "#0073C2FF", stat="identity") +
  theme_minimal()

K-means clustering per Spanierman

They found 5 clusters in Spanierman et al. (2006).

The below analysis is done to follow the standard set by Soble et al. (2011) in how they did their analysis of the PCRW instrument as part of their study. It is intended to check whether the current sample matches the characteristics identified from the sample used in development of the PCRW.

From Spanierman et al. (2012): “Similar to previous studies (e.g., Spanierman et al. 2006), a cluster group’s subscale score is considered “high” if it falls within the upper one third of the overall distribution of sample subscale scores. A “moderate” subscale score is one that falls within the middle third of the overall distribution of subscale scores, and a subscale score that falls within the bottom third of the distribution is con- sidered “low” (Spanierman et al. 2006). Although not iden- tical, the five cluster groups from initial studies and the present study displayed similar overall mean patterns of White empathy, White guilt, and White fear. We therefore continued our investigation to examine differences in cluster types by gender.”

“To perform k-means clustering in R we can use the built-in kmeans() function”.

library(factoextra)

## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

library(cluster)

# from https://www.statology.org/k-means-clustering-in-r/ 

# 1. Number of Clusters vs. the Total Within Sum of Squares

matrixdata <- select(coredata, c('WTANGER':'WTFEAR'))
fviz_nbclust(matrixdata, kmeans, method = "wss")

# Perform K-Means Clustering with Optimal K
#make this example reproducible
set.seed(1)

# this line frmo https://stackoverflow.com/questions/21858124/r-error-in-glmnet-na-nan-inf-in-foreign-function-call

# this would work on the individual PCRW items
#xsub <- model.matrix( ~ ., cleandata[ , c(32:47)])

# this works on just the raw score means (i.e., scale score divided by number of items in scale) for each subscale) - this is done in @spanierman2006
#xsub <- model.matrix( ~ ., cleandata[ , c(65:67)])

xsub <- model.matrix( ~ ., matrixdata) 




#perform k-means clustering with k = 5 clusters on the PCRW items only
km <- kmeans(xsub, centers = 5, nstart = 25)

#view results
km

## K-means clustering with 5 clusters of sizes 6, 4, 26, 16, 12
## 
## Cluster means:
##   (Intercept)  WTANGER  WTGUILT   WTFEAR
## 1           1 4.472222 4.766667 2.133333
## 2           1 5.291667 3.350000 3.400000
## 3           1 5.416667 3.523077 1.669231
## 4           1 5.604167 4.887500 1.887500
## 5           1 5.013889 2.000000 1.783333
## 
## Clustering vector:
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 
##  4  3  3  1  3  3  3  5  1  3  3  2  3  4  3  3  5  5  1  5  3  5  5  5  4  4 
## 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 
##  5  4  3  4  3  3  4  3  4  2  4  5  3  2  3  3  3  5  4  5  1  4  4  3  3  1 
## 53 54 55 56 57 58 59 60 61 62 63 64 
##  3  3  5  3  4  1  2  4  3  3  4  4 
## 
## Within cluster sum of squares by cluster:
## [1]  2.665370  3.541944 10.957650  8.170278 11.868796
##  (between_SS / total_SS =  69.1 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

# WTANGER
quantile(coredata$WTANGER, probs = seq(0, 1, 1/3))

##        0% 33.33333% 66.66667%      100% 
##  3.500000  5.166667  5.500000  6.000000

# WTGUILT
quantile(coredata$WTGUILT, probs = seq(0, 1, 1/3))

##        0% 33.33333% 66.66667%      100% 
##       1.2       3.2       4.4       5.8

#WTFEAR
quantile(coredata$WTFEAR, probs = seq(0, 1, 1/3))

##        0% 33.33333% 66.66667%      100% 
##       1.0       1.6       2.0       4.0

Cluster means: ## WTANGER WTGUILT WTFEAR ## 1 Insensitive and afraid (4) 5.291667 = moderate 3.350000 = moderate 3.400000 = high ## 2 Unempathic and unaware (12) 5.013889 = low 2.000000 = low 1.783333 = low ## 3 Empathic but unaccountable. (26) 5.429487 = moderate 3.476923 = moderate 1.684615 = low ## 4 Fearful guilt (6) 4.472222 = low 4.766667 = high 2.133333 = high ## 5 Informed empathy and guilt. (17) 5.588235 = high 4.847059 = high 1.882353 = moderate

Consistency checks - Cronbach’s alpha

Coefficient alpha is a measure of internal consistency, or how closely related a set of items are as a group.

NO - SKIP THIS Another measure of internal consistency promoted by the psych package authors is McDonald’s coefficient omega

See https://www.personality-project.org/r/psych/HowTo/omega.pdf linked from https://www.personality-project.org/r/psych/

Omega output by psych package (SKIP)

Find omega on the A-RES scale

omega(wpas_subset_recoded, check.keys=F)

PCRW Subscales

calculate Cronbach’s alpha of the three PCRW subscale responses.

COME BACK TO THIS

# pcrw_wtanger_items <- pcrw_subset_recoded[ , c(1,3,6,10,14,16)]
# pcrw_wtguilt_items <- pcrw_subset_recoded[ , c(4,7,8,12,15)]
# pcrw_wtfear_items <- pcrw_subset_recoded[ , c(2,5,9,11,13)]

head(pcrw_wtanger_items)

CronbachWTANGER <- psych::alpha(pcrw_wtanger_items, check.keys=T)
CronbachWTGUILT <- psych::alpha(pcrw_wtguilt_items, check.keys=T)
CronbachWTFEAR <- psych::alpha(pcrw_wtfear_items, check.keys=T)

# inspect results
CronbachWTANGER
CronbachWTGUILT
CronbachWTFEAR

Modified WPAS instrument

calculate Cronbach’s alpha of the WPAS instrument responses.

This is particularly important due to substitution of wording from white privilege in original items of validated WPAS instrument, to the term racism in current study, however maybe not necessary since the WPAS was validated and reliabiity confirmed when developed by original authors?

Cronbach <- psych::alpha(wpas_subset_recoded, check.keys=F)
# Cronbach <- psych::alpha(coredata[c("WPASQ01",   
#                    "WPASQ02",  
#                    "WPASQ03",  
#                    "WPASQ04",  
#                    "WPASQ05",
#                    "WPASQ06",
#                    "WPASQ07",
#                    "WPASQ08",
#                    "WPASQ09",
#                    "WPASQ10",
#                    "WPASQ11",
#                    "WPASQ12")], check.keys=F)
# inspect results
Cronbach

## 
## Reliability analysis   
## Call: psych::alpha(x = wpas_subset_recoded, check.keys = F)
## 
##   raw_alpha std.alpha G6(smc) average_r S/N   ase mean   sd median_r
##        0.9      0.91    0.93      0.44 9.5 0.014  5.4 0.58     0.46
## 
##     95% confidence boundaries 
##          lower alpha upper
## Feldt     0.87   0.9  0.92
## Duhachek  0.87   0.9  0.92
## 
##  Reliability if an item is dropped:
##         raw_alpha std.alpha G6(smc) average_r S/N alpha se var.r med.r
## WPASQ01      0.88      0.89    0.92      0.43 8.3    0.016 0.023  0.44
## WPASQ02      0.89      0.90    0.93      0.46 9.2    0.015 0.022  0.48
## WPASQ03      0.88      0.89    0.92      0.43 8.2    0.016 0.023  0.42
## WPASQ05      0.88      0.89    0.92      0.43 8.3    0.015 0.023  0.44
## WPASQ06      0.88      0.89    0.92      0.43 8.1    0.016 0.023  0.43
## WPASQ08      0.89      0.90    0.92      0.44 8.8    0.015 0.022  0.47
## WPASQ09      0.89      0.90    0.93      0.46 9.3    0.014 0.021  0.47
## WPASQ10      0.89      0.90    0.92      0.44 8.6    0.015 0.021  0.46
## WPASQ11      0.89      0.90    0.92      0.45 8.9    0.015 0.021  0.47
## WPASQ12      0.89      0.90    0.92      0.44 8.6    0.015 0.022  0.46
## WPASQ04      0.90      0.90    0.92      0.46 9.4    0.013 0.019  0.48
## WPASQ07      0.89      0.90    0.92      0.45 9.1    0.014 0.023  0.47
## 
##  Item statistics 
##           n raw.r std.r r.cor r.drop mean   sd
## WPASQ01 129  0.80  0.77  0.76   0.74  5.2 0.90
## WPASQ02 129  0.64  0.62  0.57   0.55  5.3 0.84
## WPASQ03 129  0.79  0.79  0.78   0.74  5.2 0.80
## WPASQ05 129  0.75  0.77  0.75   0.70  5.7 0.66
## WPASQ06 129  0.81  0.81  0.80   0.76  5.4 0.82
## WPASQ08 129  0.65  0.69  0.66   0.59  5.7 0.64
## WPASQ09 129  0.57  0.59  0.55   0.48  5.5 0.80
## WPASQ10 129  0.68  0.73  0.71   0.63  5.7 0.63
## WPASQ11 129  0.71  0.67  0.65   0.63  5.0 0.97
## WPASQ12 129  0.69  0.73  0.71   0.63  5.6 0.74
## WPASQ04 129  0.64  0.58  0.55   0.52  4.9 1.25
## WPASQ07 129  0.64  0.64  0.60   0.55  5.6 0.93
## 
## Non missing response frequency for each item
##            1    2    3    4    5    6 miss
## WPASQ01 0.01 0.00 0.04 0.14 0.40 0.42    0
## WPASQ02 0.00 0.02 0.01 0.14 0.37 0.47    0
## WPASQ03 0.00 0.02 0.01 0.11 0.45 0.42    0
## WPASQ05 0.00 0.01 0.01 0.04 0.17 0.78    0
## WPASQ06 0.00 0.01 0.01 0.15 0.29 0.54    0
## WPASQ08 0.00 0.01 0.01 0.03 0.17 0.78    0
## WPASQ09 0.00 0.01 0.02 0.08 0.24 0.65    0
## WPASQ10 0.00 0.00 0.02 0.04 0.21 0.74    0
## WPASQ11 0.00 0.03 0.02 0.20 0.38 0.36    0
## WPASQ12 0.00 0.00 0.02 0.08 0.19 0.71    0
## WPASQ04 0.02 0.04 0.07 0.15 0.31 0.41    0
## WPASQ07 0.01 0.02 0.01 0.06 0.15 0.75    0

Internal consistency confirmed. Results show Cronbach’s alpha > 0.7 so items in the WPAS instrument are internally reliable (even with the modification).

Look for outliers

The diagnostic plots to check if the assumptions for regression are met show several outliers. An examination of these identified observations themselves did not reveal any evidence that they are errors. For example, the start time and end time of the submissions are similar to other observations, so there is no indication that the respondent sped through quickly without actually reading the item, such as might occur with a participant interested in completing in order to enter the raffle.

Let’s examine the answers to each of the 12 WPAS questions. Questions 4 and 7 are reverse-coded.

# https://larmarange.github.io/ggstats/articles/gglikert.html


# This mutate step is required to put the levels in the correct order
# Don't do it across all columns though - this makes all datatypes into factors
# coredata2 <- coredata %>%
#     mutate(across(everything(), ~ factor(.x, levels = likert_levels_plot)))

# coredata2 <- coredata %>%
#     mutate(across(WPASQ01:WPASQ12, ~ factor(.x, levels = likert_levels_plot)))

# Below code superseded by global labels function implemented in earlier chunk
# coredata2 <- coredata2 %>%
#   set_variable_labels(
#     WPASQ01 = paste("Q1 ", originalsm.labels[48]),
#     WPASQ02 = paste("Q2 ", originalsm.labels[49]),
#     WPASQ03 = paste("Q3 ", originalsm.labels[50]),
#     WPASQ04 = paste("Q4 ", originalsm.labels[51]), #reverse
#     WPASQ05 = paste("Q5 ", originalsm.labels[52]),
#     WPASQ06 = paste("Q6 ", originalsm.labels[53]),
#     WPASQ07 = paste("Q7 ", originalsm.labels[54]), #reverse
#     WPASQ08 = paste("Q8 ", originalsm.labels[55]),
#     WPASQ09 = paste("Q9 ", originalsm.labels[56]),
#     WPASQ10 = paste("Q10 ", originalsm.labels[57]),
#     WPASQ11 = paste("Q11 ", originalsm.labels[58]),
#     WPASQ12 = paste("Q12 ", originalsm.labels[59])
#     )
# 
# vt(coredata2)

if(counselingprofessionalsonly) {

counsonly <- coredata %>%
  filter(profID == "counselor" | profID == "counselor educator")

## below is one question at a time, with breakdown by couns vs counsed
# gglikert(counsonly, 
#          include = c(WPASQ01), 
#          y = "profID", 
#          facet_rows = vars(.question),
#          labels_size = 3,
#          facet_label_wrap = 15) +
#    ggtitle("WPAS Q1 only") +
#     scale_fill_brewer(palette = "RdYlBu", direction =-1)

# Below is full set of WPAS questions, for counselors and counselor educators only
# Use this line if want to do the group-by on profID
#gglikert(counsonly, include = c(WPASQ01:WPASQ03, WPASQ05:WPASQ06, WPASQ08:WPASQ12),   facet_rows = vars(profID)) +

gglikert(counsonly, include = c(WPASQ01:WPASQ03, WPASQ05:WPASQ06, WPASQ08:WPASQ12)) +
   ggtitle(paste("Items comprising the ACCOUNT var,", nrow(counsonly), "obs"), subtitle = "Standard-coded questions") +
#    scale_x_continuous ( labels = label_percent_abs(), 
#                         expand = expansion(0,.2))
    scale_fill_brewer(palette = "RdYlBu", direction =-1)

gglikert(coredata, include = c(WPASQ04,WPASQ07)) +
  ggtitle(paste("Items comprising the ACCOUNT var,", nrow(counsonly), "obs", subtitle = "Reverse-coded questions")) +  
  scale_fill_brewer(palette = "RdYlBu", direction =1)

}

## Scale for fill is already present.
## Adding another scale for fill, which will replace the existing scale.
## Scale for fill is already present.
## Adding another scale for fill, which will replace the existing scale.

Examine sample for model assumptions

Per Flamez et al. (2017), before parametric analysis, the dataset needs to be examined to confirm it meets model assumptions, specifically for participant independence, that the dataset is normally distributed, and also homogeneity of covariances.

Histograms on the sample variables

I understand now that I don’t need to worry about whether the sample variables are normally distributed or not. I am keeping this in this working notebook just to preserve the code/my learning process. The main interesting one is ACCOUNT, the last one, which shows some observations in the tail (left side of x axis) which are worth investigating further.

p <- ggplot(coredata, aes(x=WTANGER)) + geom_histogram() 

p+ geom_vline(aes(xintercept=mean(WTANGER)),
            color="blue", linetype="dashed", linewidth=1)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Histogram with density plot
ggplot(coredata, aes(x=WTANGER)) +
 geom_histogram(aes(y=after_stat(density)), colour="black", fill="white")+
 geom_density(alpha=.2, fill="#FF6666")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

hist(coredata$WTANGER)

hist(coredata$WTGUILT)

hist(coredata$WTFEAR)

hist(coredata$SELFEFF)

hist(coredata$ACCOUNT)

Analysis -

The histogram of the dependent variable ACCOUNT shows that that data is not normal.

Descriptive Statistics NEW

descriptivestats <- report_sample(coredata, select = c("WTANGER", "WTGUILT", "WTFEAR", "SELFEFF", "ACCOUNT"))
descriptivestats %>%
  flextable() %>%
  set_header_labels(Summary = "Mean (Std Dev)", values = NULL)

Variable	Mean (Std Dev)
Mean WTANGER (SD)	5.29 (0.55)
Mean WTGUILT (SD)	3.68 (1.12)
Mean WTFEAR (SD)	1.90 (0.60)
Mean SELFEFF (SD)	3.34 (0.45)
Mean ACCOUNT (SD)	5.46 (0.59)

coredata %>% 
  select(WTANGER:ACCOUNT) %>%
  describe(fast = TRUE) %>%
  round(digits = 2) %>%
  flextable()

vars	n	mean	sd	min	max	range	se
1	64	5.29	0.55	3.50	6.0	2.50	0.07
2	64	3.68	1.12	1.20	5.8	4.60	0.14
3	64	1.90	0.60	1.00	4.0	3.00	0.08
4	64	3.34	0.45	2.00	4.0	2.00	0.06
5	64	5.46	0.59	2.83	6.0	3.17	0.07

this was deleted from the manuscript 2-19

Turned eval=FALSE for these because they aren’t running in this notebook, the code does work in the Results chapter. This had been in the Linear Regression section.

[Editing Note: Including the plots below, unsure if I will include them in the actual dissertation. Thoughts? I don’t think I can easily extract them to create their own figure #s and recreating each from scratch would be quite a pain. Can I just say that I ran the diagnostics and describe my assessment?]

[Editing Note: again including these?]

Messiness below

I’m trying to figure out what type of regression to run, so here’s what I understand so far. I figured out that it’s not a multinomial regression, since I only have one DV. I was told to do a hierarchical regression, which I understand to be a LM where the IVs are named in order of importance, based on my understanding of the subject of study. I also discovered stepwise regression which seems to be an avenue to explore? See below for some of that.

What type of variables are these? Since each IV and the DV are calculated from the Likert-style ordinal values on the items on each instrument, are my IV and DV also ordinal, even though they are now represented as integers?

I looked at a lot of sources to figure this out and got conflicting information. For example, here is a discussion of when to use Likert scale responses as ordinal or interval: (https://stats.stackexchange.com/questions/10/under-what-conditions-should-likert-scales-be-used-as-ordinal-or-interval-data). The Likert responses in this study’s instruments are calculated as averages of the scores by scale. My takeaway from that post is that the difference between each possible item on the Likert scales cannot be measured, which makes these data ordinal. (It would only be interval if it could be precisely measured and was the same between each.)

See this: “When using summated scale scores (i.e., we add up score on each item to compute a”total score”), usual statistics may be applied, but you have to keep in mind that you are now working with a latent variable so the underlying construct should make sense! In psychometrics, we generally check that (1) unidimensionnality of the scale holds, (2) scale reliability is sufficient. When comparing two such scale scores (for two different instruments), we might even consider using attenuated correlation measures instead of classical Pearson correlation coefficient. ”

“The model fit was evaluated by examining the Chi-squared (χ2) statistic with a p-value greater than 0.05, the comparative fit index (CFI), the goodness of fit index (GFI), the root mean square error of approximation (RMSEA) and the χ2/df ratio (Hooper et al., 2008). The bootstrap method was employed for significance testing of the total effect, direct effect and total indirect effect.” from Yao et al. (2021) who proposed a model and then used SEM to validate.

This site specifically talks about analyzing Likert data in R with many useful code snippets.

Conclusion: Because the outcome variable is ordinal, and there are multiple independent variables, then this study requires multiple ordinal regression. (Verify this.) (Learning resource: Multiple ordinal regression https://www.youtube.com/watch?v=aiQcFCFtT5k )

ACTUALLY the exact-opposite argument is made here, that summed totals of individual Likert-type questions become interval data. https://stats.stackexchange.com/questions/398046/does-the-total-of-ordinal-scores-on-a-questionnaire-become-scale-in-spss

Here is a quote from an answer on that post: “The upshot here is that in order to do the addition, you had to implicitly convert those categories to numeric values. Often, if we analyzing individual Likert-type items we try to treat the data ordinal, but if we have several items summed into a scale, we treat the result as interval.” So that states that all of my variables are interval. HELP.

Should I run a cumulative odds ratio to understand how much one variable impacts the outcome variable when holding others constant. If that’s a good idea, what will it tell me?

Question: Because the scale averages are calculated from the item responses, each of my variables is a latent variable (constructed variable). Right? Or are they observed? Do I care? I think this only matters if I do path analysis (?).

#justvars <- coredata[ , c(65:69)]

justvars <- coredata %>%
  select(WTANGER:ACCOUNT)

head(justvars)

round(describe(justvars),2)

write_csv(justvars, "data\\justvars.csv")

Correlation matrix with values rounded to 3 decimals

round(cor(justvars),3)

##         WTANGER WTGUILT WTFEAR SELFEFF ACCOUNT
## WTANGER   1.000   0.168 -0.023   0.232   0.419
## WTGUILT   0.168   1.000  0.105   0.003   0.186
## WTFEAR   -0.023   0.105  1.000  -0.329  -0.395
## SELFEFF   0.232   0.003 -0.329   1.000   0.617
## ACCOUNT   0.419   0.186 -0.395   0.617   1.000

There appears to be a decent correlation between SELFEFF and ACCOUNT in every analysis run, regardless of which subset of the data. WTANGER and WTGUILT also have a high correlation which is not surprising given what they measure, though might be an issue in the model. Later on I do an assessment for multicollinearity which does not appear to be a problem.

Basic scatter plots

# Basic scatter plots
ggplot(justvars, aes(x=WTFEAR, y=ACCOUNT)) + 
  geom_point() +
  geom_smooth(method='lm')

## `geom_smooth()` using formula = 'y ~ x'

ggplot(justvars, aes(x=WTGUILT, y=ACCOUNT)) + 
  geom_point()  +
  geom_smooth(method='lm')

## `geom_smooth()` using formula = 'y ~ x'

ggplot(justvars, aes(x=WTANGER, y=ACCOUNT)) + 
  geom_point() +
  geom_smooth(method='lm')

## `geom_smooth()` using formula = 'y ~ x'

ggplot(justvars, aes(x=SELFEFF, y=ACCOUNT)) + 
  geom_point() +
  geom_smooth(method='lm')

## `geom_smooth()` using formula = 'y ~ x'

RANDOM

Basic linear regression on ACCOUNT

This is the standard LM function in R with all independent variables regressed against Antiracist Accountability (ACCOUNT).

First look at specifically the Residuals vs Fitted.

Now with model performance

## Running the lm using absolute value of WTGUILT was explored based on suggestion of chatbot, thinking that scatterplot of ACCOUNT to WTGUILT was v-shaped. The results were identical with abs WTGUILT. Then tried the transformation which did give different results however WTGUILT not statistically significant in that model, and R-sq went down. Skipping that transformation as not additive.

# coredata <- coredata %>%
#   mutate(abs_WTGUILT = abs(WTGUILT))
# 
# plot(coredata$abs_WTGUILT, coredata$ACCOUNT)
# 
# coredata <- coredata %>%
#   mutate(reciprocal_WTGUILT = 1/WTGUILT)
# 
# antiracism.lm <- lm(ACCOUNT ~ SELFEFF + WTANGER + reciprocal_WTGUILT + WTFEAR, data = coredata)


# HERE IS MY STANDARD LM
antiracism.lm <- lm(ACCOUNT ~ SELFEFF + WTANGER + WTGUILT + WTFEAR, data = coredata)

model_performance(antiracism.lm)

summary(antiracism.lm)

## 
## Call:
## lm(formula = ACCOUNT ~ SELFEFF + WTANGER + WTGUILT + WTFEAR, 
##     data = coredata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.49631 -0.22641  0.03488  0.31135  0.73861 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.94789    0.65421   2.977  0.00421 ** 
## SELFEFF      0.62341    0.12829   4.859 9.07e-06 ***
## WTANGER      0.29915    0.10040   2.979  0.00419 ** 
## WTGUILT      0.08707    0.04797   1.815  0.07462 .  
## WTFEAR      -0.24593    0.09254  -2.657  0.01011 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4165 on 59 degrees of freedom
## Multiple R-squared:  0.5355, Adjusted R-squared:  0.504 
## F-statistic: 17.01 on 4 and 59 DF,  p-value: 2.515e-09

residuals <- resid(antiracism.lm)
report(antiracism.lm)

## We fitted a linear model (estimated using OLS) to predict ACCOUNT with SELFEFF,
## WTANGER, WTGUILT and WTFEAR (formula: ACCOUNT ~ SELFEFF + WTANGER + WTGUILT +
## WTFEAR). The model explains a statistically significant and substantial
## proportion of variance (R2 = 0.54, F(4, 59) = 17.01, p < .001, adj. R2 = 0.50).
## The model's intercept, corresponding to SELFEFF = 0, WTANGER = 0, WTGUILT = 0
## and WTFEAR = 0, is at 1.95 (95% CI [0.64, 3.26], t(59) = 2.98, p = 0.004).
## Within this model:
## 
##   - The effect of SELFEFF is statistically significant and positive (beta = 0.62,
## 95% CI [0.37, 0.88], t(59) = 4.86, p < .001; Std. beta = 0.47, 95% CI [0.28,
## 0.66])
##   - The effect of WTANGER is statistically significant and positive (beta = 0.30,
## 95% CI [0.10, 0.50], t(59) = 2.98, p = 0.004; Std. beta = 0.28, 95% CI [0.09,
## 0.46])
##   - The effect of WTGUILT is statistically non-significant and positive (beta =
## 0.09, 95% CI [-8.93e-03, 0.18], t(59) = 1.81, p = 0.075; Std. beta = 0.16, 95%
## CI [-0.02, 0.35])
##   - The effect of WTFEAR is statistically significant and negative (beta = -0.25,
## 95% CI [-0.43, -0.06], t(59) = -2.66, p = 0.010; Std. beta = -0.25, 95% CI
## [-0.44, -0.06])
## 
## Standardized parameters were obtained by fitting the model on a standardized
## version of the dataset. 95% Confidence Intervals (CIs) and p-values were
## computed using a Wald t-distribution approximation.

# Below works just fine, but the standard plot outputs gives more detail
# plot(fitted(antiracism.lm), residuals)

Basic linear regression without WTGUILT

Is this model a better fit?

Now with model performance

antiracismwithoutguilt.lm<-lm(ACCOUNT ~ SELFEFF + WTANGER + WTFEAR, data = coredata)
model_performance(antiracismwithoutguilt.lm)

summary(antiracismwithoutguilt.lm)

## 
## Call:
## lm(formula = ACCOUNT ~ SELFEFF + WTANGER + WTFEAR, data = coredata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.52698 -0.20677  0.03187  0.29712  0.80873 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.07535    0.66275   3.131  0.00269 ** 
## SELFEFF      0.62314    0.13072   4.767 1.23e-05 ***
## WTANGER      0.32959    0.10087   3.267  0.00180 ** 
## WTFEAR      -0.22845    0.09378  -2.436  0.01784 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4244 on 60 degrees of freedom
## Multiple R-squared:  0.5096, Adjusted R-squared:  0.4851 
## F-statistic: 20.78 on 3 and 60 DF,  p-value: 2.362e-09

residuals <- resid(antiracismwithoutguilt.lm)
report(antiracismwithoutguilt.lm)

## We fitted a linear model (estimated using OLS) to predict ACCOUNT with SELFEFF,
## WTANGER and WTFEAR (formula: ACCOUNT ~ SELFEFF + WTANGER + WTFEAR). The model
## explains a statistically significant and substantial proportion of variance (R2
## = 0.51, F(3, 60) = 20.78, p < .001, adj. R2 = 0.49). The model's intercept,
## corresponding to SELFEFF = 0, WTANGER = 0 and WTFEAR = 0, is at 2.08 (95% CI
## [0.75, 3.40], t(60) = 3.13, p = 0.003). Within this model:
## 
##   - The effect of SELFEFF is statistically significant and positive (beta = 0.62,
## 95% CI [0.36, 0.88], t(60) = 4.77, p < .001; Std. beta = 0.47, 95% CI [0.27,
## 0.67])
##   - The effect of WTANGER is statistically significant and positive (beta = 0.33,
## 95% CI [0.13, 0.53], t(60) = 3.27, p = 0.002; Std. beta = 0.30, 95% CI [0.12,
## 0.49])
##   - The effect of WTFEAR is statistically significant and negative (beta = -0.23,
## 95% CI [-0.42, -0.04], t(60) = -2.44, p = 0.018; Std. beta = -0.23, 95% CI
## [-0.43, -0.04])
## 
## Standardized parameters were obtained by fitting the model on a standardized
## version of the dataset. 95% Confidence Intervals (CIs) and p-values were
## computed using a Wald t-distribution approximation.

Now look at the full suite of standard plots from the original basic LM including WTGUILT.

Question In the Q-Q plot of errors, the residuals on both tails are straying from the line, which could indicate that they are not normally distributed.

Learning resource:

—– from coaching D1: The Q-Q of residuals actually visually appears pretty normal (the points are not that far from the line). At the time of that consult, the calculations for the derived values had errors in their formulas which resulted in a low p-value on the Shapiro-Wilk test which seemed to indicate that the residuals were not normal, but this was due to those formula errors. When re-run on corrected formulas for derived values, the Shapiro-Wilk p-value > 0.05.

Coaching D1 also confirmed that it does not matter if the variables in the dataset are normal. The test for normality is only done on the residuals (errors) after running the regression.

Learning resource - cook’s distance and diagnostic plots in general

Check for Heteroscedasticity

In order for the model to be confirmed as a fit and that the results are trustworthy, then we need to evaluate that the data is homoscedastic.

Question In Scale-Location, the red line is not really horizontal (??). Is that a problem?

Maybe a regular (OLS) linear regression (that assumes normal residuals) is not correct since the DV isn’t really continuous, it is actually discrete. Therefore, try a Poisson regression (“In fact, a zero-inflated Poisson, negative binomial, or zero-inflated negative binomial are more likely what you will end up needing.” (https://stats.stackexchange.com/questions/103466/heteroscedasticity-in-residuals-vs-fitted-plot))

plot(antiracism.lm)

Examining the residuals plots for original attempted model

For the subset of data of only counselors and counselor educators, there are two observations that appear to be influential points: Observation #17 and #33 in the plots. On Residuals vs. Fitted, Observation #17 is clearly influencing the line. On Q-Q Residuals, Observation #17 is not even on the line. On Residuals vs. Leverage. Observation #17 is within Cook’s Distance. These plots indicate that this influential point is affecting the model.

#Create density plot of residuals
plot(density(residuals))

shapiro.test(residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  residuals
## W = 0.95639, p-value = 0.02387

2/19/24 with closed study and final dataset - Ran the basic LM of all 4 IVs against DV for the entire dataset including outliers of 128 and Shapiro-Wilk is infintesimal of 0.000005 something. One point was within Cook’s distance on Residuals vs Leverage. Q-Q residuals did not look that bad though. Residuals vs Fitted was mostly a horizontal line.

Re-ran with the limited dataset of only counselor/counselor ed, none of the alternatives on the race field, and no extremely influentials, which gave a sample of 63. Shapiro-Wilk is 0.26. The Q-Q Residuals looks quite good.

When the influential point is included, Shapiro-Wilk is 0.027

Even with the influential point identified above, the Shapiro Wilk test indicates that the residuals from the sample follow a normal distribution.

The Shapiro Wilk test checks normality of the data. The null hypothesis for the Shapiro-Wilk test is that the residuals fit a normal distribution. If the p-value of the Shapiro Wilk test is less than .05, then the null hypothesis is rejected, which is saying that the sample does not come from a normal population. In this case, the p-value is > .05, so the null hypothesis is not rejected - so the residuals do fit a normal distribution.

If the null hypothesis (that the residuals are normal) on the Shapiro Wilk is rejected, then that would mean the residuals were not normal and the dataset was non-parametric (which is not the case here). Linear regression is for parametric data. Per coaching D1, there is not a straight substitution of a different analysis than LM for nonparametric data, so instead, bootstrapping would be an option to pursue. Not needed on the current dataset though.

Identify Outliers

Ran the code below with outliers2 on the 149 dataset subsetted to only couns/counsed, same obs id as above but not extreme in the selfeff; disabling this code in case it’s causing problems but keep the analysis

# https://rpkgs.datanovia.com/rstatix/index.html

# This adds 2 columns, one for is.outlier and one for is.extreme
# Not sure what it will do if I run the same identify_outliers function on another column - maybe add another pair of same-named columns?

# Create separate tibble of just the identified outliers
# coredataoutliers <- coredata %>%
#   identify_outliers(ACCOUNT)
# head(coredataoutliers)

# returns a new tibble with 2 columns
myoutliers <- coredata %>%
  identify_outliers(ACCOUNT)
head(myoutliers)

# Ran the code below on the 149 dataset subsetted to only couns/counsed, same obs id as above but not extreme in the selfeff; disabling this code in case it's causing problems but keep the analysis
# myoutliers2 <- coredata %>%
#   identify_outliers(SELFEFF)
# head(myoutliers2)

# identify the extremely influential per the rstatix function
# create new column extremelyinfluential to capture true/false
coredata <- coredata %>%
  mutate(extremelyinfluential = is_extreme(ACCOUNT))

# added Tues 2/6 but seemed to cause problems down-code?
# re-added 2/10
 coredata <- coredata %>%
   mutate(influential = is_outlier(ACCOUNT))
head(coredata)

# Needed to confirm that the observation is still intact- this shows it is (this line works but is not needed in reality)
#coredata[coredata$"Respondent ID" == "114516188154", ]

When run on counselors only (including counselor educators), the identify_outliers() function identifies two observations, with the one containing Respondent ID of 114516188154 identified as “extreme.” That “extreme” datapoint is Observation __.

Correlations using Kendall’s tau

3/15/24 Arcenis recommended using Kendall’s \(\tau\) to evaluate whether to include the outlier or not. This was done by doing correlation via cor.test using the Kendall method first on SELFEFF and ACCOUNT in the full dataset, and compare to the same variables in the dataset with the outlier omitted.

## Kendall on SELFEFF and ACCOUNT full dataset
cor.test(coredata$SELFEFF, coredata$ACCOUNT, method = "kendall")

## 
##  Kendall's rank correlation tau
## 
## data:  coredata$SELFEFF and coredata$ACCOUNT
## z = 4.4415, p-value = 8.934e-06
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
##       tau 
## 0.4436783

kendallnooutlierself <- coredata %>%
  filter(extremelyinfluential == FALSE) %>%
  .$SELFEFF

kendallnooutlieraccount <- coredata %>%
  filter(extremelyinfluential == FALSE) %>%
  .$ACCOUNT

## Kendall on SELFEFF and ACCOUNT without the outlier
cor.test(kendallnooutlierself, kendallnooutlieraccount, method = "kendall")

## 
##  Kendall's rank correlation tau
## 
## data:  kendallnooutlierself and kendallnooutlieraccount
## z = 4.161, p-value = 3.169e-05
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
##       tau 
## 0.4212329

Results show a slightly higher tau when the outlier is included (.444) compared to when outlier excluded (.421). Arcenis advised that a tau > .5 is very strong; anything above .2 is strong enough for publication.

For Arcenis: Does the higher Kendall’s tau result for the model with the outlier included provide further reinforcement of my decision to include the outlier? Also do you happen to have a citation that I can use for this? This is not a standard statistic in my field, I don’t think.

Researcher decisions

Even though the residuals are shown to be normal based on the Shapiro Wilk test above, the plots still are visually indicating relationships between the observations in the dataset that need further attention.

Aguinis et al. (2013) presented recommendations on how to handle outliers. They commented that studying the outliers is important.

Below is the same regression with diagnostics plots on the residuals on the dataset without Observation __.

The model_performance() output gives a lower AIC than the LM with the extremely influential included.

The Shapiro wilk shows it as not normal.

# This is brute force just deleting specific rows
# But don't do that as it messes up the observation numbering
# coredatanoextreme <- coredata[-c(17), ]

antiracism2.lm<-lm(ACCOUNT ~ WTFEAR + WTGUILT + WTANGER + SELFEFF, data = subset(coredata, extremelyinfluential == FALSE))

# from performance package
model_performance(antiracism2.lm)

summary(antiracism2.lm)

## 
## Call:
## lm(formula = ACCOUNT ~ WTFEAR + WTGUILT + WTANGER + SELFEFF, 
##     data = subset(coredata, extremelyinfluential == FALSE))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.43581 -0.23386  0.05376  0.27226  0.74150 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.81929    0.69752   4.042 0.000158 ***
## WTFEAR      -0.21743    0.08843  -2.459 0.016949 *  
## WTGUILT      0.06924    0.04599   1.506 0.137601    
## WTANGER      0.20147    0.10173   1.980 0.052409 .  
## SELFEFF      0.52666    0.12676   4.155 0.000108 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3953 on 58 degrees of freedom
## Multiple R-squared:  0.396,  Adjusted R-squared:  0.3543 
## F-statistic: 9.506 on 4 and 58 DF,  p-value: 5.581e-06

plot(antiracism2.lm)

residuals2 <- resid(antiracism2.lm)
shapiro.test(residuals2)

## 
##  Shapiro-Wilk normality test
## 
## data:  residuals2
## W = 0.95656, p-value = 0.02608

Evaluation of data against model assumptions after removing extremely influential point

Residuals vs Fitted is showing how another point, Observation 32 (the datapoint identified as 33 in the prior plots, not sure why the index changed, probably because I lost the row of labels in one of my manipulations?), is also having an effect on the line, however clearly not as extreme as Observation 17. This observation will not be removed from the dataset.

The Scale-Location plot still has a slope however it is much more horizontal without the influential observation.

Q-Q Residuals meet the colloquial “fat pencil test” where a fat pencil overlaid on top would capture all the points.

The Shapiro-Wilk again shows a p-value > .05 which means the null hypothesis of the Shapiro-Wilk test is not rejected, which means that the sample comes from a population that is normally distributed.

Test for multicollinearity of the 3 white racial affect independent variables

Run the regression on ACCOUNT (excluding the influential point).

Get VIF values for just the 3 affect variables. A VIF close to 1 means not collinear.

Each of the three white racial affect variables are unique in what they are contributing to the regression.

library(car)

## Loading required package: carData

whiteracialaffectsonaccount.lm <- lm(ACCOUNT ~ SELFEFF + WTGUILT + WTFEAR + WTANGER, data = subset(coredata, extremelyinfluential == FALSE))

vif(whiteracialaffectsonaccount.lm)

##  SELFEFF  WTGUILT   WTFEAR  WTANGER 
## 1.101855 1.032837 1.111659 1.027722

A follow-up analysis looked at the Variance Inflation Factor (VIF) test statistic. This assessment examines whether variables in a model show collinearity, which can affect the results. The VIF values for all four independent variables are near 1, so multicollinearity was determined to not be present.

Simple model: Only SELFEFF on ACCOUNT

The extremely influential point is omitted.

The other three constructs of white racial affect are omitted.

The purpose of this section is to see if this simpler model (residuals from) fit a normal distribution. I hope to get a higher p-value from the Shapiro wilk test.

Even this basic LM is showing that the residuals are not normal - the dataset is not normal (I think)

OK so what this says is that the assumption of linearity is not here - the relationship between ACCOUNT and SELFEFF cannot be expressed by a straight line.

Is this because they each are interval datapoints???

Help from Devin:

Here’s what I know: 1. Doing this simple OLS regression gives not-normal residuals per Shapiro-Wilk, whether I do it on my full sample or on the full sample with extremely influential points omitted.

antiracism.lm<-lm(ACCOUNT ~ WTFEAR + WTGUILT + WTANGER + SELFEFF, data = coredata)

All of these variables are averages, computed from answers to Likert-style questions on instruments. All except for SELFEFF are 1-6, SELFEFF is 1-4. Does this mean they are interval data? We cannot actually know the distance between the points. If they are interval data, does OLS LM not work because of this?

Looking at the scatter plot just between ACCOUNT and SELFEFF shows that there really is no way to fit a line to this data. Is this what’s causing me problems?

Let’s please look at the <a href=“#spearmans-rhoSpearman’s rho together. I see that there are no statistically significant relationships with the emotional variables and ACCOUNT. HOWEVER: My original model proposed that perhaps SELFEFF was a mediating variable. Can I make any conclusions from the (admittedly non statistically significant) inverse relationships revealed between SELFEFF and the 3 white racial affects???
Look at scatter plot of ACCOUNT and SELFEFF Because they are both intervals (are they???), does that mean that I cannot do a LM? Does this somehow say that the dataset is not normal?
What if I calculated totals instead of means for the ACCOUNT and SELFEFF scores - these variables are derived from Likert style answers on the underlying instruments. It would still be interval data but it would be different - would it? Would I measure the exact-same relationships if looking at sum vs looking at average? I don’t really understand data.
Please help me with terminology. My study is specifically looking at counselors (which also includes counselor educators). However, when I did data collection, I recruited from a variety of listservs for mental health practitioners, including social workers, therapists, and psychologists. Therefore, my entire dataset has responses from people who professionally identify in all of these ways.

When I’m talking about the entire dataset, with counselors, therapists, etc., is that a different population than when I’m talking about the subset of data from only counselors?

# antiracismselfeffonly.lm<-lm(ACCOUNT ~ SELFEFF, data = subset(coredata, extremelyinfluential == FALSE))

antiracismselfeffonly.lm<-lm(ACCOUNT ~ SELFEFF, data = coredata)

model_performance(antiracismselfeffonly.lm)

summary(antiracismselfeffonly.lm)

## 
## Call:
## lm(formula = ACCOUNT ~ SELFEFF, data = coredata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.53781 -0.17516  0.06018  0.31018  0.81018 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.7338     0.4460   6.130 6.71e-08 ***
## SELFEFF       0.8187     0.1325   6.177 5.57e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4691 on 62 degrees of freedom
## Multiple R-squared:  0.381,  Adjusted R-squared:  0.371 
## F-statistic: 38.16 on 1 and 62 DF,  p-value: 5.573e-08

plot(antiracismselfeffonly.lm)

residualsselfeff <- resid(antiracismselfeffonly.lm)
shapiro.test(residualsselfeff)

## 
##  Shapiro-Wilk normality test
## 
## data:  residualsselfeff
## W = 0.92066, p-value = 0.0005295

Goodness of Fit: Examine R-squared

Cannot really examine this output since the data does not meet assumptions for a linear model.

When I removed the extremely influential outlier, the Multiple R-squared went down from close to 50% to under 40%.

This influential point is apparently a model fit outlier (Aguinis et al., 2013) as it affected R-squared. Figure out how to ggplot these both on the same graph.

Notes: Here is what Aguinis et al. (2013) says to do: “Report findings with and without either of the following approaches: -Respecify model -Remove outlier -Robust approaches (e.g., absolute deviation, least trimmed squares, M-estimation, Bayesian statistics):

***TO DO - LOOK AT THOSE ROBUST APPROACHES ABOVE.

# from https://statisticsglobe.com/plot-predicted-vs-actual-values-in-r

subcore <- coredata %>%
  filter(extremelyinfluential == FALSE)

# Create data for ggplot2
data_mod <- data.frame(Predicted = predict(antiracism2.lm),
                       Observed = subcore$ACCOUNT)

 # Draw plot using ggplot2 package
ggplot(data_mod,                                    
       aes(x = Predicted,
           y = Observed)) +
  geom_point() +
  geom_abline(intercept = 0,
              slope = 1,
              color = "red",
              linewidth = 2) +
  ggtitle("R-squared Fitted vs Observed - full data")

Non-parametric tests considered

One sample t-test

Per this websiteThe assumptions for the data when using this analysis are that the dependent variable be either interval or ratio (yes); the observations should be independent (yes), the DV should be normally distributed (yes mostly), and there should be no outliers (does not fully meet this). This is used to test a hypothesis: is the average value for the DV X? I don’t really care what the mean is though, I’m looking at relationships. I don’t think this is the right analysis to perform.

Spearman’s rho

Used for nonparametric correlation calculations.

# keeping snippet - from psych package - found simpler way below instead

corr1 <- cor2(coredata$ACCOUNT, coredata$SELFEFF, digits=2, use="pairwise", method = 'spearman', cor="cor", show=TRUE)

From the rstatix package.

corout <- coredata %>%
  cor_test(
    vars = "ACCOUNT",
    vars2 = c("SELFEFF", "WTGUILT", "WTANGER", "WTFEAR"),
  method = "spearman"
)

corout$p <- formatC(corout$p, digits=4)
corout

Based on these low p-values, there are many significant relationships between all four independent variables and ACCOUNT, though the only one with a larger correlation is SELFEFF.

corout <- coredata %>%
  cor_test(
    vars = "SELFEFF",
    vars2 = c("WTGUILT", "WTANGER", "WTFEAR"),
  method = "spearman"
)

corout$p <- formatC(corout$p, digits=4)
corout

What’s interesting here is, while there are no statistically significant relationships demonstrated, the direction of the calculated correlation is negative. This actually matches my hypothesized relationships in the model - I think.

(Pulled from manuscript - this is model assumptions stuff - placed here to preserve it, needs to be below the Model Assumptions header because this is where the lm is available)

And of course at the very end after I had done everything manually, I discovered this R function to report on it in one table.

Model	Normality (Shapiro-Wilk)	Homoscedasticity (Breusch-Pagan)	Autocorrelation of residuals (Durbin-Watson)	Diagnostic
ACCOUNT ~ SELFEFF + WTANGER + WTGUILT + WTFEAR	0.03	0.61	0.78	1.00

The p value of < 0.05 indicate that assumption is not met. The Diagnostic column indicates how many of the assumptions are violated. In this output, the Shapiro-Wilk test of normality again shows the p-value of < 0.05.

This output doesn’t go to Word nicely from the qmd files so not including it now - but definitely try to use this in the future! It is a much simpler way of evaluating everything that I spent so long doing manually!!

Throwing shit at the wall

I tried the iRegression package (installed it and looked through) but it’s using two alternative formulas with a min and a max, does not seem relevant at all.

Individual items of the A-RES regressed against ACCOUNT

antiracismselfeffQs.lm<-lm(ACCOUNT ~ ARESQ01 + ARESQ02 + ARESQ03 + ARESQ04, data = subset(coredata, extremelyinfluential == FALSE))


summary(antiracismselfeffQs.lm)

## 
## Call:
## lm(formula = ACCOUNT ~ ARESQ01 + ARESQ02 + ARESQ03 + ARESQ04, 
##     data = subset(coredata, extremelyinfluential == FALSE))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.41187 -0.18860  0.04657  0.26843  0.66576 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               5.54285    0.26752  20.719  < 2e-16 ***
## ARESQ01Agree             -0.23810    0.11458  -2.078  0.04248 *  
## ARESQ02Agree             -0.23275    0.13673  -1.702  0.09446 .  
## ARESQ02Disagree          -0.42756    0.21386  -1.999  0.05062 .  
## ARESQ03Agree             -0.01362    0.11686  -0.117  0.90762    
## ARESQ03Disagree          -0.14683    0.20863  -0.704  0.48460    
## ARESQ03Disagree strongly -1.35976    0.45918  -2.961  0.00455 ** 
## ARESQ04Disagree           0.23258    0.26775   0.869  0.38889    
## ARESQ04Disagree strongly  0.35177    0.24798   1.419  0.16178    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4007 on 54 degrees of freedom
## Multiple R-squared:  0.422,  Adjusted R-squared:  0.3364 
## F-statistic: 4.929 on 8 and 54 DF,  p-value: 0.000132

#residuals plots - can skip these unless it's a different dataset
plot(antiracismselfeffQs.lm)

## Warning: not plotting observations with leverage one:
##   35

residualsselfeffQs <- resid(antiracismselfeffQs.lm)
 shapiro.test(residualsselfeffQs)

## 
##  Shapiro-Wilk normality test
## 
## data:  residualsselfeffQs
## W = 0.94695, p-value = 0.008799

Individual items of the PCRW White Guilt subscale regressed against ACCOUNT

arwtguilt.lm<-lm(ACCOUNT ~ PCRWQ04 + PCRWQ07 + PCRWQ08 + PCRWQ12 + PCRWQ15, data = subset(coredata, extremelyinfluential == FALSE))


summary(arwtguilt.lm)

## 
## Call:
## lm(formula = ACCOUNT ~ PCRWQ04 + PCRWQ07 + PCRWQ08 + PCRWQ12 + 
##     PCRWQ15, data = subset(coredata, extremelyinfluential == 
##     FALSE))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.7698 -0.1681 -0.0076  0.2337  0.9052 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 6.29198    0.49265  12.772 3.93e-15 ***
## PCRWQ04Moderately Agree    -0.12068    0.25762  -0.468  0.64221    
## PCRWQ04Slightly Agree       0.19762    0.30614   0.646  0.52256    
## PCRWQ04Slightly Disagree   -0.40432    0.32684  -1.237  0.22386    
## PCRWQ04Moderately Disagree  0.11409    0.31484   0.362  0.71914    
## PCRWQ04Strongly Disagree   -0.02370    0.33389  -0.071  0.94379    
## PCRWQ07Moderately Agree    -0.01286    0.25564  -0.050  0.96014    
## PCRWQ07Slightly Agree      -0.23741    0.29488  -0.805  0.42590    
## PCRWQ07Slightly Disagree   -0.22762    0.29888  -0.762  0.45115    
## PCRWQ07Moderately Disagree -0.48544    0.33545  -1.447  0.15628    
## PCRWQ07Strongly Disagree   -0.53988    0.34905  -1.547  0.13044    
## PCRWQ08Moderately Agree    -0.99932    0.46715  -2.139  0.03908 *  
## PCRWQ08Slightly Agree      -1.65232    0.56628  -2.918  0.00596 ** 
## PCRWQ08Slightly Disagree   -1.08514    0.50691  -2.141  0.03895 *  
## PCRWQ08Moderately Disagree -1.22786    0.52914  -2.321  0.02593 *  
## PCRWQ08Strongly Disagree   -1.19611    0.53130  -2.251  0.03039 *  
## PCRWQ12Moderately Agree     0.84870    0.43385   1.956  0.05802 .  
## PCRWQ12Slightly Agree       1.12917    0.39599   2.852  0.00708 ** 
## PCRWQ12Slightly Disagree    0.96254    0.42528   2.263  0.02957 *  
## PCRWQ12Moderately Disagree  0.83788    0.42333   1.979  0.05526 .  
## PCRWQ12Strongly Disagree    0.84120    0.45147   1.863  0.07038 .  
## PCRWQ15Moderately Agree    -0.45247    0.27993  -1.616  0.11452    
## PCRWQ15Slightly Agree      -0.31772    0.27501  -1.155  0.25538    
## PCRWQ15Slightly Disagree   -0.27565    0.30843  -0.894  0.37725    
## PCRWQ15Moderately Disagree -0.01853    0.28554  -0.065  0.94860    
## PCRWQ15Strongly Disagree   -0.25358    0.30154  -0.841  0.40578    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4528 on 37 degrees of freedom
## Multiple R-squared:  0.4944, Adjusted R-squared:  0.1528 
## F-statistic: 1.447 on 25 and 37 DF,  p-value: 0.1505

plot(arwtguilt.lm)

residualswtg <- resid(arwtguilt.lm)
 shapiro.test(residualswtg)

## 
##  Shapiro-Wilk normality test
## 
## data:  residualswtg
## W = 0.98608, p-value = 0.6978

Individual items of the PCRW White Fear subscale regressed against ACCOUNT

arwtfear.lm<-lm(ACCOUNT ~ PCRWQ02 + PCRWQ05 + PCRWQ09 + PCRWQ11 + PCRWQ13, data = subset(coredata, extremelyinfluential == FALSE))


summary(arwtfear.lm)

## 
## Call:
## lm(formula = ACCOUNT ~ PCRWQ02 + PCRWQ05 + PCRWQ09 + PCRWQ11 + 
##     PCRWQ13, data = subset(coredata, extremelyinfluential == 
##     FALSE))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0133 -0.2156  0.0000  0.2392  0.5220 
## 
## Coefficients: (1 not defined because of singularities)
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 4.487366   0.641296   6.997  1.3e-08 ***
## PCRWQ02Moderately Agree    -0.451594   0.237889  -1.898  0.06438 .  
## PCRWQ02Slightly Agree      -0.466731   0.241747  -1.931  0.06014 .  
## PCRWQ02Slightly Disagree   -0.177519   0.284487  -0.624  0.53593    
## PCRWQ02Moderately Disagree -0.536264   0.268273  -1.999  0.05196 .  
## PCRWQ02Strongly Disagree   -1.412104   0.416077  -3.394  0.00149 ** 
## PCRWQ05Moderately Agree     1.059297   0.533395   1.986  0.05344 .  
## PCRWQ05Slightly Agree       0.228788   0.344704   0.664  0.51041    
## PCRWQ05Slightly Disagree    0.007781   0.307558   0.025  0.97993    
## PCRWQ05Moderately Disagree -0.059705   0.339573  -0.176  0.86126    
## PCRWQ05Strongly Disagree    0.055130   0.306129   0.180  0.85793    
## PCRWQ09Slightly Disagree   -0.715113   0.629977  -1.135  0.26260    
## PCRWQ09Moderately Disagree -0.545595   0.477719  -1.142  0.25974    
## PCRWQ09Strongly Disagree   -0.285772   0.427783  -0.668  0.50768    
## PCRWQ11Slightly Agree      -0.141294   0.811773  -0.174  0.86264    
## PCRWQ11Slightly Disagree   -1.280774   1.172059  -1.093  0.28059    
## PCRWQ11Moderately Disagree  0.001476   1.031874   0.001  0.99887    
## PCRWQ11Strongly Disagree   -0.066097   1.002196  -0.066  0.94772    
## PCRWQ13Slightly Agree             NA         NA      NA       NA    
## PCRWQ13Moderately Disagree  1.797549   0.883728   2.034  0.04814 *  
## PCRWQ13Strongly Disagree    1.823666   0.859564   2.122  0.03967 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4043 on 43 degrees of freedom
## Multiple R-squared:  0.5315, Adjusted R-squared:  0.3245 
## F-statistic: 2.567 on 19 and 43 DF,  p-value: 0.005245

plot(arwtfear.lm)

## Warning: not plotting observations with leverage one:
##   12, 18, 35, 39, 58

residualswtf <- resid(arwtfear.lm)
 shapiro.test(residualswtf)

## 
##  Shapiro-Wilk normality test
## 
## data:  residualswtf
## W = 0.95653, p-value = 0.02598

Individual items of the PCRW White Anger subscale regressed against ACCOUNT

arwtanger.lm<-lm(ACCOUNT ~ PCRWQ01 + PCRWQ03 + PCRWQ06 + PCRWQ10 + PCRWQ14 + PCRWQ16, data = subset(coredata, extremelyinfluential == FALSE))


summary(arwtanger.lm)

## 
## Call:
## lm(formula = ACCOUNT ~ PCRWQ01 + PCRWQ03 + PCRWQ06 + PCRWQ10 + 
##     PCRWQ14 + PCRWQ16, data = subset(coredata, extremelyinfluential == 
##     FALSE))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.24238 -0.10946  0.00051  0.18010  0.81706 
## 
## Coefficients: (1 not defined because of singularities)
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 5.492382   0.126659  43.364  < 2e-16 ***
## PCRWQ01Moderately Agree    -0.049882   0.140558  -0.355 0.724540    
## PCRWQ01Slightly Agree      -1.282855   0.299505  -4.283 0.000112 ***
## PCRWQ01Slightly Disagree   -0.645132   0.439605  -1.468 0.150052    
## PCRWQ01Moderately Disagree -0.343116   0.703278  -0.488 0.628298    
## PCRWQ03Moderately Agree     0.279691   0.164149   1.704 0.096160 .  
## PCRWQ03Slightly Agree       0.231272   0.173110   1.336 0.189108    
## PCRWQ03Slightly Disagree    0.340439   0.220769   1.542 0.130933    
## PCRWQ03Moderately Disagree  0.391191   0.227441   1.720 0.093167 .  
## PCRWQ03Strongly Disagree    0.279873   0.368582   0.759 0.452111    
## PCRWQ06Moderately Agree    -0.192020   0.142417  -1.348 0.185149    
## PCRWQ06Slightly Agree       0.049148   0.366575   0.134 0.894018    
## PCRWQ06Slightly Disagree          NA         NA      NA       NA    
## PCRWQ10Moderately Agree    -0.121756   0.195005  -0.624 0.535929    
## PCRWQ10Slightly Agree       0.704609   0.479662   1.469 0.149662    
## PCRWQ14Moderately Agree     0.004812   0.186062   0.026 0.979495    
## PCRWQ14Slightly Agree       0.173311   0.202413   0.856 0.396972    
## PCRWQ14Slightly Disagree    0.198846   0.351501   0.566 0.574754    
## PCRWQ14Moderately Disagree -0.359996   0.454683  -0.792 0.433176    
## PCRWQ14Strongly Disagree    0.276346   0.417498   0.662 0.511824    
## PCRWQ16Moderately Agree    -0.545528   0.217339  -2.510 0.016219 *  
## PCRWQ16Slightly Agree      -0.136832   0.370480  -0.369 0.713824    
## PCRWQ16Moderately Disagree -1.301692   0.685464  -1.899 0.064795 .  
## PCRWQ16Strongly Disagree   -1.056518   0.609491  -1.733 0.090721 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3943 on 40 degrees of freedom
## Multiple R-squared:  0.5856, Adjusted R-squared:  0.3576 
## F-statistic: 2.569 on 22 and 40 DF,  p-value: 0.004661

plot(arwtanger.lm)

## Warning: not plotting observations with leverage one:
##   8, 13, 17, 19, 41, 54, 57

residualswta <- resid(arwtanger.lm)
 shapiro.test(residualswta)

## 
##  Shapiro-Wilk normality test
## 
## data:  residualswta
## W = 0.93244, p-value = 0.001887

Quantile Regression

https://www.youtube.com/watch?v=Gtz8ca_4hVg

Wed morning made eval false here, this code is not running

library(quantreg)
#plot ordinary and median regressions

ggplot(coredata, aes(SELFEFF, ACCOUNT)) +
  geom_point() +
  geom_smooth(method = lm, se = F, color = "red", ) +
  geom_quantile(quantiles = 0.5)

lr <- lm(ACCOUNT ~ SELFEFF, data = coredata)
check_heteroscedasticity(lr)

qm50 <- rq(ACCOUNT ~ SELFEFF, data = subset(coredata, extremelyinfluential == FALSE), tau = 0.5)
AIC(lr, qm50)

check_outliers(lr)
check_normality(lr)
#check_homogeneity(lr)


# multivariable regression
q <- rq(ACCOUNT ~ SELFEFF + WTANGER, data = subset(coredata, extremelyinfluential == FALSE), 
        tau = seq(.05, .95, by = 0.05))

summary(q)  %>% 
   plot(c("SELFEFF", "WTANGER"))


library(quantregGrowth)
set.seed(1)
o <-gcrq(ACCOUNT ~ ps(SELFEFF), 
         data = coredata %>% sample_n(100), tau=seq(.10,.90,l=3))

# par(mfrow=c(1,2)) # for several plots
plot(o, legend=TRUE, conf.level = .95, shade=TRUE, lty = 1, lwd = 3, col = -1, res=TRUE)

Nonparametric regression (fancy stuff)

From https://socialsciences.mcmaster.ca/jfox/Books/Companion/appendices/Appendix-Nonparametric-Regression.pdf

# nonparametric multiple regression 
mod.lo <- loess(ACCOUNT ~ SELFEFF + WTANGER, span=.5, degree=1,  data = subset(coredata, extremelyinfluential == FALSE))

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : pseudoinverse used at 11.431 15.903

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : neighborhood radius 1.633

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : reciprocal condition number 1.3479e-19

summary(mod.lo)

## Call:
## loess(formula = ACCOUNT ~ SELFEFF + WTANGER, data = subset(coredata, 
##     extremelyinfluential == FALSE), span = 0.5, degree = 1)
## 
## Number of Observations: 63 
## Equivalent Number of Parameters: 7.77 
## Residual Standard Error: 0.3778 
## Trace of smoother matrix: 10.19  (exact)
## 
## Control settings:
##   span     :  0.5 
##   degree   :  1 
##   family   :  gaussian
##   surface  :  interpolate      cell = 0.2
##   normalize:  TRUE
##  parametric:  FALSE FALSE
## drop.square:  FALSE FALSE

selfe <- with(subset(coredata, extremelyinfluential == FALSE), seq(min(SELFEFF), max(SELFEFF), len=25))
wtang <- with(subset(coredata, extremelyinfluential == FALSE), seq(min(WTANGER), max(WTANGER), len=25))
newdata <- expand.grid(SELFEFF=selfe, WTANGER=wtang)
fit.account <- matrix(predict(mod.lo, newdata), 25, 25)

## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : eval 8.165 11.084

## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : lowerlimit 8.1405
## 11.431

## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : extrapolation not
## allowed with blending

## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : eval 8.3691 11.084

## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : lowerlimit 8.1405
## 11.431

## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : extrapolation not
## allowed with blending

## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : eval 8.5732 11.084

## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : lowerlimit 8.1405
## 11.431

## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : extrapolation not
## allowed with blending

## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : eval 8.7773 11.084

## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : lowerlimit 8.1405
## 11.431

## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : extrapolation not
## allowed with blending

## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : eval 8.9815 11.084

## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : lowerlimit 8.1405
## 11.431

## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : extrapolation not
## allowed with blending

## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : eval 9.1856 11.084

## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : lowerlimit 8.1405
## 11.431

## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : extrapolation not
## allowed with blending

## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : eval 9.3897 11.084

## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : lowerlimit 8.1405
## 11.431

## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : extrapolation not
## allowed with blending

## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : eval 9.5938 11.084

## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : lowerlimit 8.1405
## 11.431

## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : extrapolation not
## allowed with blending

## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : eval 9.798 11.084

## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : lowerlimit 8.1405
## 11.431

## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : extrapolation not
## allowed with blending

## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : eval 8.165 11.345

## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : lowerlimit 8.1405
## 11.431

## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : extrapolation not
## allowed with blending

## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : eval 8.3691 11.345

## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : lowerlimit 8.1405
## 11.431

## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : extrapolation not
## allowed with blending

## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : eval 8.5732 11.345

## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : lowerlimit 8.1405
## 11.431

## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : extrapolation not
## allowed with blending

## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : eval 8.7773 11.345

## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : lowerlimit 8.1405
## 11.431

## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : extrapolation not
## allowed with blending

## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : eval 8.9815 11.345

## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : lowerlimit 8.1405
## 11.431

## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : extrapolation not
## allowed with blending

## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : eval 9.1856 11.345

## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : lowerlimit 8.1405
## 11.431

## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : extrapolation not
## allowed with blending

## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : eval 9.3897 11.345

## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : lowerlimit 8.1405
## 11.431

## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : extrapolation not
## allowed with blending

## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : eval 9.5938 11.345

## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : lowerlimit 8.1405
## 11.431

## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : extrapolation not
## allowed with blending

## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : eval 9.798 11.345

## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : lowerlimit 8.1405
## 11.431

## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : extrapolation not
## allowed with blending

persp(selfe, wtang, fit.account, theta=45, phi=30, ticktype="detailed",
xlab="SELFEFF", ylab="WTANGER", zlab="ACCOUNT", expand=2/3,
shade=0.5)

GLMULTI package

library(glmulti)

# This did not work, the code runs but cannot get the output to display. 


glmulti.output <- glmulti(ACCOUNT   ~ SELFEFF + WTGUILT + WTANGER + WTFEAR,
        data   = subset(coredata, extremelyinfluential == FALSE), 
        crit   = aicc,       # AICC corrected AIC for small samples
        level  = 2,          # 2 with interactions, 1 without  
        method = "d",        # "d", or "h", or "g" for genetic which is v slow
        family = gaussian, 
        plotty = TRUE,
        fitfunction = glm,   # Type of model (LM, GLM etc.)
        confsetsize = 100)   # Keep 100 best models

# optimal_model_glmulti_d <- glmulti.output@objects[[1]]
# performance(optimal_model_glmulti_d)
# optimal_model_glmulti_d$formula

print(glmulti.output)
summary(glmulti.output)

USELESS!

Nadaray Watson kernel function

My understanding so far

The low p-value for self efficacy correlated to accountability means the null hypothesis should be rejected; the correlation between these two constructs is stastically significant _if the dataset were normal, but it’s not__.

And, at least in this subset of the full dataset, the null hypothesis cannot be rejected about any other relationships.

My proposed model said that the three white racial affects (emotions) influence antiracist self-efficacy, which influences antiracist accountability. (I think this means that SELFEFF is a mediator variable - correct? Does that determine how the regression should be set up - or even that path analysis is needed?)

The Multiple R-squared calculated in the regression of all four IVs against ACCOUNT in the counselors only dataset is: 0.3789

This means that about 38% of the variation in ACCOUNT is explained by the variables. This is considered generally to be a very low effect size; however, this post by a statistician indicates that when studying humans, R squared values are often < .5; need to find an academic citation for this.

When I run only SELFEFF against ACCOUNT, the R-squared goes down.

antiracism3.lm<-lm(ACCOUNT ~ SELFEFF + WTGUILT, data = subset(coredata, extremelyinfluential == FALSE))


summary(antiracism3.lm)

## 
## Call:
## lm(formula = ACCOUNT ~ SELFEFF + WTGUILT, data = subset(coredata, 
##     extremelyinfluential == FALSE))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.34037 -0.16587  0.07884  0.23893  0.68828 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.13065    0.47950   6.529 1.58e-08 ***
## SELFEFF      0.63564    0.12816   4.960 6.12e-06 ***
## WTGUILT      0.06528    0.04802   1.359    0.179    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4185 on 60 degrees of freedom
## Multiple R-squared:  0.2995, Adjusted R-squared:  0.2762 
## F-statistic: 12.83 on 2 and 60 DF,  p-value: 2.301e-05

#residuals plots - can skip these unless it's a different dataset
#plot(antiracism3.lm)

# residuals3 <- resid(antiracism3.lm)
# shapiro.test(residuals3)

# Create data for ggplot2
data_mod <- data.frame(Predicted = predict(antiracism3.lm),
                       Observed = subcore$ACCOUNT)

 # Draw plot using ggplot2 package
ggplot(data_mod,                                    
       aes(x = Predicted,
           y = Observed)) +
  geom_point() +
  geom_abline(intercept = 0,
              slope = 1,
              color = "red",
              linewidth = 2) +
  ggtitle("R-squared Fitted vs Observed - only SELFEFF")

Let’s look at just the white racial affects on self-efficacy as a DV.

Linear regression of the white racial affects on SELFEFF

This is the standard LM function in R with with SELFEFF as DV, excluding the extremely influential observation.

The p-values show only WTFEAR having a (negative) effect on SELFEFF.

The Residuals vs Fitted does not look at all like a cone here.

But what’s going on with the Q-Q Residuals??

#antiracistselfeff.lm<-lm(SELFEFF ~ WTANGER + WTGUILT + WTFEAR, data = subset(coredata, extremelyinfluential == FALSE))
antiracistselfeff.lm<-lm(SELFEFF ~ WTFEAR + WTGUILT + WTANGER, data = coredata)
summary(antiracistselfeff.lm)

## 
## Call:
## lm(formula = SELFEFF ~ WTFEAR + WTGUILT + WTANGER, data = coredata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.79186 -0.35362  0.02283  0.21280  0.87491 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.8170242  0.5487552   5.133 3.24e-06 ***
## WTFEAR      -0.2384423  0.0878873  -2.713  0.00869 ** 
## WTGUILT     -0.0004311  0.0482747  -0.009  0.99291    
## WTANGER      0.1838356  0.0982087   1.872  0.06610 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4192 on 60 degrees of freedom
## Multiple R-squared:  0.1585, Adjusted R-squared:  0.1165 
## F-statistic: 3.768 on 3 and 60 DF,  p-value: 0.01516

plot(antiracistselfeff.lm)

residualsselfeff <- resid(antiracistselfeff.lm)
model_performance((antiracistselfeff.lm))

report(antiracistselfeff.lm)

## We fitted a linear model (estimated using OLS) to predict SELFEFF with WTFEAR,
## WTGUILT and WTANGER (formula: SELFEFF ~ WTFEAR + WTGUILT + WTANGER). The model
## explains a statistically significant and moderate proportion of variance (R2 =
## 0.16, F(3, 60) = 3.77, p = 0.015, adj. R2 = 0.12). The model's intercept,
## corresponding to WTFEAR = 0, WTGUILT = 0 and WTANGER = 0, is at 2.82 (95% CI
## [1.72, 3.91], t(60) = 5.13, p < .001). Within this model:
## 
##   - The effect of WTFEAR is statistically significant and negative (beta = -0.24,
## 95% CI [-0.41, -0.06], t(60) = -2.71, p = 0.009; Std. beta = -0.32, 95% CI
## [-0.56, -0.08])
##   - The effect of WTGUILT is statistically non-significant and negative (beta =
## -4.31e-04, 95% CI [-0.10, 0.10], t(60) = -8.93e-03, p = 0.993; Std. beta =
## -1.08e-03, 95% CI [-0.24, 0.24])
##   - The effect of WTANGER is statistically non-significant and positive (beta =
## 0.18, 95% CI [-0.01, 0.38], t(60) = 1.87, p = 0.066; Std. beta = 0.23, 95% CI
## [-0.02, 0.47])
## 
## Standardized parameters were obtained by fitting the model on a standardized
## version of the dataset. 95% Confidence Intervals (CIs) and p-values were
## computed using a Wald t-distribution approximation.

shapiro.test(residualsselfeff)

## 
##  Shapiro-Wilk normality test
## 
## data:  residualsselfeff
## W = 0.96818, p-value = 0.09721

antiracistselfeff.lm<-lm(SELFEFF ~ WTFEAR + WTGUILT + WTANGER, data = coredata)
summary(antiracistselfeff.lm)

## 
## Call:
## lm(formula = SELFEFF ~ WTFEAR + WTGUILT + WTANGER, data = coredata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.79186 -0.35362  0.02283  0.21280  0.87491 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.8170242  0.5487552   5.133 3.24e-06 ***
## WTFEAR      -0.2384423  0.0878873  -2.713  0.00869 ** 
## WTGUILT     -0.0004311  0.0482747  -0.009  0.99291    
## WTANGER      0.1838356  0.0982087   1.872  0.06610 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4192 on 60 degrees of freedom
## Multiple R-squared:  0.1585, Adjusted R-squared:  0.1165 
## F-statistic: 3.768 on 3 and 60 DF,  p-value: 0.01516

plot(antiracistselfeff.lm)

residualsselfeff <- resid(antiracistselfeff.lm)
model_performance((antiracistselfeff.lm))

report(antiracistselfeff.lm)

## We fitted a linear model (estimated using OLS) to predict SELFEFF with WTFEAR,
## WTGUILT and WTANGER (formula: SELFEFF ~ WTFEAR + WTGUILT + WTANGER). The model
## explains a statistically significant and moderate proportion of variance (R2 =
## 0.16, F(3, 60) = 3.77, p = 0.015, adj. R2 = 0.12). The model's intercept,
## corresponding to WTFEAR = 0, WTGUILT = 0 and WTANGER = 0, is at 2.82 (95% CI
## [1.72, 3.91], t(60) = 5.13, p < .001). Within this model:
## 
##   - The effect of WTFEAR is statistically significant and negative (beta = -0.24,
## 95% CI [-0.41, -0.06], t(60) = -2.71, p = 0.009; Std. beta = -0.32, 95% CI
## [-0.56, -0.08])
##   - The effect of WTGUILT is statistically non-significant and negative (beta =
## -4.31e-04, 95% CI [-0.10, 0.10], t(60) = -8.93e-03, p = 0.993; Std. beta =
## -1.08e-03, 95% CI [-0.24, 0.24])
##   - The effect of WTANGER is statistically non-significant and positive (beta =
## 0.18, 95% CI [-0.01, 0.38], t(60) = 1.87, p = 0.066; Std. beta = 0.23, 95% CI
## [-0.02, 0.47])
## 
## Standardized parameters were obtained by fitting the model on a standardized
## version of the dataset. 95% Confidence Intervals (CIs) and p-values were
## computed using a Wald t-distribution approximation.

shapiro.test(residualsselfeff)

## 
##  Shapiro-Wilk normality test
## 
## data:  residualsselfeff
## W = 0.96818, p-value = 0.09721

Let’s look at just the white racial affects on ACCOUNT.

Linear regression of the white racial affects on ACCOUNT

#antiracistaffectsonly.lm<-lm(ACCOUNT ~ WTANGER + WTGUILT + WTFEAR, data = subset(coredata, extremelyinfluential == FALSE))

antiracistaffectsonly.lm<-lm(ACCOUNT ~ WTANGER + WTGUILT + WTFEAR, data = coredata)
summary(antiracistaffectsonly.lm)

## 
## Call:
## lm(formula = ACCOUNT ~ WTANGER + WTGUILT + WTFEAR, data = coredata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.45991 -0.23845  0.01118  0.33163  0.88462 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.70406    0.63989   5.789 2.76e-07 ***
## WTANGER      0.41375    0.11452   3.613 0.000620 ***
## WTGUILT      0.08680    0.05629   1.542 0.128351    
## WTFEAR      -0.39457    0.10248  -3.850 0.000289 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4888 on 60 degrees of freedom
## Multiple R-squared:  0.3496, Adjusted R-squared:  0.3171 
## F-statistic: 10.75 on 3 and 60 DF,  p-value: 9.448e-06

plot(antiracistaffectsonly.lm)

Linear regression of ACCOUNT with just SELFEFF and WTFEAR

#antiracistaffectsonly.lm<-lm(ACCOUNT ~ WTFEAR + WTGUILT + WTANGER, data = subset(coredata, extremelyinfluential == FALSE))

realmodelmaybe.lm<-lm(ACCOUNT ~ SELFEFF + WTFEAR + WTANGER, data = coredata)
summary(realmodelmaybe.lm)

## 
## Call:
## lm(formula = ACCOUNT ~ SELFEFF + WTFEAR + WTANGER, data = coredata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.52698 -0.20677  0.03187  0.29712  0.80873 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.07535    0.66275   3.131  0.00269 ** 
## SELFEFF      0.62314    0.13072   4.767 1.23e-05 ***
## WTFEAR      -0.22845    0.09378  -2.436  0.01784 *  
## WTANGER      0.32959    0.10087   3.267  0.00180 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4244 on 60 degrees of freedom
## Multiple R-squared:  0.5096, Adjusted R-squared:  0.4851 
## F-statistic: 20.78 on 3 and 60 DF,  p-value: 2.362e-09

plot(realmodelmaybe.lm)

Variable Selection

Sequential Estimation

Per Dr Patrick conversation 3/6/24: Selection Regression? Sequential/ hierarchical / clockwise? I investigated R seqest package, to understand the effects from the emotion variables to SELFEFF, then to ACCOUNT. (The docs for this package very hard to understand.) It has a seq_cat_model() function for sequential logistic regression on categorical predictors, which is not my data. Also seq_GEE_model() function for Generalized Estimating Equations, which is either correlated data, or data that is grouped, which also does not seem to be a match for my data (??).

Discuss with Arcenis.

Also looked at ase_seq_logit() which sounded perfect: Variable selection and stopping criterion in logistic regression model. This sounded exactlylike what Dr. Patrick was suggesting. However my data is not binary so logistic regression not appropriate (??). And, I then realized that it’s not actually showing mediation. It’s just letting me confirm that all of the variables are legitimate to include in the regression, because they indicate direct effect on ACCOUNT. I actually want to be testing for indirect effect though (see use of mediation package below, which I think I cannot use because of small sample?).

The code below from leaps package reinforces the same info that I got from stepwise regression, and confirms that all of the independent variables belong in the regression model with the DV ACCOUNT. However, this does not actually tell me about mediating role of SELFEFF, it just says that all variables are relevant when regressed on ACCOUNT.

#from https://advstats.psychstat.org/book/mregression/selection.php

library(leaps)

allvars <- regsubsets(ACCOUNT ~ SELFEFF + WTANGER + WTGUILT + WTFEAR, data = coredata, nbest=1, nvmax=5)
info <- summary(allvars)

cbind(info$which, round(cbind(rsq=info$rsq, adjr2=info$adjr2, cp=info$cp, bic=info$bic, rss=info$rss), 3))

##   (Intercept) SELFEFF WTANGER WTGUILT WTFEAR   rsq adjr2     cp     bic    rss
## 1           1       1       0       0      0 0.381 0.371 18.632 -22.377 13.643
## 2           1       1       1       0      0 0.461 0.443 10.455 -27.089 11.877
## 3           1       1       1       0      1 0.510 0.485  6.294 -28.966 10.808
## 4           1       1       1       1      1 0.536 0.504  5.000 -28.284 10.237

#plot(2:8, info$cp, xlab='P (# of predictors + 1)', ylab='Cp')
#abline(a=0,b=1)

Stepwise Regression

This type of regression is used to figure out which of the variables actually contribute to the dependent variable. A stepwise procedure

First is a Forward Stepwise Selection AIC is Akaike Information Criterion which is a goodness-of-fit measure, when used to compare to another model (a sole AIC score independently is meaningless). A smaller AIC is preferable to larger.

#define intercept-only model
#intercept_only <- lm(ACCOUNT ~ 1, data = subset(coredata, extremelyinfluential == FALSE))

intercept_only <- lm(ACCOUNT ~ 1, data = coredata)

#define model with all predictors
#all <- lm(ACCOUNT ~ WTFEAR + WTGUILT + WTANGER + SELFEFF, data = subset(coredata, extremelyinfluential == FALSE))

all <- lm(ACCOUNT ~ WTFEAR + WTGUILT + WTANGER + SELFEFF, data = coredata)

#perform forward stepwise regression
#trace=0 will suppress the full results of the stepwise to save space in the output
forward <- step(intercept_only, direction='forward', scope=formula(all), trace=0)
#forward <- step(intercept_only, direction='forward', scope=formula(all))


#view results of forward stepwise regression
forward$anova

#view final model
forward$coefficients

## (Intercept)     SELFEFF     WTANGER      WTFEAR     WTGUILT 
##  1.94789385  0.62341231  0.29914801 -0.24592658  0.08706655

(When the full dataset of all professional identities is included in this analysis (not just counselors), the AIC scores have much greater variability between the additional three variables, which is interesting and probably deserves further exploration - though, might only be an aspect of the larger dataset.)

Let’s do the same thing with a backwards stepwise selection:

#define intercept-only model
intercept_only <- lm(ACCOUNT ~ 1, data = coredata)

#define model with all predictors
all <- lm(ACCOUNT ~ WTFEAR + WTGUILT + WTANGER + SELFEFF, data = coredata)

#perform both-directions stepwise regression
backwardsw <- step(intercept_only, direction='backward', scope=formula(all), trace=0)

#view results of forward stepwise regression
backwardsw$anova

#view final model
backwardsw$coefficients

## (Intercept) 
##    5.464844

Partial Mediation Model

Test for mediation:

Show that ‘WTANGER’, ‘WTGUILT’, and ‘WTFEAR’ are associated with ‘ACCOUNT’ (without ‘SELFEFF’ in the model).

summary(antiracistaffectsonly.lm)

## 
## Call:
## lm(formula = ACCOUNT ~ WTANGER + WTGUILT + WTFEAR, data = coredata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.45991 -0.23845  0.01118  0.33163  0.88462 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.70406    0.63989   5.789 2.76e-07 ***
## WTANGER      0.41375    0.11452   3.613 0.000620 ***
## WTGUILT      0.08680    0.05629   1.542 0.128351    
## WTFEAR      -0.39457    0.10248  -3.850 0.000289 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4888 on 60 degrees of freedom
## Multiple R-squared:  0.3496, Adjusted R-squared:  0.3171 
## F-statistic: 10.75 on 3 and 60 DF,  p-value: 9.448e-06

WTGUILT p-value not less than .05, so no significant evidence that it is associated with ACCOUNT when SELFEFF not in the model, and the effect size is small anyway.

Show that ‘WTANGER’, ‘WTGUILT’, and ‘WTFEAR’ are associated with ‘SELFEFF’.

summary(antiracistselfeff.lm)

## 
## Call:
## lm(formula = SELFEFF ~ WTFEAR + WTGUILT + WTANGER, data = coredata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.79186 -0.35362  0.02283  0.21280  0.87491 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.8170242  0.5487552   5.133 3.24e-06 ***
## WTFEAR      -0.2384423  0.0878873  -2.713  0.00869 ** 
## WTGUILT     -0.0004311  0.0482747  -0.009  0.99291    
## WTANGER      0.1838356  0.0982087   1.872  0.06610 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4192 on 60 degrees of freedom
## Multiple R-squared:  0.1585, Adjusted R-squared:  0.1165 
## F-statistic: 3.768 on 3 and 60 DF,  p-value: 0.01516

Only WTFEAR is has a statistically significant effect, with moderate effect size of -0.24. WTANGER is approaching significance, with a smaller effect of 0.18. WTGUILT cannot be said to have a significant effect, with p-value approaching 1, and the beta is tiny. The adjusted R squared of this model is quite low, at .116. It is not established that there is a relationship between WTGUILT and SELFEFF. HOW DOES THIS AFFECT MY MEDIATION ANALYSIS?

Show that ‘SELFEFF’ is associated with ‘ACCOUNT’ when ‘WTANGER’, ‘WTGUILT’, and ‘WTFEAR’ are also in the model.

summary(antiracism.lm)

## 
## Call:
## lm(formula = ACCOUNT ~ SELFEFF + WTANGER + WTGUILT + WTFEAR, 
##     data = coredata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.49631 -0.22641  0.03488  0.31135  0.73861 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.94789    0.65421   2.977  0.00421 ** 
## SELFEFF      0.62341    0.12829   4.859 9.07e-06 ***
## WTANGER      0.29915    0.10040   2.979  0.00419 ** 
## WTGUILT      0.08707    0.04797   1.815  0.07462 .  
## WTFEAR      -0.24593    0.09254  -2.657  0.01011 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4165 on 59 degrees of freedom
## Multiple R-squared:  0.5355, Adjusted R-squared:  0.504 
## F-statistic: 17.01 on 4 and 59 DF,  p-value: 2.515e-09

All 4 of the IVs are shown to be associated to ACCOUNT. So suddenly here, WTGUILT shows up (though p-value is .07 which is higher than my .05 alpha).

If ‘SELFEFF’ is a mediator, the relationships between ‘WTANGER’, ‘WTGUILT’, ‘WTFEAR’ and ‘ACCOUNT’ should become less significant when ‘SELFEFF’ is included in the model.

WTANGER and WTFEAR p-values increase when SELFEFF is in the model.

WITH SELFEFF WTANGER 0.29915 WTFEAR -0.24593

WITHOUT SELFEFF WTANGER 0.41375 WTFEAR -0.39457

For both of those, the coefficients are lower when SELFEFF is in the model, which implies that SELFEFF is a mediator on both.

With SELFEFF in the model, WTANGER is lower at 0.299 (compared to 0.41375 without SELFEFF), and p-value increases but still significant. WTFEAR also decreases (from -0.24593 with SELFEFF, to -0.39457 without) and p-value also increases but still significant. This suggests that SELFEFF is mediating the relationship between these variables and ACCOUNT.

A mediator variable is one that explains the how or why of a observed relationship between an independent variable and dependent variable. So, in this case, SELFEFF appears to be explaining part of the relationship between WTANGER and WTFEAR and ACCOUNT.

WITH SELFEFF

WITHOUT SELFEFF WTGUILT .086 and p-value .13

WITH SELFEFF WTGUILT 0.08707 and p-value = .07 which is approaching significance.

WTGUILT is somewhat different: When SELFEFF is in the model, the relationship of WTGUILT on ACCOUNT improves slightly. However, WTGUILT has no association with SELFEFF when they are examined separately. The predictive influence of WTGUILT on ACCOUNT improves marginally when SELFEFF is in the model, though it still is not at my alpha of .05. When SELFEFF is not in the model, the coeff for WTGUILT is .086 and p-value .13. When SELFEFF is in the model, WTGUILT bumps up a tiny bit to .087 and the p-value approaches statistical significance, at .07. When I regress WTGUILT on SELFEFF, then the coeff is tiny (-0.0004311) and not statistically significant (p-value nearly 1.0).

This could suggest that ‘SELFEFF’ is acting as a suppressor variable. A suppressor variable is one that increases the predictive validity of another variable (or variables) by its inclusion in a regression equation.

The ‘yhat’ package in R has a function called ‘suppressorEffect()’ that can help you test for suppressor effects.

Another option is the ‘rms’ package. It provides a function called ‘vif()’, which can calculate the variance inflation factor (VIF). A low VIF for ‘SELFEFF’ might indicate it’s acting as a suppressor.

I got VIF values for all 4 of my IVs to test for collinearity using this formula: ACCOUNT ~ SELFEFF + WTGUILT + WTFEAR + WTANGER

Can I use those same VIF values to evaluate if SELFEF is a suppressor? Or is it a different function?

For a partial mediation model, you’d typically follow these steps: 1. First, run a regression with just the independent variables (IV) and dependent variable (DV). This gives you the total effect of the IV on the DV. 2. Next, run a regression with the IV and the mediator. This gives you the effect of the IV on the mediator. 3. Then, run a regression with the mediator and the DV, controlling for the IV. This gives you the effect of the mediator on the DV, controlling for the IV. 4. Finally, run a regression with both the IV and the mediator on the DV. This gives you the direct effect of the IV on the DV, controlling for the mediator.

Robust Regression

(After sweating through the full bootstrap thing at the end of this file, that guy said “if you have real outliers or very influential observations in your data, they will be given more weight then needed. In this case you would use a robust regression,” which brings me back to thist)

library(MASS)

#fit robust regression model
#antiracism.robust <- rlm(ACCOUNT ~ WTFEAR + WTGUILT + WTANGER + SELFEFF, data=subset(coredata, extremelyinfluential == FALSE))

# since it's robust, and since the extreme influential point actually is not a data entry error that I can tell, include the full dataset here
antiracism.robust <- rlm(ACCOUNT ~ WTFEAR + WTGUILT + WTANGER + SELFEFF, data=coredata)
summary(antiracism.robust)

## 
## Call: rlm(formula = ACCOUNT ~ WTFEAR + WTGUILT + WTANGER + SELFEFF, 
##     data = coredata)
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.50995 -0.24448  0.02026  0.28598  0.70805 
## 
## Coefficients:
##             Value   Std. Error t value
## (Intercept)  2.1614  0.6453     3.3494
## WTFEAR      -0.2227  0.0913    -2.4399
## WTGUILT      0.0705  0.0473     1.4889
## WTANGER      0.2824  0.0990     2.8519
## SELFEFF      0.5987  0.1265     4.7314
## 
## Residual standard error: 0.3856 on 59 degrees of freedom

plot(antiracism.robust)

Now compare to the ordinary least squares regression of the model, with the influential outlier removed. The residual standard error returned next will measure the standard deviation of the residuals in each of the models. A lower RSE means a better fit to the data.

#find residual standard error of ols model with outliers removed
summary(antiracism2.lm)$sigma

## [1] 0.3952902

#find residual standard error of robust  model
summary(antiracism.robust)$sigma

## [1] 0.3855842

visualize the results

coredata$robust_fitted <- fitted(antiracism.robust)
coredata$classical_fitted <- fitted(antiracism.lm)

ggplot(coredata, aes(x = ACCOUNT, y = SELFEFF)) +
  geom_point() +
  geom_line(aes(y = robust_fitted), color = "red") +
  geom_line(aes(y = classical_fitted), color = "blue") +
  ggtitle("Comparison of Robust vs. Classical Regression")

ANOVA

The null hypothesis of these one-way ANOVA calculations is that the group means are all equal.

ACCOUNT and sexuality

# from https://personality-project.org/r/r.guide.html#descriptive
aov.ex1 = aov(ACCOUNT~sexuality,data = subset(coredata, extremelyinfluential == FALSE))  #do the analysis of variance
summary(aov.ex1)                                    #show the summary table

##             Df Sum Sq Mean Sq F value Pr(>F)  
## sexuality    7  3.873  0.5533   2.734 0.0165 *
## Residuals   55 11.131  0.2024                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

print(model.tables(aov.ex1,"means"),digits=3)

## Tables of means
## Grand mean
##          
## 5.506614 
## 
##  sexuality 
##     Alternate or self-identified answer Bisexual  Gay Heterosexual or straight
##                                    5.71     5.41 3.75                     5.51
## rep                                2.00     8.00 1.00                    40.00
##     Lesbian Pansexual Queer Undecided/Exploring
##           6      5.81  5.54                5.83
## rep       1      3.00  7.00                1.00

    #report the means and the number of subjects/cell
boxplot(ACCOUNT~sexuality,data = subset(coredata, extremelyinfluential == FALSE))

    #graphical summary appears in graphics window

The p-value is < .05 which means that the null hypothesis - that the groups are equivalent on ACCOUNT - is rejected. (Need to fix the labels on the box plots, but they show it too.)

ACCOUNT and age group

aov.age = aov(ACCOUNT~agecat,data = subset(coredata, extremelyinfluential == FALSE))  #do the analysis of variance
summary(aov.age)                                    #show the summary table

##             Df Sum Sq Mean Sq F value Pr(>F)
## agecat       4   0.72  0.1801   0.731  0.574
## Residuals   58  14.28  0.2463

print(model.tables(aov.age,"means"),digits=3)

## Tables of means
## Grand mean
##          
## 5.506614 
## 
##  agecat 
##     18 to 24 25 to 34 35 to 44 45 to 54 55 to 64
##         5.06     5.48     5.52     5.58     5.53
## rep     3.00    14.00    24.00    16.00     6.00

    #report the means and the number of subjects/cell
boxplot(ACCOUNT~agecat,data = subset(coredata, extremelyinfluential == FALSE))

    #graphical summary appears in graphics window

The box plots and the p-value > .05 actually both agree that there are not significant differences in ACCOUNT by age groups in the sample.

ACCOUNT and region

aov.region = aov(ACCOUNT~region,data = subset(coredata, extremelyinfluential == FALSE))  #do the analysis of variance
summary(aov.region)                                    #show the summary table

##             Df Sum Sq Mean Sq F value Pr(>F)
## region       3  0.912  0.3042   1.273  0.292
## Residuals   59 14.092  0.2388

print(model.tables(aov.region,"means"),digits=3)

## Tables of means
## Grand mean
##          
## 5.506614 
## 
##  region 
##     Midwest Northeast South  West
##        5.29      5.63  5.56  5.57
## rep   15.00      8.00 26.00 14.00

    #report the means and the number of subjects/cell
boxplot(ACCOUNT~region,data = subset(coredata, extremelyinfluential == FALSE))

    #graphical summary appears in graphics window

ACCOUNT and SES (class)

aov.class = aov(ACCOUNT~class,data = subset(coredata, extremelyinfluential == FALSE))  #do the analysis of variance
summary(aov.class)                                    #show the summary table

##             Df Sum Sq Mean Sq F value Pr(>F)
## class        4  0.557  0.1393   0.559  0.693
## Residuals   58 14.447  0.2491

print(model.tables(aov.class,"means"),digits=3)

## Tables of means
## Grand mean
##          
## 5.506614 
## 
##  class 
##     Lower middle class Skip/Prefer not to say Upper class Upper middle class
##                    5.5                   5.92        5.92               5.47
## rep               17.0                   1.00        2.00              38.00
##     Working class
##              5.55
## rep          5.00

    #report the means and the number of subjects/cell
boxplot(ACCOUNT~class,data = subset(coredata, extremelyinfluential == FALSE))

    #graphical summary appears in graphics window

ACCOUNT and counseling experience (years practicing)

aov.yrsprac = aov(ACCOUNT~yrsprac,data = subset(coredata, extremelyinfluential == FALSE))  #do the analysis of variance
summary(aov.yrsprac)                                    #show the summary table

##             Df Sum Sq Mean Sq F value Pr(>F)
## yrsprac      5  1.964  0.3928   1.717  0.145
## Residuals   57 13.040  0.2288

print(model.tables(aov.yrsprac,"means"),digits=3)

## Tables of means
## Grand mean
##          
## 5.506614 
## 
##  yrsprac 
##      0-1 16-24  2-4 25 or more  5-8  9-15
##      5.3  5.55  5.8       5.39 5.53  5.43
## rep 15.0  7.00 14.0       3.00 9.00 15.00

    #report the means and the number of subjects/cell
boxplot(ACCOUNT~yrsprac,data = subset(coredata, extremelyinfluential == FALSE))

    #graphical summary appears in graphics window

SKIP EVERYTHING BELOW

Bootstrap

Resampling. Recommended by coaching D1 for working with nonparametric data. Likely due to small sample size.

Learning resource - bootstrapping using the boot package

(Abandoned below…)

# built from example at https://www.statmethods.net/advstats/bootstrapping.html
library(boot)
# function to obtain regression weights

# bs <- function(formula, data, indices)
# {
#   d <- data[indices,] # allows boot to select sample
#   fit <- lm(formula, data=d)
#   return(coef(fit))
# }
# # bootstrapping with 1000 replications
# results <- boot(data = subset(coredata, extremelyinfluential == FALSE), statistic=bs,
#    R=1000, formula=ACCOUNT ~WTFEAR + WTGUILT + WTANGER + SELFEFF)
# 
# # view results
# results
# plot(results, index=1) # intercept
# plot(results, index=2) # WTGUILT
# plot(results, index=3) # WTGUILT
# plot(results, index=4) # WTANGER
# plot(results, index=5) # SELFEFF
# 
# # get 95% confidence intervals
# boot.ci(results, type="bca", index=1) # intercept
# boot.ci(results, type="bca", index=2) # WTGUILT
# boot.ci(results, type="bca", index=3) # WTGUILT
# boot.ci(results, type="bca", index=4) # WTANGER
# boot.ci(results, type="bca", index=5) # SELFEFF

Let’s start fresh… Mon Feb 4.

Learning resource - non parametric boostrapped regression via tidymodels His blog post

Bootstrapping do not have any assumptions of the data to meet.

Works better for small samples.

Describes variation in the data better than standard lm.

library(tidymodels)

# this makes the boostrap sampling reproducible, it always will have the same data based on this call (the 108 is random, could be anything)
set.seed(108)

# Bootstrap the data with resampling

boot_data <- bootstraps(data = subset(coredata, extremelyinfluential == FALSE & influential == FALSE), times = 2000) 


# bootstrap new regressions and unnest all fits
boot_models <- boot_data %>% 
  mutate(model     = map(splits, ~lm(ACCOUNT ~ SELFEFF + WTANGER + WTGUILT + WTFEAR, data = subset(coredata, extremelyinfluential == FALSE))), 
         augmented = map(model, augment) )

boot_models

boot_aug <- boot_models %>% 
  unnest(augmented) %>% 
  dplyr::select(-splits, -model)

boot_aug



#### uSE BOOT_AUG FOR THE LM





ggplot(data = boot_aug, aes(x = SELFEFF, y = ACCOUNT)) +
  geom_line(aes(y = .fitted, group = id), col = "grey",  size = 0.25) 










# # Plot estimates distribution of 1000 estimates
# boot_coefs <- boot_models %>%
#   unnest(coefs) %>%
#   dplyr::select(term, estimate)
# 
# 
# 
# boot_coefs
# 
# boot_meds <- boot_coefs %>% 
#   group_by(term) %>% 
#   summarise(
#     med_est   = median(estimate),
#     quantile  = quantile(estimate, 0.5  ),
#     conf.low  = quantile(estimate, 0.025),
#     conf.high = quantile(estimate, 0.975),
#     conf.25   = quantile(estimate, 0.25 ),
#     conf.75   = quantile(estimate, 0.75 ))
# 
# boot_meds
# 
# 
# 
# boot_coefs %>%
#   ggplot(aes(x = estimate)) +
#   geom_rect(aes(x = med_est, xmin = conf.low, xmax = conf.high, ymin = 0, ymax = Inf), data = boot_meds, alpha = 0.1, fill = "green") +
#   geom_rect(aes(x = med_est, xmin = conf.25, xmax = conf.75, ymin = 0, ymax = Inf), data = boot_meds, alpha = 0.3, fill = "green") +
#   geom_density() +
#   geom_vline(data = boot_meds, aes(xintercept = med_est))+
#   facet_wrap(~term, scales = "free")
# 
# 
# boot_models <- bootstraps(data = subset(coredata, extremelyinfluential == FALSE), times = 10000) %>% 
#   mutate(model = map(splits, ~lm(ACCOUNT ~ 0 + SELFEFF + WTANGER + WTGUILT + WTFEAR, data = subset(coredata, extremelyinfluential == FALSE))), 
#          coefs = map(model, tidy) ) # optional "conf.int = TRUE"
# 
# boot_coefs <- boot_models %>%
#   unnest(coefs) %>%
#   dplyr::select(term, p.value)
# 
# 
# 
# boot_coefs %>%
#   ggplot(aes(x = p.value)) +
#   geom_histogram() +
#   facet_wrap(~term, scales = "free")
# 
# boot_aug <- boot_models %>%
#   mutate(augmented = map(model, augment)) %>%
#   unnest(augmented) %>%
#   dplyr::select(-splits, -model)
# 
# 
# 
# # get median estimates, median 95% CIs
# nonpar_med_boot_preds <- boot_aug %>%
#   group_by(SELFEFF) %>%
#   summarise(
#     predicted = median(.fitted ),
#     conf.low  = quantile(.fitted, 0.025),
#     conf.high = quantile(.fitted, 0.975)) %>%
#   mutate(model = "bootstrapped") %>%
#   dplyr::select(model, everything())
# 
# nonpar_med_boot_preds

Run the LM against the boostrapped data

# Note that this is the first lm function in this code with influentials also omitted
# antiracismbooted.lm<-lm(ACCOUNT ~ WTFEAR + WTGUILT + WTANGER + SELFEFF, data = subset(coredata, extremelyinfluential == FALSE & influential == FALSE))

antiracismbooted.lm<-lm(ACCOUNT ~ WTFEAR + WTGUILT + WTANGER + SELFEFF, data = coredata)
model_performance(antiracismbooted.lm)

summary(antiracismbooted.lm)
residualsbooted <- resid(antiracismbooted.lm)
report(antiracismbooted.lm)

# Below works just fine, but the standard plot outputs gives more detail
# plot(fitted(antiracism.lm), residuals)

Plots of bootstrapped LM

plot(antiracismbooted.lm)

#Create density plot of residuals
plot(density(residualsbooted))

Shapiro-Wilk text on residuals from LM of booted dataset

shapiro.test(residualsbooted)

Robust ??

library(robustbase)
rm <- lmrob(ACCOUNT ~ SELFEFF + WTANGER + WTGUILT + WTFEAR, data = subset(coredata, extremelyinfluential == FALSE))
summary(rm)

Poisson

Keeping this in here for now, though further reading indicates that Poisson is best for a “count” dependent variable (such as, the number of people who die each year from bee stings). My DV is not a count.

# from https://www.geeksforgeeks.org/poisson-regression-in-r-programming/

# output <-glm(formula = breaks ~ wool + tension, 
#              data = warpbreaks, family = poisson) 
# print(summary(output)) 

# the below runs but I don't think it's a valid function to run on this data
outputpo <-glm(formula = ACCOUNT ~ WTFEAR + WTGUILT + WTANGER + SELFEFF, data = subset(coredata, extremelyinfluential == FALSE), family = poisson) 
print(summary(output)) 

plot(outputpo)

Other types of regressions I’ve tried to run

An ordinal logistic regression is for a dependent variable with at least three groups in a natural order, which the ACCOUNTABILITY measure seems to have (given the available answers to respondents of Agree through Disagree - wait, no: it’s a total score of the scale which is actually a continuous variable right?). See types of regression here: https://statisticsbyjim.com/regression/choosing-regression-analysis/

I’m really stuck on whether the variables, as summations of ordered values, are actually ordinal or if they are interval. :-(

Generalized Linear Model

From https://towardsdatascience.com/generalized-linear-models-9ec4dfe3dc3f

I considered this however just could not figure out which family to use. It’s not count so not Poisson. It’s not binomial. I don’t think it’s Gaussian. Blech.

output.glm <-glm(formula = ACCOUNT ~ WTFEAR + WTGUILT + WTANGER + SELFEFF, data = subset(coredata, extremelyinfluential == FALSE), family = poisson) 
print(summary(output)) 

summary(output.glm)
plot(output.glm)

Two-Stage Least-Squares Regression using ivreg

Two-Stage Least-Squares Regression with Diagnostics This uses instrumental variables (which I’m doubtful that my variables actually are) Subsequently from Arcenis 3/15/24 (see Wyzant for message): “I looked through some literature on 2-stage least squares (2SLS) and found that it really is meant for use with instrumental variables. Your model would ideally be run as a sturctural equation model (SEM) meaning the {lavaan} package. The good news is that I think that the specification error in your package likely happened because of misspecification. You might have set it to test for covariances between/among all of your explanatory variables, which would give you fewer degrees of freedom. That’s easily addressable if it is the problem.

If that doesn’t work, you can still try 2SLS, but you’d have to specify the first stage first, then use those estimates to use as an OLS regressor (very doable), you’d just have to ensure that the outcome variable is not highly correlated with your DV. I can help you with that. If you do it this way, then 2SLS is valid.”

My interpretation: This doesn’t work based on my variable types. Variables are definitely not instrumental, there is direct effect of all of them on ACCOUNT.

Also, if I’m understanding the second recommendation: The outcome variable of the first stage would be SELFEFF, and it is actually highly correlated with the DV ACCOUNT. So I think it wouldn’t work for that reason also.

# ivreg package
library(ivreg)

## Warning: package 'ivreg' was built under R version 4.3.3

m_iv <- ivreg(ACCOUNT ~ SELFEFF | WTANGER + WTGUILT + WTFEAR, data = coredata)
summary(m_iv)

## 
## Call:
## ivreg(formula = ACCOUNT ~ SELFEFF | WTANGER + WTGUILT + WTFEAR, 
##     data = coredata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.53007 -0.62336  0.06414  0.49168  1.15835 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.7236     1.5671  -0.462   0.6459    
## SELFEFF       1.8551     0.4691   3.954   0.0002 ***
## 
## Diagnostic tests:
##                  df1 df2 statistic p-value    
## Weak instruments   3  60     3.768 0.01516 *  
## Wu-Hausman         1  61    13.923 0.00042 ***
## Sargan             2  NA     2.057 0.35753    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6611 on 62 degrees of freedom
## Multiple R-Squared: -0.2296, Adjusted R-squared: -0.2494 
## Wald test: 15.64 on 1 and 62 DF,  p-value: 0.0001996

Statistical Mediation Analysis (invalid? apparently underpowered)

This type of analysis is frequently done in clinical trials. It includes sensitivity assessments. It appears to be somewhat of an alternative to path analysis/SEM.

Rule of thumb for mediation analysis is ten observations per variable. Could not find a direct statement of this, however Fritz & MacKinnon (2007) reported on a metaanalysis of different types of mediation analyses. Hair et al. (2011) gives rules of thumb that may be applicable.

The mediation package in R primarily uses regression-based methods for mediation analysis (it does not use structural equation modeling). The main function in the package, “mediate”, estimates the causal mediation effects using regression models for the mediator and the outcome, via tests of the indirect effect (i.e., non-SEM product-of-coefficients tests)

From Fritz & MacKinnon (2007): “For example, consider a researcher interested in testing whether intentions mediate the relation between attitudes and behaviors, as proposed by the theory of planned behavior (Ajzen, 1985). The researcher believes that in the study, the effect of attitudes on intentions will be of medium size, the effect of in- tentions on behaviors will be small, and intentions will com- pletely mediate the effect of attitudes on behavior. Using the empirical power tables, the researcher can see that for the joint significance test, a sample size of 405, or a larger sample if measurement error is present, is required for .8 power.”

Converting that to my study: I now know for example (given that I have analysis underway) that total effect of WTANGER on ACCOUNT is of .4 which is medium size (alpha path), the indirect effect of WTANGER on ACCOUNT through SELFEFF is small at .11 (beta path), and SELFEFF does not completely mediate the effect of the WTANGER on ACCOUNT. Using the table in Fritz & MacKinnon (2007) to estimate sample size needed for .8 power, the alpha path is M, the beta path is S, the method is PRODCLIN (indirect product of coefficients) which gives sample size needed of 404 (if I’m doing it correctly).

For future analyses where using this, see Ato García et al. (2014) for guidance in interpreting results - for example their paragraph here which describes their own analysis using mediation package that closely resembles what I have done:

“The two equations were estimated with the R-lm function, the mediation effects with the R-mediate function and the sensitivity analysis with the R-medsens function of the mediation package. For the random replicate used, we found a non-significant indirect effect (ACME = 0.0755, p = .08) and a mediated proportion of 0.115 (p = 0.08), so the presence of a mediating effect was discarded. The second part of Output 1 (Table 1) is a sensitivity analysis with the medsens function, which reveals that the confidence interval of indirect effects includes zero in the entire spectrum of the sensitivity parameter ρ (from - 0.9 to + 0.9). In fact, sensitivity analysis is not required when the indirect effect is not significant. Figure 4 is a plot of sensitivity analysis and shows, together with the axes of indirect effect and ρ, the observed mediating effect (dashed line) and the values that the indirect effect would reach varying the sensitivity parameter (solid curved line) along its range in steps of .10. Note that the confidence interval (limits represented with a grey background) always includes zero for the indirect effect whatever the value of ρ is. This result is expected as a consequence of imposing a zero correlation between the residuals of Equations (1) and (2).”

(My commentary: Remember, if zero correlation between the residuals, or rho = 0, that means that there is no identifiable relationship between the way the predictors move in their respective model equations)

Note that in each of the Sensitivity Region matrixes, it reports on Rho at which ACME = 0: 0.5. This is not an output of the sensitivity calculation, it’s saying how the calculation was done. That rho is not a reportable number relevant to the variables.

library(mediation)

## Loading required package: Matrix

## 
## Attaching package: 'Matrix'

## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack

## Loading required package: mvtnorm

## Loading required package: sandwich

## mediation: Causal Mediation Analysis
## Version: 4.5.0

# The psych package also has a mediate() function

# https://cran.r-project.org/web/packages/mediation/vignettes/mediation.pdf
# Model Based Causal Mediation Analysis

# mediator model
model.m <- lm(SELFEFF ~ WTANGER + WTGUILT + WTFEAR, data=coredata)

# outcome model
model.y <- lm(ACCOUNT ~ SELFEFF + WTANGER + WTGUILT + WTFEAR, data = coredata)

# # mediator model
# model.m <- lm(SELFEFF ~ WTANGER, data=coredata)
# 
# # outcome model
# model.y <- lm(ACCOUNT ~ SELFEFF + WTANGER, data = coredata)

# causal mediation analysis - WTANGER
 m1.out <- mediation::mediate(model.m, model.y, sims=1000, treat="WTANGER", mediator="SELFEFF")

summary(m1.out)

## 
## Causal Mediation Analysis 
## 
## Quasi-Bayesian Confidence Intervals
## 
##                Estimate 95% CI Lower 95% CI Upper p-value    
## ACME            0.11310     -0.00122         0.25   0.054 .  
## ADE             0.30112      0.10233         0.49   0.004 ** 
## Total Effect    0.41421      0.19696         0.63  <2e-16 ***
## Prop. Mediated  0.26653     -0.00319         0.61   0.054 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Sample Size Used: 64 
## 
## 
## Simulations: 1000

#plot(m1.out)

# sensitivity analysis - WTANGER
s1.out <- medsens(m1.out)
summary(s1.out)

## 
## Mediation Sensitivity Analysis for Average Causal Mediation Effect
## 
## Sensitivity Region
## 
##        Rho    ACME 95% CI Lower 95% CI Upper R^2_M*R^2_Y* R^2_M~R^2_Y~
##  [1,] -0.9  0.4886      -0.0127       0.9899         0.81       0.3166
##  [2,] -0.8  0.3561      -0.0105       0.7228         0.64       0.2501
##  [3,] -0.7  0.2922      -0.0097       0.5941         0.49       0.1915
##  [4,] -0.6  0.2505      -0.0093       0.5103         0.36       0.1407
##  [5,] -0.5  0.2192      -0.0092       0.4476         0.25       0.0977
##  [6,] -0.4  0.1937      -0.0092       0.3966         0.16       0.0625
##  [7,] -0.3  0.1716      -0.0094       0.3525         0.09       0.0352
##  [8,] -0.2  0.1516      -0.0096       0.3128         0.04       0.0156
##  [9,] -0.1  0.1328      -0.0101       0.2757         0.01       0.0039
## [10,]  0.0  0.1146      -0.0108       0.2400         0.00       0.0000
## [11,]  0.1  0.0964      -0.0118       0.2046         0.01       0.0039
## [12,]  0.2  0.0776      -0.0134       0.1687         0.04       0.0156
## [13,]  0.3  0.0576      -0.0163       0.1316         0.09       0.0352
## [14,]  0.4  0.0355      -0.0221       0.0932         0.16       0.0625
## [15,]  0.5  0.0100      -0.0359       0.0559         0.25       0.0977
## [16,]  0.6 -0.0213      -0.0710       0.0285         0.36       0.1407
## [17,]  0.7 -0.0630      -0.1413       0.0154         0.49       0.1915
## [18,]  0.8 -0.1269      -0.2641       0.0103         0.64       0.2501
## [19,]  0.9 -0.2594      -0.5282       0.0094         0.81       0.3166
## 
## Rho at which ACME = 0: 0.5
## R^2_M*R^2_Y* at which ACME = 0: 0.25
## R^2_M~R^2_Y~ at which ACME = 0: 0.0977

plot(s1.out)

# summarize results

#Causal Mediation Analysis - WTGUILT
m2.out <- mediation::mediate(model.m, model.y, sims=1000, treat="WTGUILT", mediator="SELFEFF")

summary(m2.out)

## 
## Causal Mediation Analysis 
## 
## Quasi-Bayesian Confidence Intervals
## 
##                Estimate 95% CI Lower 95% CI Upper p-value  
## ACME            0.00101     -0.05790         0.07   0.976  
## ADE             0.08478     -0.00629         0.17   0.064 .
## Total Effect    0.08579     -0.01801         0.19   0.102  
## Prop. Mediated  0.04260     -3.11795         2.50   0.890  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Sample Size Used: 64 
## 
## 
## Simulations: 1000

#plot(m2.out)

# sensitivity analysis - WTGUILT
s2.out <- medsens(m2.out)
summary(s2.out)

## 
## Mediation Sensitivity Analysis for Average Causal Mediation Effect
## 
## Sensitivity Region
## 
##        Rho    ACME 95% CI Lower 95% CI Upper R^2_M*R^2_Y* R^2_M~R^2_Y~
##  [1,] -0.9 -0.0011      -0.2466       0.2443         0.81       0.3166
##  [2,] -0.8 -0.0008      -0.1797       0.1780         0.64       0.2501
##  [3,] -0.7 -0.0007      -0.1474       0.1461         0.49       0.1915
##  [4,] -0.6 -0.0006      -0.1264       0.1252         0.36       0.1407
##  [5,] -0.5 -0.0005      -0.1106       0.1096         0.25       0.0977
##  [6,] -0.4 -0.0005      -0.0977       0.0968         0.16       0.0625
##  [7,] -0.3 -0.0004      -0.0866       0.0858         0.09       0.0352
##  [8,] -0.2 -0.0004      -0.0765       0.0758         0.04       0.0156
##  [9,] -0.1 -0.0003      -0.0670       0.0664         0.01       0.0039
## [10,]  0.0 -0.0003      -0.0578       0.0573         0.00       0.0000
## [11,]  0.1 -0.0002      -0.0486       0.0482         0.01       0.0039
## [12,]  0.2 -0.0002      -0.0392       0.0388         0.04       0.0156
## [13,]  0.3 -0.0001      -0.0291       0.0288         0.09       0.0352
## [14,]  0.4 -0.0001      -0.0179       0.0178         0.16       0.0625
## [15,]  0.5  0.0000      -0.0051       0.0050         0.25       0.0977
## [16,]  0.6  0.0000      -0.0106       0.0107         0.36       0.1407
## [17,]  0.7  0.0001      -0.0315       0.0318         0.49       0.1915
## [18,]  0.8  0.0003      -0.0635       0.0641         0.64       0.2501
## [19,]  0.9  0.0006      -0.1297       0.1309         0.81       0.3166
## 
## Rho at which ACME = 0: 0.5
## R^2_M*R^2_Y* at which ACME = 0: 0.25
## R^2_M~R^2_Y~ at which ACME = 0: 0.0977

plot(s2.out)

#Causal Mediation Analysis - WTFEAR
m3.out <- mediation::mediate(model.m, model.y, sims=1000, treat="WTFEAR", mediator="SELFEFF")

summary(m3.out)

## 
## Causal Mediation Analysis 
## 
## Quasi-Bayesian Confidence Intervals
## 
##                Estimate 95% CI Lower 95% CI Upper p-value    
## ACME             -0.149       -0.291        -0.04   0.008 ** 
## ADE              -0.244       -0.431        -0.08   0.006 ** 
## Total Effect     -0.394       -0.604        -0.19  <2e-16 ***
## Prop. Mediated    0.374        0.119         0.71   0.008 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Sample Size Used: 64 
## 
## 
## Simulations: 1000

#plot(m3.out)

# sensitivity analysis - WTFEAR
s3.out <- medsens(m3.out)
summary(s3.out)

## 
## Mediation Sensitivity Analysis for Average Causal Mediation Effect
## 
## Sensitivity Region
## 
##      Rho    ACME 95% CI Lower 95% CI Upper R^2_M*R^2_Y* R^2_M~R^2_Y~
## [1,] 0.3 -0.0748      -0.1531       0.0036         0.09       0.0352
## [2,] 0.4 -0.0461      -0.1126       0.0204         0.16       0.0625
## [3,] 0.5 -0.0130      -0.0717       0.0458         0.25       0.0977
## [4,] 0.6  0.0276      -0.0336       0.0888         0.36       0.1407
## [5,] 0.7  0.0817      -0.0001       0.1634         0.49       0.1915
## 
## Rho at which ACME = 0: 0.5
## R^2_M*R^2_Y* at which ACME = 0: 0.25
## R^2_M~R^2_Y~ at which ACME = 0: 0.0977

plot(s3.out)

Rough conclusions based on this output

(Reflecting my current knowledge/understanding…)

Path Analysis (invalid for this study?? underpowered?)

3/11/24 This analysis is not appropriate for this dataset; not enough observations, underpowered. The attempts at using these functions also returned low-quality results (poor goodness of fit measures). Dr. Patrick commented that this study was not set up to do this type of analysis. I am keeping this in the notebook for historical purposes, however this will not be used in the dissertation.

Note: I also wrote up a lot of these analytics and transferred the code into Ch 4 before realizing that it is underpowered. All of that effort has been cut from the ms and placed in the ch1-snippets.RMD file. There are some good diagrams there (they’re just not valid). Maybe a future study.

3/15/24 Arcenis says that SEM may be appropriate and recommended that lavaan may actually be a valid package to use. (First began discussing using two-stage least squares regression, which is used when there is an instrumental variable; he had misunderstood my “IV” notation as that, when I meant it as “independent variables.” The white racial affects are NOT instrumental variables; instrumental variables are correlated with the predictor but not correlated with the outcome variable, and that’s not the case here, the emotions show a relationship to selfeff and also directly on account.)

See here for pros of SEM and when to avoid: https://stats.stackexchange.com/questions/11255/whether-to-use-structural-equation-modelling-to-analyse-observational-studies-in/187406#187406

https://stats.stackexchange.com/questions/187924/sample-size-for-sem-with-ordinal-measures

New using SEMinR

# uses seminr package

measurements <- constructs(
  reflective("Accountability",    single_item("ACCOUNT")),
  reflective("White Anger", single_item("WTANGER")),
  reflective("White Guilt", single_item("WTGUILT")),
  reflective("White Fear", single_item("WTFEAR")),
  reflective("Self Efficacy",  single_item("SELFEFF"))
)


# create multiple paths "from" and "to" sets of constructs
structure <- relationships(
  paths(from = c("WTANGER", "WTGUILT", "WTFEAR"), to = "SELFEFF"),
  paths(from = "SELFEFF", to = c("ACCOUNT"))
)

cbsem_model <- estimate_cbsem(data = coredata, measurements, structure)

Resulted in warning: Error in try_or_stop(lavaan::sem(model = lavaan_model, data = data, std.lv = TRUE, : Could not estimating CBSEM using Lavaan: lavaan WARNING: The variance-covariance matrix of the estimated parameters (vcov) does not appear to be positive definite! The smallest eigenvalue (= -3.988603e-03) is smaller than zero. This may be a symptom that the model is not identified.

GOOD TUTORIAL ON PATH https://rpubs.com/tbihansk/302732

See here for description of fit statistics https://davidakenny.net/cm/fit.htm

The problem is, in this first output using the cfa() function, the CFI .710 and TLI .324 are not showing goodness of fit, those statistics indicate a poor model.

library(lavaan)

## This is lavaan 0.6-17
## lavaan is FREE software! Please report any bugs.

library(lavaanPlot)
library(semPlot)
library(OpenMx)
library(tidyverse)
library(knitr)
library(kableExtra)
library(GGally)

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

pathmodel <- '
ACCOUNT ~ SELFEFF
SELFEFF ~ WTANGER + WTGUILT + WTFEAR
'
fit <- cfa(pathmodel, data = coredata)

summary(fit, fit.measures = TRUE, standardized=T,rsquare=T)

## lavaan 0.6.17 ended normally after 2 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                         6
## 
##   Number of observations                            64
## 
## Model Test User Model:
##                                                       
##   Test statistic                                18.384
##   Degrees of freedom                                 3
##   P-value (Chi-square)                           0.000
## 
## Model Test Baseline Model:
## 
##   Test statistic                                60.126
##   Degrees of freedom                                 7
##   P-value                                        0.000
## 
## User Model versus Baseline Model:
## 
##   Comparative Fit Index (CFI)                    0.710
##   Tucker-Lewis Index (TLI)                       0.324
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)                -74.448
##   Loglikelihood unrestricted model (H1)        -65.256
##                                                       
##   Akaike (AIC)                                 160.896
##   Bayesian (BIC)                               173.849
##   Sample-size adjusted Bayesian (SABIC)        154.965
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.283
##   90 Percent confidence interval - lower         0.168
##   90 Percent confidence interval - upper         0.414
##   P-value H_0: RMSEA <= 0.050                    0.001
##   P-value H_0: RMSEA >= 0.080                    0.997
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.099
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   ACCOUNT ~                                                             
##     SELFEFF           0.819    0.130    6.276    0.000    0.819    0.617
##   SELFEFF ~                                                             
##     WTANGER           0.184    0.095    1.933    0.053    0.184    0.225
##     WTGUILT          -0.000    0.047   -0.009    0.993   -0.000   -0.001
##     WTFEAR           -0.238    0.085   -2.802    0.005   -0.238   -0.323
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##    .ACCOUNT           0.213    0.038    5.657    0.000    0.213    0.619
##    .SELFEFF           0.165    0.029    5.657    0.000    0.165    0.841
## 
## R-Square:
##                    Estimate
##     ACCOUNT           0.381
##     SELFEFF           0.159

pathresult <- lavaanPlot(model = fit, node_options = list(shape = "box", fontname = "Helvetica"), edge_options = list(color = "grey"), coefs = TRUE,  sig = 0.054)

pathresult

save_png(pathresult, "fig-pathresults.png")

model_performance(fit)

Now, the output using this sem() function shows 0 degrees of freedom, which is a just-identified model. Also not valid results.

# from https://advstats.psychstat.org/book/path/index.php 
mediationfear <- ' 
  ACCOUNT ~ b*SELFEFF + cp*WTFEAR
  SELFEFF ~ a*WTFEAR
  ab := a*b
  total := a*b + cp
'
mediationfear.res <- sem(mediationfear, data = coredata)

summary(mediationfear.res)

## lavaan 0.6.17 ended normally after 1 iteration
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                         5
## 
##   Number of observations                            64
## 
## Model Test User Model:
##                                                       
##   Test statistic                                 0.000
##   Degrees of freedom                                 0
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   ACCOUNT ~                                           
##     SELFEFF    (b)    0.725    0.133    5.433    0.000
##     WTFEAR    (cp)   -0.211    0.098   -2.141    0.032
##   SELFEFF ~                                           
##     WTFEAR     (a)   -0.242    0.087   -2.784    0.005
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .ACCOUNT           0.199    0.035    5.657    0.000
##    .SELFEFF           0.175    0.031    5.657    0.000
## 
## Defined Parameters:
##                    Estimate  Std.Err  z-value  P(>|z|)
##     ab               -0.176    0.071   -2.477    0.013
##     total            -0.386    0.112   -3.439    0.001

The model fits measures of CFI and TLI do not indicate goodness of fit (see writeup in ch 4 already)

This model shows only how WTFEAR may be mediated by SELFEFF on ACCOUNT. In interpreting those results: The output of the mediator model with SELFEFF as mediator between WTFEAR and ACCOUNT showed that if WTFEAR is regressed against SELFEFF, p-value is 0.005 and the coefficient estimate beta is -0,242. However, when included with SELFEFF in the regression against ACCOUNT, the coefficient beta decreases to -0.211. This indicates that SELFEFF mediates the relationship between WTFEAR and ACCOUNT. This also is confirmed by the ab and total: Both have low p-values, indicating the relationships are statistically significant.

This is different when we run the same thing on WTANGER:

# from https://advstats.psychstat.org/book/path/index.php 
mediationanger <- ' 
  ACCOUNT ~ b*SELFEFF + cp*WTANGER
  SELFEFF ~ a*WTANGER
  ab := a*b
  total := a*b + cp
'
mediationanger.res <- sem(mediationanger, data = coredata)

summary(mediationanger.res)

## lavaan 0.6.17 ended normally after 1 iteration
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                         5
## 
##   Number of observations                            64
## 
## Model Test User Model:
##                                                       
##   Test statistic                                 0.000
##   Degrees of freedom                                 0
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   ACCOUNT ~                                           
##     SELFEFF    (b)    0.729    0.125    5.826    0.000
##     WTANGER   (cp)    0.315    0.102    3.085    0.002
##   SELFEFF ~                                           
##     WTANGER    (a)    0.190    0.099    1.910    0.056
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .ACCOUNT           0.186    0.033    5.657    0.000
##    .SELFEFF           0.185    0.033    5.657    0.000
## 
## Defined Parameters:
##                    Estimate  Std.Err  z-value  P(>|z|)
##     ab                0.138    0.076    1.815    0.069
##     total             0.454    0.123    3.688    0.000

The results looking solely at WTANGER being mediated by SELFEFF show a non-statistically significant ab result which is the path from the emotion through SELFEFF to ACCOUNT. The beta for WTANGER also INCREASES in the full model.

And with WTGUILT the effects are actually even more pronounced:

# from https://advstats.psychstat.org/book/path/index.php 
mediationguilt <- ' 
  ACCOUNT ~ b*SELFEFF + cp*WTGUILT
  SELFEFF ~ a*WTGUILT
  ab := a*b
  total := a*b + cp
'
mediationguilt.res <- sem(mediationguilt, data = coredata)

summary(mediationguilt.res)

## lavaan 0.6.17 ended normally after 7 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                         5
## 
##   Number of observations                            64
## 
## Model Test User Model:
##                                                       
##   Test statistic                                 0.000
##   Degrees of freedom                                 0
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   ACCOUNT ~                                           
##     SELFEFF    (b)    0.818    0.127    6.450    0.000
##     WTGUILT   (cp)    0.097    0.051    1.924    0.054
##   SELFEFF ~                                           
##     WTGUILT    (a)    0.001    0.050    0.022    0.983
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .ACCOUNT           0.202    0.036    5.657    0.000
##    .SELFEFF           0.196    0.035    5.657    0.000
## 
## Defined Parameters:
##                    Estimate  Std.Err  z-value  P(>|z|)
##     ab                0.001    0.041    0.022    0.983
##     total             0.098    0.065    1.511    0.131

WTGUILT has an almost 0 coefficient estimate when regressed on SELFEFF and a very high p-value of almost 1. WTGUILT also does not appear to be a statistically significant influence on ACCOUNT at all. Only the coefficients with p < 0.05 are included in the below diagram.

And try it with all 3

library(lavaanPlot)
# https://www.alexlishinski.com/post/lavaanplot-0-5-1/

mediationall <- ' 
  # direct effect
  ACCOUNT ~ d*SELFEFF + ep*WTANGER + fp*WTGUILT + gp*WTFEAR
  
  # mediator
  SELFEFF ~ a*WTANGER + b*WTGUILT + c*WTFEAR
  
  # indirect effect of the white racial affects
  ad := a*d #WTANGER
  bd := b*d #WTGUILT
  cd := c*d #WTFEAR
  
  # total effect
  total := a*d + b*d + c*d + ep + fp + gp
'
mediationall.res <- sem(mediationall, data = coredata)

summary(mediationall.res)

## lavaan 0.6.17 ended normally after 1 iteration
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                         9
## 
##   Number of observations                            64
## 
## Model Test User Model:
##                                                       
##   Test statistic                                 0.000
##   Degrees of freedom                                 0
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   ACCOUNT ~                                           
##     SELFEFF    (d)    0.623    0.123    5.061    0.000
##     WTANGER   (ep)    0.299    0.096    3.103    0.002
##     WTGUILT   (fp)    0.087    0.046    1.890    0.059
##     WTFEAR    (gp)   -0.246    0.089   -2.768    0.006
##   SELFEFF ~                                           
##     WTANGER    (a)    0.184    0.095    1.933    0.053
##     WTGUILT    (b)   -0.000    0.047   -0.009    0.993
##     WTFEAR     (c)   -0.238    0.085   -2.802    0.005
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .ACCOUNT           0.160    0.028    5.657    0.000
##    .SELFEFF           0.165    0.029    5.657    0.000
## 
## Defined Parameters:
##                    Estimate  Std.Err  z-value  P(>|z|)
##     ad                0.115    0.063    1.806    0.071
##     bd               -0.000    0.029   -0.009    0.993
##     cd               -0.149    0.061   -2.451    0.014
##     total             0.106    0.151    0.703    0.482

confirmedpath <- lavaanPlot(model = mediationall.res, node_options = list(shape = "box", fontname = "Helvetica"), edge_options = list(color = "grey"), coefs = TRUE,covs = TRUE, stars = c("regress"))

confirmedpath

save_png(confirmedpath, "fig-confirmedpath.png")

model_performance(mediationall.res)

The Model Test User Model output by lavaan of Test statistic 0.000 and Degrees of freedom 0 indicate that this is a just-identified model which is a sign that it may be overfit to the data.

Including all three white racial affects in the mediation model gives same results, at least based on p-values: Only WTFEAR is mediated by SELFEFF (p = .005 in regression against ACCOUNT; the other two p > .05, though WTANGER p = .053 which rounds to .05 so I don’t know how to consider that??). WTGUILT is not an important element in the model at all.

There are correlation coefficients between the other two white racial affects, as shown in the path diagram, but they are not statistically significant.

The influence of SELFEFF on ACCOUNT is lessened with all three emotions in the model - the beta here for SELFEFF on ACCOUNT is 0.623, it has a much higher influence when only one of the emotions are included. So this means that… what? That the emotions are actually moderating the influence of SELFEFF?

Test with no direct effect - this test actually shows that SELFEFF as a mediator variable is not sufficient, that the direct effect of the emotions is also needed to improve the fit of the model.

The code below shows the model without any direct effects from any of the emotions to ACCOUNT. This is only showing effect of the emotions through SELFEFF. This is from this tutorial

From the text of that tutorial: “To evaluate the hypothesis, we can check the model fit. The null hypothesis is “H0: The model fits the data well or the model is supported”. The alternative hypothesis is “H1: The model does not fit the data or the model is rejected”. The model with the direct effect fits the data perfectly. Therefore, if the current model also fits the data well, we fail to reject the null hypothesis. Otherwise, we reject it. The test of the model can be conducted based on a chi-squared test.”

Applying it to my results: From the output, the Chi-square is 18.384 with 3 degree of freedom. The p-value is about 0. Therefore, the null hypothesis is rejected. This indicates that the model without direct effect is not a good model.

# https://advstats.psychstat.org/book/path/index.php

mediationnodirecte <- ' 
  # direct effect
  ACCOUNT ~ d*SELFEFF 
  
  # mediator
  SELFEFF ~ a*WTANGER + b*WTGUILT + c*WTFEAR
  

'
mediationnodirecte.res <- sem(mediationnodirecte, data = coredata)

summary(mediationnodirecte.res)

## lavaan 0.6.17 ended normally after 2 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                         6
## 
##   Number of observations                            64
## 
## Model Test User Model:
##                                                       
##   Test statistic                                18.384
##   Degrees of freedom                                 3
##   P-value (Chi-square)                           0.000
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   ACCOUNT ~                                           
##     SELFEFF    (d)    0.819    0.130    6.276    0.000
##   SELFEFF ~                                           
##     WTANGER    (a)    0.184    0.095    1.933    0.053
##     WTGUILT    (b)   -0.000    0.047   -0.009    0.993
##     WTFEAR     (c)   -0.238    0.085   -2.802    0.005
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .ACCOUNT           0.213    0.038    5.657    0.000
##    .SELFEFF           0.165    0.029    5.657    0.000

checkwithoutdirecte <- lavaanPlot(model = mediationnodirecte.res, node_options = list(shape = "box", fontname = "Helvetica"), edge_options = list(color = "grey"), coefs = TRUE,  sig = 0.054)

checkwithoutdirecte

save_png(checkwithoutdirecte, "fig-checkwithoutdirecteffect.png")

This is the confirmed model??

Below meets goodness-of-fit measures and shows both direct and indirect relationships. This is different from what was originally proposed.

# trying to tweak things based on information from chatbot that it's 
# a just-identified model. Removed   bd := b*d #WTGUILT


mediation2 <- ' 
  # direct effect
  ACCOUNT ~ d*SELFEFF + ep*WTANGER + fp*WTGUILT + gp*WTFEAR
  
  # mediator
  SELFEFF ~ a*WTANGER + c*WTFEAR
  
  # indirect effect of the white racial affects
  #ad := a*d #WTANGER
  #cd := c*d #WTFEAR
  
  # total effect
  total := a*d + c*d + ep + fp + gp
'
mediation2.res <- sem(mediation2, data = coredata)
#mediation2.res <- sem(mediation2,  data = subset(coredata, extremelyinfluential == FALSE))

summary(mediation2.res)

## lavaan 0.6.17 ended normally after 1 iteration
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                         8
## 
##   Number of observations                            64
## 
## Model Test User Model:
##                                                       
##   Test statistic                                 0.000
##   Degrees of freedom                                 1
##   P-value (Chi-square)                           0.993
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   ACCOUNT ~                                           
##     SELFEFF    (d)    0.623    0.123    5.061    0.000
##     WTANGER   (ep)    0.299    0.096    3.103    0.002
##     WTGUILT   (fp)    0.087    0.046    1.890    0.059
##     WTFEAR    (gp)   -0.246    0.089   -2.768    0.006
##   SELFEFF ~                                           
##     WTANGER    (a)    0.184    0.094    1.961    0.050
##     WTFEAR     (c)   -0.239    0.085   -2.820    0.005
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .ACCOUNT           0.160    0.028    5.657    0.000
##    .SELFEFF           0.165    0.029    5.657    0.000
## 
## Defined Parameters:
##                    Estimate  Std.Err  z-value  P(>|z|)
##     total             0.106    0.150    0.706    0.480

model_performance(mediation2.res)

# can also add covs=TRUE to show beween-emotions numbers

confirmedpath2 <- lavaanPlot(model = mediation2.res, node_options = list(shape = "box", fontname = "Helvetica"), edge_options = list(color = "grey"), coefs = TRUE, stars = c("regress"))

confirmedpath2

save_png(confirmedpath2, "fig-confirmedpath2.png")

3/3/24 Did comparisons of the same equation with: - outlier included and not: results are very similar, except effect size is markedly greater on several relationship with the outlier included - counselors only vs MFTs: Also similar, however WTGUILT was increased and was statistically significant; WTANGER was reduced effect AND no longer a statistically significant result.

Let’s do that same model with cfa() instead of sem()

fitcfarev <- cfa(mediation2, data = coredata)

summary(fitcfarev, fit.measures = TRUE, standardized=T,rsquare=T)

## lavaan 0.6.17 ended normally after 1 iteration
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                         8
## 
##   Number of observations                            64
## 
## Model Test User Model:
##                                                       
##   Test statistic                                 0.000
##   Degrees of freedom                                 1
##   P-value (Chi-square)                           0.993
## 
## Model Test Baseline Model:
## 
##   Test statistic                                60.126
##   Degrees of freedom                                 7
##   P-value                                        0.000
## 
## User Model versus Baseline Model:
## 
##   Comparative Fit Index (CFI)                    1.000
##   Tucker-Lewis Index (TLI)                       1.132
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)                -65.256
##   Loglikelihood unrestricted model (H1)        -65.256
##                                                       
##   Akaike (AIC)                                 146.512
##   Bayesian (BIC)                               163.783
##   Sample-size adjusted Bayesian (SABIC)        138.605
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.000
##   90 Percent confidence interval - lower         0.000
##   90 Percent confidence interval - upper         0.000
##   P-value H_0: RMSEA <= 0.050                    0.993
##   P-value H_0: RMSEA >= 0.080                    0.006
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.000
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   ACCOUNT ~                                                             
##     SELFEFF    (d)    0.623    0.123    5.061    0.000    0.623    0.470
##     WTANGER   (ep)    0.299    0.096    3.103    0.002    0.299    0.276
##     WTGUILT   (fp)    0.087    0.046    1.890    0.059    0.087    0.164
##     WTFEAR    (gp)   -0.246    0.089   -2.768    0.006   -0.246   -0.251
##   SELFEFF ~                                                             
##     WTANGER    (a)    0.184    0.094    1.961    0.050    0.184    0.225
##     WTFEAR     (c)   -0.239    0.085   -2.820    0.005   -0.239   -0.323
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##    .ACCOUNT           0.160    0.028    5.657    0.000    0.160    0.464
##    .SELFEFF           0.165    0.029    5.657    0.000    0.165    0.841
## 
## R-Square:
##                    Estimate
##     ACCOUNT           0.536
##     SELFEFF           0.159
## 
## Defined Parameters:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##     total             0.106    0.150    0.706    0.480    0.106    0.143

pathresultcfarev <- lavaanPlot(model = fitcfarev, node_options = list(shape = "box", fontname = "Helvetica"), edge_options = list(color = "grey"), coefs = TRUE,  sig = 0.054)

pathresultcfarev

save_png(pathresultcfarev, "fig-fitcfarev.png")

model_performance(fitcfarev)

This also appears to not be a good fitting model. The CFI should be > 0.95 (hu) but the SRMR = 0 which indicates perfect fit. See https://ecologicalprocesses.springeropen.com/articles/10.1186/s13717-016-0063-3

So I am overfitting to the data.

What happens when I run all of this with the full dataset???

Maybe this is the actual relationship?

SELFEFF being influenced by ACCOUNTABILITY and ACCOUNTABILITY being influenced by the emotions?

pathmodel2 <- '
SELFEFF ~ ACCOUNT
ACCOUNT ~ WTFEAR + WTGUILT + WTANGER
'
fit2 <- cfa(pathmodel2, data = coredata)

summary(fit2, fit.measures = TRUE, standardized=T,rsquare=T)

## lavaan 0.6.17 ended normally after 2 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                         6
## 
##   Number of observations                            64
## 
## Model Test User Model:
##                                                       
##   Test statistic                                 1.896
##   Degrees of freedom                                 3
##   P-value (Chi-square)                           0.594
## 
## Model Test Baseline Model:
## 
##   Test statistic                                60.126
##   Degrees of freedom                                 7
##   P-value                                        0.000
## 
## User Model versus Baseline Model:
## 
##   Comparative Fit Index (CFI)                    1.000
##   Tucker-Lewis Index (TLI)                       1.048
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)                -66.204
##   Loglikelihood unrestricted model (H1)        -65.256
##                                                       
##   Akaike (AIC)                                 144.408
##   Bayesian (BIC)                               157.362
##   Sample-size adjusted Bayesian (SABIC)        138.478
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.000
##   90 Percent confidence interval - lower         0.000
##   90 Percent confidence interval - upper         0.177
##   P-value H_0: RMSEA <= 0.050                    0.654
##   P-value H_0: RMSEA >= 0.080                    0.269
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.037
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   SELFEFF ~                                                             
##     ACCOUNT           0.465    0.074    6.276    0.000    0.465    0.617
##   ACCOUNT ~                                                             
##     WTFEAR           -0.395    0.099   -3.976    0.000   -0.395   -0.403
##     WTGUILT           0.087    0.055    1.592    0.111    0.087    0.164
##     WTANGER           0.414    0.111    3.731    0.000    0.414    0.382
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##    .SELFEFF           0.121    0.021    5.657    0.000    0.121    0.619
##    .ACCOUNT           0.224    0.040    5.657    0.000    0.224    0.650
## 
## R-Square:
##                    Estimate
##     SELFEFF           0.381
##     ACCOUNT           0.350

skip for now

# from https://advstats.psychstat.org/book/path/index.php

library(lavaan)
#usedata('coredata')
# mediation <- '
# + ACCOUNT ~ b*SELFEFF + cp*WTANGER
# + SELFEFF ~ a*WTANGER
# + ab := a*b
# + total := a*b + cp
# + '
# 
# mediation.res<-sem(mediation, data=coredata)
# summary(mediation.res)

mymodel <- '
  # regressions
    ACCOUNT ~ SELFEFF
    SELFEFF ~ WTGUILT + WTANGER + WTFEAR
  # residual correlations
    # y1 ~~ y5
    # y2 ~~ y4 + y6
    # y3 ~~ y7
    # y4 ~~ y8
    # y6 ~~ y8
'

fit <- sem(mymodel, data = coredata)
summary(fit, standardized = TRUE)

And try it with all 3

semselfeff <- ' 
  # direct effect
  SELFEFF ~ d*ACCOUNT + ep*WTANGER + fp*WTGUILT + gp*WTFEAR
  
  # mediator
  ACCOUNT ~ a*WTANGER + b*WTGUILT + c*WTFEAR
  
  # indirect effect of the white racial affects
  ad := a*d #WTANGER
  bd := b*d #WTGUILT
  cd := c*d #WTFEAR
  
  # total effect
  total := a*d + b*d + c*d + ep + fp + gp
'
semselfeff.res <- sem(semselfeff, data = coredata)

summary(semselfeff.res)

## lavaan 0.6.17 ended normally after 1 iteration
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                         9
## 
##   Number of observations                            64
## 
## Model Test User Model:
##                                                       
##   Test statistic                                 0.000
##   Degrees of freedom                                 0
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   SELFEFF ~                                           
##     ACCOUNT    (d)    0.458    0.091    5.061    0.000
##     WTANGER   (ep)   -0.006    0.089   -0.066    0.947
##     WTGUILT   (fp)   -0.040    0.040   -0.999    0.318
##     WTFEAR    (gp)   -0.058    0.080   -0.716    0.474
##   ACCOUNT ~                                           
##     WTANGER    (a)    0.414    0.111    3.731    0.000
##     WTGUILT    (b)    0.087    0.055    1.592    0.111
##     WTFEAR     (c)   -0.395    0.099   -3.976    0.000
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .SELFEFF           0.118    0.021    5.657    0.000
##    .ACCOUNT           0.224    0.040    5.657    0.000
## 
## Defined Parameters:
##                    Estimate  Std.Err  z-value  P(>|z|)
##     ad                0.190    0.063    3.003    0.003
##     bd                0.040    0.026    1.519    0.129
##     cd               -0.181    0.058   -3.127    0.002
##     total            -0.055    0.129   -0.425    0.670

selfeffpath <- lavaanPlot(model = semselfeff.res, node_options = list(shape = "box", fontname = "Helvetica"), edge_options = list(color = "grey"), coefs = TRUE,covs = TRUE, stars = c("regress"))

selfeffpath

save_png(selfeffpath, "fig-selfeffpath.png")

model_performance(semselfeff.res)

Nope, it’s a just-identified model.

But, some remarks on what it showed: WTANGER and WTGUILT totally not showing direct effects on SELFEFF. But they DO show direct effects on ACCOUNT through the original proposed model (per regressions). WTFEAR is the only one that has a valid direct effect, and it’s smaller in this model (-0.128, compared to the -0.24 we were seeing with the ACCOUNT DV models). What this does seem to emphasize though: The three white racial affects have a DIRECT EFFECT on ACCOUNT.

So my thinking was, these racialized emotions affect people’s ability to feel like they have self-efficacy around antiracism, but really they are affecting people’s ability to even feel ACCOUNTABLE for antiracism.

WTFEAR makes your accountability go DOWN.

OMG THAT SEEMS HUGE?!??

library(readr)

if (counselingprofessionalsonly == TRUE) {
 write_csv(cleandata, "data\\output-counselors-cleandata.csv") 
 write_csv(coredata, "data\\output-counselors-coredata.csv")
}
write_csv(cleandata, "data\\output-cleandata.csv")
write_csv(coredata, "data\\output-coredata.csv")
write_csv(pcrw_subset_recoded, "data\\pcrw.csv")
write_csv(justvars, "data\\justvars.csv")

#library(webshot)
flextable(lm(ACCOUNT ~ WTFEAR + WTGUILT + WTANGER + SELFEFF, data = coredata)) %>%
  set_caption(caption = "Table 1") %>% 
  save_as_image("tmp1.png")

flextable(lm(ACCOUNT ~ WTFEAR + WTGUILT + WTANGER + SELFEFF, data = subset(coredata, extremelyinfluential == FALSE))) %>% 
  set_caption(caption = "Table 2") %>% save_as_image("tmp2.png")
knitr::include_graphics(c("tmp1.png", "tmp2.png"))

library(stargazer)

# this library seems to suck

modelwithextreme = lm(ACCOUNT ~ WTFEAR + WTGUILT + WTANGER + SELFEFF, data = coredata) 
modelwithoutextreme = lm(ACCOUNT ~ WTFEAR + WTGUILT + WTANGER + SELFEFF, data = subset(coredata, extremelyinfluential == FALSE)) 
#stargazer(modelwithextreme, modelwithoutextreme, title="Regression Results", align=TRUE,type="html")

stargazer(modelwithextreme, modelwithoutextreme, title="Regression Results")

Aguinis, H., Gottfredson, R. K., & Joo, H. (2013). Best-practice recommendations for defining, identifying, and handling outliers. Organizational Research Methods, 16(2), 270–301. https://doi.org/10.1177/1094428112470848

Ato García, M., Vallejo Seco, G., & Ato Lozano, E. (2014). Classical and causal inference approaches to statistical mediation analysis. Psicothema, 26.2, 252–259. https://doi.org/10.7334/psicothema2013.314

Flamez, B., Lenz, A. S., & Balkin, R. S. (2017). A counselor’s guide to the dissertation process: Where to start and how to finish. American Counseling Association.

Fritz, M. S., & MacKinnon, D. P. (2007). Required sample size to detect the mediated effect. Psychological Science, 18(3), 233–239. https://doi.org/10.1111/j.1467-9280.2007.01882.x

Hair, J. F., Ringle, C. M., & Sarstedt, M. (2011). PLS-SEM: Indeed a silver bullet. Journal of Marketing Theory and Practice, 19(2), 139–152. https://doi.org/10.2753/MTP1069-6679190202

Soble, J. R., Spanierman, L. B., & Liao, H.-Y. (2011). Effects of a brief video intervention on White university students’ racial attitudes. Journal of Counseling Psychology, 58(1), 151–157. https://doi.org/10.1037/a0021158

Spanierman, L. B., Beard, J. C., & Todd, N. R. (2012). White men’s fears, white women’s tears: Examining gender differences in racial affect types. Sex Roles, 67(3-4), 174–186. https://doi.org/10.1007/s11199-012-0162-2

Spanierman, L. B., Poteat, V. P., Beer, A. M., & Armstrong, P. I. (2006). Psychosocial costs of racism to whites: Exploring patterns through cluster analysis. Journal of Counseling Psychology, 53(4), 434–441. https://doi.org/10.1037/0022-0167.53.4.434

Yao, X., Yu, L., Shen, Y., Kang, Z., & Wang, X. (2021). The role of self-efficacy in mediating between professional identity and self-reported competence among nursing students in the internship period: A quantitative study. Nurse Education in Practice, 57, 103252. https://doi.org/10.1016/j.nepr.2021.103252

Lisa’s Preliminary Data Analysis

Overview of the Study

Getting started: Get the data in and clean it up

Set up R environment

Set variables to be used to filter the dataset

Load libraries

Read in CSV file

Fix up the formats

Shorten actual values

Data Clean-up

Exclude incomplete and invalid responses

Examine alternative fill-in answers for inclusion/exclusion

Remove psychologists

Number of valid & complete observations (“cleandata” dataset)

Descriptive Statistics

Calculate scores from instruments

Quick: Run an ANOVA against the profID field

Report participants’ details

Experimenting with apaTables

PRACTICING WITH NICE_TABLE

Limit Dataset to Desired Participant Attributes

Number of observations in subset to be analyzed

K-means clustering per Spanierman

Consistency checks - Cronbach’s alpha

Omega output by psych package (SKIP)

PCRW Subscales

Modified WPAS instrument

Look for outliers

Examine sample for model assumptions

Histograms on the sample variables

Analysis -

Descriptive Statistics NEW

this was deleted from the manuscript 2-19

Messiness below

Correlation matrix with values rounded to 3 decimals

Basic scatter plots

RANDOM

Basic linear regression on ACCOUNT

Basic linear regression without WTGUILT

Examining the residuals plots for original attempted model

Identify Outliers

Correlations using Kendall’s tau

Researcher decisions

Evaluation of data against model assumptions after removing extremely influential point

Test for multicollinearity of the 3 white racial affect independent variables

Simple model: Only SELFEFF on ACCOUNT

Goodness of Fit: Examine R-squared

Non-parametric tests considered

One sample t-test

Spearman’s rho

Throwing shit at the wall

Quantile Regression

Nonparametric regression (fancy stuff)

GLMULTI package

Nadaray Watson kernel function

My understanding so far

Linear regression of the white racial affects on SELFEFF

Linear regression of the white racial affects on ACCOUNT

Linear regression of ACCOUNT with just SELFEFF and WTFEAR

Variable Selection

Sequential Estimation

Stepwise Regression

Partial Mediation Model

Robust Regression

ANOVA

ACCOUNT and sexuality

ACCOUNT and age group

ACCOUNT and region

ACCOUNT and SES (class)

ACCOUNT and counseling experience (years practicing)

Bootstrap

Run the LM against the boostrapped data

Robust ??

Poisson

Other types of regressions I’ve tried to run

Generalized Linear Model

Two-Stage Least-Squares Regression using ivreg

Statistical Mediation Analysis (invalid? apparently underpowered)

Rough conclusions based on this output

Path Analysis (invalid for this study?? underpowered?)