1 Introduction

For this analysis, two validated and reliable survey instruments, the Self-Compassion Scale and the Gratitude Questionnaire, were used to measure levels of self-compassion and gratitude in students. Specifically, we want to see how the self-compassion and gratitude of BSW and MSW students in a Social Work Program at a regional university relate to demographic variables. This can help others study self-compassion in these students and how it connects to their gratitude, self-care, and success in the social work field, which in turn can inform the self-care practices offered to students during training.

The survey data set has three main components. First, there is the self-compassion scale instrument, composed of 12 self-compassion question variables. Second, there is the gratitude questionnaire instrument, composed of 6 gratitude question variables. Lastly, there is a set of demographic questions that address different aspects of each student, such as gender, race, and political affiliation. This data set will go through some data management before it is used for analysis. We will impute missing values in each item of the self-compassion and gratitude survey instruments. Then, we will extract the first two principal component (PC) scores from each survey instrument using principal component analysis (PCA). Missing values in the demographic variables are replaced with the code 99, and these variables are then re-categorized to fix imbalanced categories. The resulting four principal component scores and re-categorized demographic variables make up the final analytic data set to be used for analysis.

With this final data set, we will run exploratory data analysis (EDA) on descriptive statistics of the data. Then, we will build a candidate linear analysis of covariance (ANCOVA) regression model and work towards a final model. This final model can then be used to see whether the self-compassion of students is affected by the gratitude index and demographic variables.

2 Data Management and Analytic Data Set Creation

2.1 Handling Missing Values

We will impute missing values in each of the items of both survey instruments by substituting the mode of the corresponding survey item wherever a value is missing. Since there are only a few missing values in each instrument, this will not materially affect the principal component analysis for either. We will later create indexes for the two instruments separately to combine the information in these two survey data sets.
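A minimal sketch of this mode imputation, assuming the raw item responses are held in data frames named sc.items (Q2_1 to Q2_12) and gq.items (Q3_1 to Q3_6); these object names and the helper item.mode are illustrative, not the original code.

# Mode of a vector, ignoring missing values
item.mode <- function(x) {
  tab <- table(x)
  as.numeric(names(tab)[which.max(tab)])
}

# Replace missing values in every column with that column's mode
impute.mode <- function(df) {
  as.data.frame(lapply(df, function(x) {
    x[is.na(x)] <- item.mode(x)
    x
  }))
}

sc.items <- impute.mode(sc.items)   # 12 self-compassion items
gq.items <- impute.mode(gq.items)   # 6 gratitude items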

For the gratitude questionnaire, the Likert scales of items Q3_3 and Q3_6 were reverse-coded in this design, so we will transform them back to the usual order and create a new data set using the same variable names.
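A sketch of this reverse-coding step, assuming the items are scored on a 1 to 7 Likert scale so that a response x maps to 8 - x; adjust the constant if the scale differs.

# Reverse-code Q3_3 and Q3_6 back to the usual direction (1-7 scale assumed)
gq.items$Q3_3 <- 8 - gq.items$Q3_3
gq.items$Q3_6 <- 8 - gq.items$Q3_6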

For the demographic variables, we will handle missing values by inserting the code 99 wherever a value is missing.
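A one-line sketch of this step, assuming the demographic responses sit in a numeric data frame called demo.items (an illustrative name):

# Flag missing demographic responses with the code 99
demo.items[is.na(demo.items)] <- 99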

2.2 Definitions of Demographic Variables

To fix the imbalanced categories, we will re-categorize the demographic variables. The size of the data set is close to 120 observations, so imputing missing values and collapsing sparse categories sensibly is critical for preserving sample size and statistical power in later analyses. The following modifications were made to the demographic variables, which will be used in the subsequent modeling; a sketch of the recoding appears after the list.

grp.age = Q8_1 (age in years): 1 = (3,23], 2 = [24, 30], 3 = [31, 59]
grp.edu = Q8_2 (years of education completed): 1 = [0,15] associate, 2 = [15.5,18.5] bachelor, 3 = [19, 25] advanced degree
grp.empl = Q8_3 (years of professional employment): 1 = [0,5] entry, 2 = [5.5,10] junior, 3 = [10.5, 35] senior
kid.num = Q8_5 (number of living children): 1 = (0) No child, 2 = at least one child
home.size = Q8_6 (number of persons who live in your household): 1 = (1), 2 = (2), 3 = 3 or more
gender = Q9 (gender): 1 = (1) male, 2 = (2) female, 3 = (3,4,5,6,7) other gender identity
race = Q11 (race/ethnicity): 1 = (1) white, 2 = (2,3,4,5) non-white
marital.st = Q13 (current marital status): 1 = (1,3,4,5) single or was once married, 2 = (2,6) married or has a partner
disability = Q14 (Do they have a disability?): 1 = (1) Yes, 2 = (2) No
religion = Q15 (religious affiliation): 1 = (1,2,3,4,5,6,7,8) traditional religions, 2 = (9) no religion, 3 = (10,11,12) non-traditional religion or non-specific answer
sexual.orient = Q16 (sexual orientation): 1 = (1,2,6,8) Homosexual, 2 = (3,4,5) heterosexual or bisexual, 3 = (7,9,10) sexual orientation not fully specified
poli.affil = Q17 (political affiliation): 1 = (1,2,3) Republican, 2 = (4) Independent, 3 = (5,6,7) Democrat
SW.program = Q18 (Present educational level at the current institution): 1 = (1) BSW, 2 = (2) MSW
urbanity = Q19 (Do you live in an urban, suburban, or rural area?): 1 = (1) Urban, 2 = (2) Rural, 3 = (3) Suburban
spirituality = Q20 (Where do you see yourself on this scale in terms of spirituality?): 1 = (1,2,3) low, 2 = (4) moderate, 3 = (5,6,7) high
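A sketch of this recoding for two representative variables, grp.age and race, assuming the raw demographic items are stored in demo.items with underscore names (Q8_1, Q11); the remaining variables follow the same pattern.

# Age in years -> three age groups (cut points follow the grp.age definition above)
demo.items$grp.age <- cut(demo.items$Q8_1,
                          breaks = c(3, 23, 30, 59),
                          include.lowest = TRUE)  # keeps the minimum recorded age in group 1

# Race/ethnicity -> white vs. non-white (a 99 code, if present, would need its own level)
demo.items$race <- factor(ifelse(demo.items$Q11 == 1, "white", "non-white"))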

2.3 Combining Survey Items: PCA

Principal component analysis (PCA) is a dimension-reduction procedure that combines possibly correlated variables, even when they are measured on different scales, into a smaller set of new variables. Each new variable (principal component) is a linear combination of the original, correlated variables. Here, the first two principal components, PC1 and PC2, obtained from each instrument will be used as index variables in the subsequent regression models. The first principal component explains the largest share of the variation in the data cloud, while the second explains the next largest amount.

Now that the demographic variables have been re-categorized, we will use principal component analysis (PCA) to aggregate the information in the 12-item self-compassion instrument and the 6-item gratitude instrument. The first two principal components obtained from each survey instrument will be used in the regression analysis to reflect the association between self-compassion and gratitude scores.

2.3.1 Self-Compassion Scores

We will use the PCA method here to reduce the 12 self-compassion items to a smaller number of dimensions. Then, we can take the factor loadings of the principal component analysis and write the corresponding system of linear transformations. Only the first two PCs of this analysis will be used as response variables for the regression modeling. The PC scores of these two components will act as the values of the new transformed variables.
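A minimal sketch of this step using prcomp(), assuming the imputed items are in sc.items; whether the items are standardized (scale. = TRUE) is an assumption here, and the 6 gratitude items in Section 2.3.2 are handled the same way.

# PCA of the 12 imputed self-compassion items
sc.pca <- prcomp(sc.items, center = TRUE, scale. = TRUE)

round(sc.pca$rotation[, 1:2], 2)   # loadings of PC1 and PC2 (see table below)
summary(sc.pca)                    # standard deviations and variance proportions

# PC scores of the first two components, used later as response variables
sc.idx.1 <- sc.pca$x[, 1]
sc.idx.2 <- sc.pca$x[, 2]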

Factor loadings of the PCA
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 PC11 PC12
Q2_1 0.29 -0.44 0.22 -0.09 -0.23 0.14 -0.03 0.05 0.56 0.18 -0.05 -0.49
Q2_2 -0.28 -0.34 -0.30 -0.07 0.45 0.33 -0.09 -0.08 0.43 -0.18 0.28 0.29
Q2_3 -0.27 -0.24 0.11 0.53 -0.20 0.43 -0.07 -0.44 -0.26 -0.14 -0.21 -0.13
Q2_4 0.28 -0.36 -0.25 -0.35 0.44 0.02 -0.03 -0.17 -0.51 0.05 -0.17 -0.32
Q2_5 -0.25 -0.28 -0.32 0.11 -0.15 -0.50 0.63 -0.15 0.06 -0.11 0.07 -0.15
Q2_6 -0.32 -0.17 -0.21 0.20 -0.02 0.18 0.03 0.58 -0.21 0.60 0.07 -0.08
Q2_7 -0.21 -0.29 0.61 0.17 0.42 -0.32 -0.04 0.33 -0.05 -0.25 -0.13 -0.02
Q2_8 0.32 -0.32 0.28 0.12 0.01 -0.10 0.15 -0.29 -0.11 0.46 0.26 0.54
Q2_9 0.33 -0.20 -0.07 -0.03 -0.19 0.34 0.40 0.42 -0.09 -0.37 -0.30 0.35
Q2_10 -0.24 -0.39 -0.02 -0.38 -0.52 -0.19 -0.46 0.07 -0.20 -0.17 0.12 0.18
Q2_11 0.36 -0.02 -0.09 0.36 -0.03 0.00 -0.11 0.19 -0.19 -0.33 0.70 -0.23
Q2_12 0.27 -0.11 -0.42 0.45 0.04 -0.36 -0.43 0.06 0.18 0.03 -0.40 0.16

The explicit expression of the predictive system of the first two PCs is given by

\[ \begin{aligned} SC_1 & = 0.29*Q2_{1} - 0.28*Q2_{2} - 0.27*Q2_{3} + 0.28*Q2_{4} - 0.25*Q2_{5} - 0.32*Q2_{6} - 0.21*Q2_{7} + 0.32*Q2_{8} + 0.33*Q2_{9} - 0.24*Q2_{10} + 0.36*Q2_{11} + 0.27*Q2_{12} \\ SC_2 & = -0.44*Q2_{1} - 0.34*Q2_{2} - 0.24*Q2_{3} - 0.36*Q2_{4} - 0.28*Q2_{5} - 0.17*Q2_{6} - 0.29*Q2_{7} - 0.32*Q2_{8} - 0.20*Q2_{9} - 0.39*Q2_{10} - 0.02*Q2_{11} - 0.11*Q2_{12} \\ \end{aligned} \]

The importance of the principal components is given by

The importance of each principal component
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 PC11 PC12
Standard deviation 2.22516 1.151378 0.9866141 0.9041854 0.8395367 0.8272044 0.79413 0.7335469 0.6435139 0.6285915 0.5548848 0.5070264
Proportion of Variance 0.41261 0.110470 0.0811200 0.0681300 0.0587400 0.0570200 0.05255 0.0448400 0.0345100 0.0329300 0.0256600 0.0214200
Cumulative Proportion 0.41261 0.523080 0.6042000 0.6723300 0.7310700 0.7880900 0.84064 0.8854800 0.9199900 0.9529200 0.9785800 1.0000000

From the above table, it appears that the first principal component explains about \(41.26\%\) of the total variation. The first two principal components together explain about \(52.3\%\) of the total variation. If we use only the first two PCs in the data analysis, we lose about \(47.7\%\) of the information.

2.3.2 Gratitude Scores

We will use the PCA method to reduce the 6 gratitude items to a smaller number of dimensions. Then, we can take the factor loadings of the principal component analysis and write the corresponding system of linear transformations. Only the first two PCs of this analysis will be used for the regression modeling. The PC scores of these two components will act as the values of the new transformed variables.

Factor loadings of the PCA
PC1 PC2 PC3 PC4 PC5 PC6
Q3_1 0.44 -0.21 0.20 -0.60 -0.18 0.58
Q3_2 0.46 -0.12 -0.17 -0.42 0.34 -0.67
Q3_3 0.38 0.56 -0.38 0.07 -0.62 -0.09
Q3_4 0.41 -0.24 -0.55 0.45 0.37 0.37
Q3_5 0.39 -0.47 0.45 0.47 -0.37 -0.25
Q3_6 0.36 0.60 0.53 0.20 0.43 0.10

The explicit expression of the predictive system of the first two PCs is given by

\[ \begin{aligned} G_1 & = 0.44*Q3_1 + 0.46*Q3_2 + 0.38*Q3_3 + 0.41*Q3_4 + 0.39*Q3_5 + 0.36*Q3_6 \\ G_2 & = -0.21*Q3_1 - 0.12*Q3_2 + 0.56*Q3_3 - 0.24*Q3_4 -0.47*Q3_5 + 0.60*Q3_6 \\ \end{aligned} \]

The importance of the principal components is given by

The importance of each principal component
PC1 PC2 PC3 PC4 PC5 PC6
Standard deviation 1.778559 0.9131658 0.8335208 0.7380542 0.6666034 0.5648139
Proportion of Variance 0.527210 0.1389800 0.1157900 0.0907900 0.0740600 0.0531700
Cumulative Proportion 0.527210 0.6661900 0.7819800 0.8727700 0.9468300 1.0000000

From the above table, it appears that the first principal component explains about \(52.72\%\) of the total variation. The first two principal components together explain about \(66.62\%\) of the total variation. If we use only the first two PCs in the data analysis, we lose about \(33.38\%\) of the information.

2.4 Final Analytic Data Set

With missing values imputed, the final analytic data set uses the first two transformed PC score variables for each of the self-compassion and gratitude instruments as its index variables. The rest of the variables consist of the set of modified, re-categorized demographic variables, with missing values coded as 99.
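A sketch of how the pieces might be assembled, assuming the fitted PCA objects are named sc.pca and gq.pca and that all fifteen re-categorized variables have been added to demo.items; final.analytic.data is the name that appears in the model output later.

# Assemble PC score indexes and re-categorized demographics into one data set
final.analytic.data <- data.frame(
  sc.idx.1 = sc.pca$x[, 1],   # self-compassion PC1
  sc.idx.2 = sc.pca$x[, 2],   # self-compassion PC2
  gr.idx.1 = gq.pca$x[, 1],   # gratitude PC1
  gr.idx.2 = gq.pca$x[, 2],   # gratitude PC2
  demo.items[, c("grp.age", "grp.edu", "grp.empl", "kid.num", "home.size",
                 "gender", "race", "marital.st", "disability", "religion",
                 "sexual.orient", "poli.affil", "SW.program", "urbanity",
                 "spirituality")]
)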

3 Data Analysis

With this final analytic data set, we will perform some exploratory data analysis (EDA) and examine descriptive statistics and histograms of the variables in the data set. Then, we will build a candidate linear regression model. We will use residual diagnostics to see whether a linear ANCOVA regression model is appropriate for this analysis. If the assumptions of this candidate model are not violated, we will use it as the final model for determining how the self-compassion of social work students is affected by the gratitude index and demographic variables.

3.1 Exploratory Data Analysis (EDA)

We will start by looking at frequency tables showing the raw responses for each demographic variable and how many observations fell into each category.
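A brief sketch of how these tables can be produced, assuming the raw items (with 99 already inserted for missing values) are still available in demo.items; note that the output below prints the item names with dots (Q8.1, Q.9, ...) rather than underscores.

# Frequency table for each raw demographic item used in the re-categorization
raw.items <- c("Q8_1", "Q8_2", "Q8_3", "Q8_5", "Q8_6", "Q9", "Q11", "Q13",
               "Q14", "Q15", "Q16", "Q17", "Q18", "Q19", "Q20")
lapply(demo.items[, raw.items], table)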

## $Q8.1
## 
##  3 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 35 36 39 40 42 43 44 45 46 
##  1  2  3  4  9  5  8  7  5  9  4  4  5  4  5  7  3  1  1  1  1  1  1  2  1  2 
## 47 48 50 53 55 59 
##  1  1  3  1  1  1 
## 
## $Q8.2
## 
##    1    3    4    5    6    7    8   13   14 14.5   15   16   17   18 18.5   19 
##    1    1    1    2    1    1    1    4    6    1    7   19   18   18    1   10 
##   20   21   22   25 
##    7    2    2    1 
## 
## $Q8.3
## 
##    0    1  1.5    2  2.5    3  3.5    4  4.5    5  5.5    6    7  7.5    8  8.5 
##    8   10    3    4    5    3    4    3    4    7    1    6    8    5    2    1 
##    9  9.5   10 10.5   11   12   16   20   21   23   24   25   26   30   34   35 
##    2    1    4    1    1    4    2    6    1    2    1    1    1    1    1    1 
## 
## $Q8.5
## 
##  0  1  2  3  4  5  6  7 
## 73 10 12  4  1  2  1  1 
## 
## $Q8.6
## 
##  0  1  2  3  4  5  6  7  9 
##  2 18 30 23 16  9  3  2  1 
## 
## $Q.9
## 
##  1  2  3  5  7 
##  7 94  1  1  1 
## 
## $Q.11
## 
##  1  2  5 
## 95  5  4 
## 
## $Q.13
## 
##  1  2  3  4  6 
## 52 29  2  2 19 
## 
## $Q.14
## 
##  1  2 
## 23 81 
## 
## $Q.15
## 
##  1  2  3  6  7  8  9 10 11 99 
##  4  9  3  1  6 10 46  5 17  3 
## 
## $Q.16
## 
##  1  2  3  4  5  6  7 10 
##  1  2 13 78  2  5  2  1 
## 
## $Q.17
## 
##  1  2  3  4  5  6  7 99 
##  3  2  4 31  2 29 30  3 
## 
## $Q.18
## 
##  1  2 99 
## 41 62  1 
## 
## $Q.19
## 
##  1  2  3 
## 33 27 44 
## 
## $Q.20
## 
##  1  2  3  4  5  6  7 
##  8 11 20 28 14 12 11

We will also make histograms, with normal curves, of the four transformed PC variables.
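A sketch for one of the four indexes, the self-compassion PC1 scores; the other three histograms are drawn the same way.

# Histogram of the self-compassion PC1 index with a fitted normal curve
x <- final.analytic.data$sc.idx.1
m <- mean(x)
s <- sd(x)
hist(x, freq = FALSE, breaks = 15,
     main = "Self-compassion index (PC1)", xlab = "sc.idx.1")
curve(dnorm(x, mean = m, sd = s), add = TRUE, col = "blue", lwd = 2)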

The densities of both self-compassion PC index histograms follow the Normal distribution pattern. These variables are symmetric and unimodal, with no outliers.

The gratitude index for the second principal component follows the Normal distribution pattern; this variable is symmetric and unimodal, with no outliers. The first principal component, however, appears slightly left-skewed. This is probably not enough to greatly affect the analysis, but it suggests it would be best not to use this variable as the response in the model.

3.2 Regression Models

An ANCOVA model will be used for this regression analysis. The response variable, sc.idx.1, is numerical. The explanatory variables consist of all the demographic variables, which are categorical, and the gr.idx.1 variable, which is numerical. There was little evidence that any interaction terms were significant, so they were excluded from the model. The second PC index variables for self-compassion and gratitude were also removed, since they did not appear significant enough to affect the model.
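A sketch of the candidate fit and the residual diagnostics discussed in the points below, assuming the analytic data frame final.analytic.data from Section 2.4:

# Candidate ANCOVA model: all re-categorized demographics plus the gratitude PC1 index
candidate.fit <- lm(sc.idx.1 ~ grp.age + grp.edu + grp.empl + kid.num + home.size +
                      gender + race + marital.st + disability + religion +
                      sexual.orient + poli.affil + SW.program + urbanity +
                      spirituality + gr.idx.1,
                    data = final.analytic.data)

# Standard residual diagnostics: Residuals vs Fitted, Normal Q-Q,
# Scale-Location, and Residuals vs Leverage
par(mfrow = c(2, 2))
plot(candidate.fit)

# Sequential ANOVA table (shown below)
anova(candidate.fit)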

  • The line in the Residuals vs. Fitted plot is roughly straight, and the points appear evenly spread across the graph. Therefore, the assumption of linearity is not violated.

  • Most points on the Normal Q-Q plot are on the line, indicating the normality assumption is not violated.

  • The Scale-Location graph has no apparent pattern and the line is mostly straight. Therefore, the assumption of homogeneity of variance is not violated.

  • These residual plots indicate no serious violations of the model assumptions. In addition, all student responses were collected independently. Therefore, this ANCOVA model will be used as the final model.

## Analysis of Variance Table
## 
## Response: sc.idx.1
##               Df Sum Sq Mean Sq F value    Pr(>F)    
## grp.age        2  37.80  18.902  4.2798 0.0171898 *  
## grp.edu        2  18.11   9.053  2.0497 0.1355635    
## grp.empl       2   0.51   0.254  0.0575 0.9441862    
## kid.num        1   2.51   2.509  0.5680 0.4532985    
## home.size      1   7.58   7.575  1.7151 0.1941217    
## gender         2  10.53   5.266  1.1924 0.3088967    
## race           1   2.52   2.521  0.5707 0.4522184    
## marital.st     1   0.03   0.026  0.0058 0.9395648    
## disability     1   3.51   3.514  0.7957 0.3750998    
## religion       2   0.81   0.404  0.0914 0.9127818    
## sexual.orient  2   1.60   0.800  0.1811 0.8346621    
## poli.affil     2   1.97   0.987  0.2234 0.8002879    
## SW.program     1   0.73   0.732  0.1657 0.6850971    
## urbanity       2   4.42   2.210  0.5005 0.6081610    
## spirituality   1  12.73  12.734  2.8831 0.0934517 .  
## gr.idx.1       1  55.71  55.708 12.6131 0.0006493 ***
## Residuals     79 348.92   4.417                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From this ANCOVA model, the only variables with significant p-values (less than 0.05) are gr.idx.1, the gratitude PC1 index, and grp.age, the age group in years. Separate trials were performed in which variables with p-values greater than 0.8 were removed, but no other variables became significant after these were dropped. So, the regression model for the self-compassion index should include only the grp.age and gr.idx.1 variables.

## 
## Call:
## lm(formula = sc.idx.1 ~ grp.age + gr.idx.1, data = final.analytic.data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.8732 -1.5338 -0.1838  1.3321  4.2507 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -0.0596     0.3525  -0.169    0.866    
## grp.age[24,30]   0.7353     0.4791   1.535    0.128    
## grp.age[30,99]  -0.6395     0.4908  -1.303    0.196    
## gr.idx.1        -0.4906     0.1107  -4.433 2.39e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.987 on 100 degrees of freedom
## Multiple R-squared:  0.2262, Adjusted R-squared:  0.203 
## F-statistic: 9.743 on 3 and 100 DF,  p-value: 1.067e-05

Based on the output of the regression coefficients, the final model is as follows:

\[ sc.idx.1 = -0.0596 + 0.7353*(grp.age[24,30]) - 0.6395*(grp.age[30,99]) - 0.4906*(gr.idx.1) \]

where grp.age[24,30] and grp.age[30,99] are indicator (dummy) variables equal to 1 when a student falls in the corresponding age group and 0 otherwise.
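As an illustrative plug-in with hypothetical values (not taken from the data set), a student in the 24 to 30 age group with a gratitude index score of gr.idx.1 = 1 would have a predicted self-compassion index of about \(-0.0596 + 0.7353 - 0.4906 = 0.185\).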

4 Conclusion

From this linear ANCOVA regression model analysis, it appears that the gratitude index and the age group of BSW and MSW students in the Social Work Program directly affect their self-compassion index. It remained uncertain, however, whether any of the other demographic variables in the final analytic data set were significantly related to the self-compassion index. It was difficult to fit all of the variables into a single model, and removing variables with large p-values did not change which variables had significant p-values and which did not.

The data in the final analytic data set were properly managed. Using just the principal component indexes and re-categorized demographic variables helped the analysis run more smoothly. The ANCOVA regression model was a good choice for handling both categorical and numerical explanatory variables. However, since the large number of demographic variables complicated the ANCOVA regression model, using a smaller number of these variables might have been better. This could be done by including fewer variables in the data set or by asking the client whether any of the variables could be excluded from the analysis.