The goal of this research project is to measure the level of self-compassion and self-care of BSW and MSW students in a Social Work Program at a regional university, and to build a logistic regression model for our data set. We will be using both the Self-Compassion Scale and the Gratitude Questionnaire. We hope to see how the students' self-compassion correlates with independent variables such as religiosity, spirituality, and others.
The original survey data has three components: the self-compassion items (Q2_1 through Q2_12), the gratitude items (Q3_1 through Q3_6), and the demographic variables.
These three components have different amounts of missing values. We split the original data set into three subsets and impute the missing values in the self-compassion and gratitude data based on the survey instruments. Since there are only a few missing values, we replace each missing value with the mode of the associated survey item. We then create indexes for the two instruments separately to aggregate the information in the two data sets.
In the original data set, the 12 variables named Q2_1, Q2_2, …, Q2_12 are the compassion-based variables. We impute the missing values for these variables by replacing each missing value with the mode of the corresponding survey item. Since there are only a few missing values in this instrument, the imputation will not materially impact future analyses.
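As a minimal sketch, assuming the raw responses sit in a data frame named `survey.data` (a hypothetical name), mode imputation can be written as a small helper:

```r
# Replace missing values in each item with the mode of its observed responses.
impute.mode <- function(x) {
  m <- as.numeric(names(which.max(table(x))))  # most frequent response
  x[is.na(x)] <- m
  x
}
q2.items <- paste0("Q2_", 1:12)
survey.data[q2.items] <- lapply(survey.data[q2.items], impute.mode)
```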
The 6 variables named Q3_1, Q3_2, …, Q3_6 are the gratitude-based variables. We use the same mode imputation for the gratitude-based variables as we did for the compassion-based variables. The gratitude questionnaire has few missing values as well, so the imputation will not impact future analyses. However, before the imputation can begin, we must reverse the Likert scales for Q3_3 and Q3_6, since they were in the opposite order by design. Once this reversing is complete, the imputation can proceed, as sketched below.
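A sketch of the reverse-coding step, assuming a 7-point response scale (for a different scale width, map a response x to min + max − x), followed by the same mode imputation:

```r
# Reverse the Likert scales for Q3_3 and Q3_6, then impute.
for (item in c("Q3_3", "Q3_6")) {
  survey.data[[item]] <- 8 - survey.data[[item]]  # 8 = 1 + 7 for a 7-point scale
}
q3.items <- paste0("Q3_", 1:6)
survey.data[q3.items] <- lapply(survey.data[q3.items], impute.mode)
```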
The demographic variables suffer from both missing values and imbalanced categories. Since the data set is fairly small (about 120 observations), we must impute the missing values in a meaningful way to maintain both the sample size and the statistical power of future analyses. About 15 entries in the data set were missing their demographic information entirely, so they were deleted from the final data.
A few missing values in years of education and length of employment were imputed using auxiliary information from related variables such as age.
We re-coded the demographic variables and then modified them according to the parameters below:
Here we use principal component analysis (PCA). PCA is one of the important dimension reduction methods in data science and machine learning; depending on which optimization criterion is used, it can be viewed as a statistical dimension reduction method. We use PCA to reduce a set of numerical variables to a smaller number of transformed variables. The R function prcomp() returns the factor loadings associated with the numerical variables. We use this method for the survey analysis since our two item sets (self-compassion and gratitude) contain only numerical values.
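A minimal sketch of the fitting step, assuming the imputed items still sit in the hypothetical `survey.data` frame used above. Since the 12 PC variances in the importance table below sum to 12, the PCA appears to have been run on standardized items, i.e., with `scale. = TRUE`:

```r
# PCA of the self-compassion and gratitude items (correlation-matrix PCA).
pca.sc <- prcomp(survey.data[paste0("Q2_", 1:12)], scale. = TRUE)
pca.gr <- prcomp(survey.data[paste0("Q3_", 1:6)],  scale. = TRUE)
round(pca.sc$rotation, 2)  # factor loadings for the self-compassion items
summary(pca.sc)            # standard deviations and variance proportions
```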
Next, we find the factor loadings of the fitted PCA for the self-compassion items. We can write an explicit system of linear transformations using the loadings.
|  | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 | PC9 | PC10 | PC11 | PC12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Q2_1 | 0.31 | -0.41 | 0.22 | 0.00 | 0.12 | -0.04 | 0.20 | 0.09 | 0.17 | 0.55 | 0.45 | 0.30 |
| Q2_2 | -0.28 | -0.32 | -0.21 | 0.13 | -0.55 | 0.25 | 0.03 | 0.08 | 0.52 | 0.21 | -0.25 | -0.10 |
| Q2_3 | -0.26 | -0.21 | 0.09 | -0.61 | -0.04 | 0.37 | 0.41 | -0.34 | -0.10 | -0.21 | 0.17 | 0.01 |
| Q2_4 | 0.28 | -0.33 | -0.12 | 0.44 | -0.36 | 0.24 | -0.09 | -0.18 | -0.37 | -0.38 | 0.22 | 0.23 |
| Q2_5 | -0.23 | -0.32 | -0.40 | -0.12 | 0.47 | 0.13 | -0.59 | -0.12 | 0.13 | -0.02 | 0.09 | 0.21 |
| Q2_6 | -0.32 | -0.20 | -0.24 | -0.17 | -0.15 | -0.17 | 0.05 | 0.51 | -0.61 | 0.24 | -0.09 | 0.13 |
| Q2_7 | -0.21 | -0.21 | 0.66 | -0.17 | -0.28 | -0.24 | -0.50 | 0.11 | 0.05 | -0.20 | 0.13 | -0.03 |
| Q2_8 | 0.32 | -0.34 | 0.24 | -0.09 | 0.12 | 0.14 | -0.13 | -0.25 | -0.25 | 0.24 | -0.68 | -0.15 |
| Q2_9 | 0.34 | -0.22 | -0.08 | -0.14 | 0.16 | 0.25 | -0.01 | 0.56 | 0.07 | -0.27 | 0.14 | -0.56 |
| Q2_10 | -0.22 | -0.47 | 0.01 | 0.28 | 0.29 | -0.50 | 0.39 | -0.05 | 0.11 | -0.33 | -0.15 | -0.09 |
| Q2_11 | 0.36 | 0.01 | -0.11 | -0.36 | -0.11 | -0.15 | 0.05 | 0.24 | 0.27 | -0.34 | -0.32 | 0.58 |
| Q2_12 | 0.28 | -0.09 | -0.40 | -0.32 | -0.31 | -0.53 | -0.12 | -0.34 | -0.01 | 0.10 | 0.17 | -0.33 |
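For example, reading down the PC1 column of the loading table, the first self-compassion component is the linear combination

\[
\begin{aligned}
PC_1 ={} & 0.31\,Q2\_1 - 0.28\,Q2\_2 - 0.26\,Q2\_3 + 0.28\,Q2\_4 - 0.23\,Q2\_5 - 0.32\,Q2\_6 \\
& - 0.21\,Q2\_7 + 0.32\,Q2\_8 + 0.34\,Q2\_9 - 0.22\,Q2\_{10} + 0.36\,Q2\_{11} + 0.28\,Q2\_{12},
\end{aligned}
\]

where each \(Q2\_j\) is understood to be centered and standardized before the transformation is applied.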
The importance of the principal components is given in the table below.
|  | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 | PC9 | PC10 | PC11 | PC12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Standard deviation | 2.182178 | 1.200923 | 0.9891865 | 0.9015108 | 0.8534749 | 0.8205392 | 0.8094385 | 0.7590303 | 0.6544726 | 0.6094223 | 0.5653666 | 0.5022745 |
| Proportion of Variance | 0.396830 | 0.120180 | 0.0815400 | 0.0677300 | 0.0607000 | 0.0561100 | 0.0546000 | 0.0480100 | 0.0356900 | 0.0309500 | 0.0266400 | 0.0210200 |
| Cumulative Proportion | 0.396830 | 0.517010 | 0.5985500 | 0.6662800 | 0.7269800 | 0.7830900 | 0.8376900 | 0.8857000 | 0.9213900 | 0.9523400 | 0.9789800 | 1.0000000 |
From the above table, we can see that the first PC for self-compassion explains about \(39.68\%\) of the variation, and the first two principal components together explain about \(51.70\%\) of the total variation. Using only the first two PCs in the data analysis therefore retains roughly half of the information in the 12 items.
Next, we find the factor loadings of the PCA fitted to the gratitude items. Again, we can write an explicit system of linear transformations using the loadings.
|  | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 |
|---|---|---|---|---|---|---|
| Q3_1 | 0.43 | -0.21 | 0.38 | -0.57 | 0.11 | -0.54 |
| Q3_2 | 0.46 | -0.03 | -0.10 | -0.21 | -0.77 | 0.37 |
| Q3_3 | -0.40 | -0.49 | 0.38 | 0.32 | -0.51 | -0.30 |
| Q3_4 | 0.41 | -0.22 | -0.61 | 0.43 | -0.01 | -0.47 |
| Q3_5 | 0.40 | -0.54 | 0.30 | 0.32 | 0.36 | 0.48 |
| Q3_6 | -0.35 | -0.61 | -0.48 | -0.49 | 0.09 | 0.17 |
The importance of the principal components is given in the table below.
|  | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 |
|---|---|---|---|---|---|---|
| Standard deviation | 1.76868 | 0.9192197 | 0.8332011 | 0.7598448 | 0.6504357 | 0.5763269 |
| Proportion of Variance | 0.52137 | 0.1408300 | 0.1157000 | 0.0962300 | 0.0705100 | 0.0553600 |
| Cumulative Proportion | 0.52137 | 0.6622000 | 0.7779000 | 0.8741300 | 0.9446400 | 1.0000000 |
From the above table, we can see that the first PC for gratitude explains about \(52.14\%\) of the variation, and the first two principal components together explain about \(66\%\) of the total variation.
We plan to use the two PCAs in the data modeling section. The principal component scores are the values of the new transformed variables; we can choose the first few principal components to use as response or predictor variables in relevant statistical models. The following code extracts the PC scores from the PCA procedure; the table below shows the first two rows of self-compassion scores.
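A minimal sketch of the extraction, using the prcomp objects assumed above (the `$x` component of a prcomp fit holds the scores):

```r
# Extract the PC scores: the values of the transformed variables.
sc.scores <- pca.sc$x   # self-compassion scores, one row per respondent
gr.scores <- pca.gr$x   # gratitude scores, one row per respondent
head(sc.scores, 2)      # first two rows, matching the table below
```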
| PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 | PC9 | PC10 | PC11 | PC12 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.5195498 | 0.2082051 | -1.3996389 | 0.7628791 | 0.7692141 | -0.2265131 | -1.3459741 | 0.8383519 | 1.4894686 | 0.0448812 | -0.7084262 | -0.3700073 |
| 2.1893405 | 0.5978567 | 0.3000878 | 0.6983834 | 0.0922666 | -0.2587521 | 0.8221165 | 0.2836054 | -0.7295202 | -0.8817436 | 0.6041537 | -0.0601710 |
Similarly, the table below shows the first two rows of PC scores for the gratitude items; these can likewise serve as response or predictor variables in statistical models.
| PC1 | PC2 | PC3 | PC4 | PC5 | PC6 |
|---|---|---|---|---|---|
| -0.894490 | 0.2570108 | -0.5275107 | -0.8914780 | 0.5038518 | 1.073425 |
| -2.284287 | 0.0547003 | 0.3367999 | -0.0473482 | -0.0991375 | -0.363172 |
This is our final analytic data set with transformed self-compassion index variable(s) and gratitude index variable(s) and modified demographic variables.
Now that we have a defined data set, we plan on moving forward with a logistic regression.
First, we want to check whether the PC scores for this data set appear to be approximately Normal.
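A minimal sketch of this check using Normal Q-Q plots (score objects as assumed above); roughly linear plots indicate approximate Normality:

```r
# Normal Q-Q plots for the leading PC scores.
par(mfrow = c(1, 2))
qqnorm(sc.scores[, 1], main = "Self-compassion PC1")
qqline(sc.scores[, 1])
qqnorm(gr.scores[, 1], main = "Gratitude PC1")
qqline(gr.scores[, 1])
```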
As shown above, the most nearly Normal scores are those of the first PC for self-compassion, so we will move forward with a logistic regression for this variable. We now create a binary response variable based on whether the PC score is positive or negative: if the score is positive, we define the response as 1; otherwise, the response is 0.
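A one-line sketch of the dichotomization, assuming the rows of the score matrix align with `final.analytic.data`:

```r
# Binary response: 1 if the self-compassion PC1 score is positive, else 0.
final.analytic.data$response <- ifelse(sc.scores[, 1] > 0, 1, 0)
```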
Since the response variable comes from the self-compassion portion of the survey, our first model will include all of Q2_1 through Q2_12. We will then filter out variables as needed.
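The call below reproduces the fit summarized in the output that follows. Note that, without a family argument, glm() defaults to the gaussian family (a linear probability model), which is what the dispersion line in the output indicates; a genuine logistic regression would add `family = binomial`:

```r
# First model: all 12 self-compassion items as predictors.
model1 <- glm(response ~ Q2_1 + Q2_2 + Q2_3 + Q2_4 + Q2_5 + Q2_6 +
                Q2_7 + Q2_8 + Q2_9 + Q2_10 + Q2_11 + Q2_12,
              data = final.analytic.data)
summary(model1)
# For a true logistic fit: add family = binomial to the glm() call.
```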
##
## Call:
## glm(formula = response ~ Q2_1 + Q2_2 + Q2_3 + Q2_4 + Q2_5 + Q2_6 +
## Q2_7 + Q2_8 + Q2_9 + Q2_10 + Q2_11 + Q2_12, data = final.analytic.data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.58648 -0.20501 -0.01231 0.25320 0.48934
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.74998 0.33329 2.250 0.02684 *
## Q2_1 0.04287 0.04177 1.026 0.30752
## Q2_2 -0.09688 0.04423 -2.190 0.03105 *
## Q2_3 -0.02423 0.03896 -0.622 0.53551
## Q2_4 0.03950 0.03012 1.311 0.19303
## Q2_5 -0.02212 0.03833 -0.577 0.56532
## Q2_6 -0.10014 0.03683 -2.719 0.00784 **
## Q2_7 -0.06166 0.03714 -1.660 0.10029
## Q2_8 0.04949 0.03697 1.338 0.18411
## Q2_9 0.05359 0.03728 1.437 0.15401
## Q2_10 -0.02976 0.04056 -0.734 0.46500
## Q2_11 0.05615 0.04812 1.167 0.24634
## Q2_12 0.01312 0.03629 0.361 0.71858
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.09335782)
##
## Null deviance: 25.8462 on 103 degrees of freedom
## Residual deviance: 8.4956 on 91 degrees of freedom
## AIC: 62.635
##
## Number of Fisher Scoring iterations: 2
As shown above, the only statistically significant variables (at the 0.05 level) in our first model are Q2_2 and Q2_6.
The residual plot also suggests that this model is not the best fit, so we refit using only the significant items.
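A sketch of the reduced fit whose summary follows, with a basic residual diagnostic:

```r
# Second model: keep only the significant items Q2_2 and Q2_6.
model2 <- glm(response ~ Q2_2 + Q2_6, data = final.analytic.data)
summary(model2)
plot(fitted(model2), resid(model2),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)  # reference line at zero
```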
##
## Call:
## glm(formula = response ~ Q2_2 + Q2_6, data = final.analytic.data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.79757 -0.26541 -0.01549 0.24748 0.76658
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.75284 0.13889 12.620 < 2e-16 ***
## Q2_2 -0.17303 0.04532 -3.818 0.000232 ***
## Q2_6 -0.21809 0.03701 -5.892 5.05e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.1304757)
##
## Null deviance: 25.846 on 103 degrees of freedom
## Residual deviance: 13.178 on 101 degrees of freedom
## AIC: 88.292
##
## Number of Fisher Scoring iterations: 2
As shown above, all variables are statistically significant, so our final model is \(\widehat{response} = 1.753 - 0.173(Q2\_2) - 0.218(Q2\_6)\).
In conclusion, our best model for predicting the first self-compassion PC is model 2: \(\widehat{response} = 1.753 - 0.173(Q2\_2) - 0.218(Q2\_6)\). Further analysis could test more complicated models, including those with hierarchical and interaction terms. We could also incorporate variables outside the self-compassion scale into the regression model, opening up models that may be more accurate for prediction and estimation.