The Best Model for the Principal Components

Introduction

This analysis is intended for researchers studying the relationship between self-compassion and self-care, specifically among members of the social work community who are pursuing higher education. The data come from two surveys, along with some demographic information. The first survey is a set of questions about self-compassion, measured on a scale created by Neff. The second survey concerns gratitude and is measured on a scale from one to six. I analyze the data with ANOVA and ANCOVA models because they are designed for continuous response variables, which is what these data provide.

Data Management and Analytic Data Set Creation

Dealing with Missing Values

In the demographic data set, I assigned missing values the code 99, which carries no other meaning. These coded values were then placed into a category called either “NA” or “Other” so that observations with missing values could still be included in the final data set. In the compassion and gratitude data sets, I replaced missing values with the mode of the observed values. I handled these two data sets differently because their values are numerically meaningful; they are not simply placeholders for a category, the way a 2 means female in question 9 of the demographic data. Because these data sets are measured numerically, keeping an observation with a missing value requires replacing that value with the mean, the mode, or some other meaningful value.
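The two strategies described above can be sketched in a few lines of Python. This is only an illustration, not the code used for the report: the function names, the `gender_labels` coding (apart from 2 meaning female), and the sample responses are hypothetical, and imputing the mode item by item is an assumption about how the mode was calculated.

```python
from statistics import mode

MISSING_CODE = 99  # sentinel described in the text; all other names are hypothetical


def recode_demographic(value, labels):
    """Map a raw demographic code to a label; missing or unknown codes become 'NA'."""
    if value is None or value == MISSING_CODE:
        return "NA"
    return labels.get(value, "NA")


def impute_with_mode(responses):
    """Replace missing survey responses (None) with the mode of the observed ones."""
    observed = [r for r in responses if r is not None]
    fill = mode(observed)
    return [fill if r is None else r for r in responses]


# Hypothetical coding for question 9; the text only states that 2 means female.
gender_labels = {1: "Male", 2: "Female"}
print([recode_demographic(v, gender_labels) for v in [2, 1, None, 99]])

# Hypothetical responses to one survey item, with two missing answers.
print(impute_with_mode([4, 5, None, 4, 3, None]))
```

Keeping the sentinel handling separate from the imputation mirrors the distinction drawn in the text: categorical codes only need a label, while scale items need a numeric replacement.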

Definitions of Demographic Variables

In the demographic data set, some values were combined into new categories to avoid very small groups in a data set that is already on the smaller side. For question 9, regarding gender, the categories are Male, Female, and Other. For question 11, regarding race, the categories are Euro-American/White and Other. For question 13, regarding marital status, the categories are Single, Married/Civil Partner, and Other. For question 14, regarding disability, the categories are Yes and No. For question 15, regarding religious affiliation, the categories are Christian/Jewish/Buddhist, Higher Power/WitchCraft/NA, and No Religion. For question 16, regarding sexuality, the categories are Gay/Lesbian/Bisexual, Heterosexual, and Other. For question 17, regarding political affiliation, the categories are Democrat, Republican, Independent, and NA. For question 18, regarding education level, the categories are BSW and MSW.
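As an illustration of this kind of category collapsing, the sketch below maps raw religion labels to the three combined categories named above. The combined names come from the text; the raw labels, the helper `collapse`, and the sample answers are assumptions made for the example.

```python
from collections import Counter

# Combined categories are taken from the text; the raw labels on the left are assumed.
RELIGION_GROUPS = {
    "Christian": "Christian/Jewish/Buddhist",
    "Jewish": "Christian/Jewish/Buddhist",
    "Buddhist": "Christian/Jewish/Buddhist",
    "Higher Power": "Higher Power/WitchCraft/NA",
    "WitchCraft": "Higher Power/WitchCraft/NA",
    "No Religion": "No Religion",
}


def collapse(raw_labels, groups, default):
    """Map each raw label to its combined category; unmapped or missing labels get `default`."""
    return [groups.get(label, default) for label in raw_labels]


# Hypothetical raw answers, including one missing value (None).
raw = ["Christian", "No Religion", None, "Buddhist", "WitchCraft"]
combined = collapse(raw, RELIGION_GROUPS, default="Higher Power/WitchCraft/NA")
print(Counter(combined))
```

A frequency count after collapsing is a quick way to confirm that no combined category is left with only a handful of observations.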

Combining Survey Items

For this data set, I use principal component analysis (PCA) because it suits data sets with many variables, such as this one, and it preserves as much of the information as possible. It does so by reducing the number of dimensions, which makes the data easier to interpret and the trends easier to visualize. The summary below shows the importance of each component for the self-compassion scores. After reviewing it, I decided to use the first two principal components, which together account for about 70% of the variance.

## Importance of components:
##                           PC1    PC2    PC3     PC4    PC5     PC6
## Standard deviation     1.8365 0.8957 0.7763 0.72901 0.6317 0.54021
## Proportion of Variance 0.5621 0.1337 0.1004 0.08858 0.0665 0.04864
## Cumulative Proportion  0.5621 0.6958 0.7963 0.88486 0.9514 1.00000

The summary below shows the importance of each component for the gratitude scores. After reviewing it, I decided to use the first two principal components, which together account for about 67% of the variance.

## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6
## Standard deviation     1.7786 0.9132 0.8335 0.73805 0.66660 0.56481
## Proportion of Variance 0.5272 0.1390 0.1158 0.09079 0.07406 0.05317
## Cumulative Proportion  0.5272 0.6662 0.7820 0.87277 0.94683 1.00000
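The "Proportion of Variance" and "Cumulative Proportion" rows in the two summaries above follow directly from the standard deviations: each component's variance is its standard deviation squared, divided by the total variance. The Python sketch below reproduces that arithmetic from the reported values; the function name `variance_summary` is illustrative and is not part of the original analysis.

```python
def variance_summary(std_devs):
    """Return (proportions, cumulative proportions) of variance for PCA components."""
    variances = [s ** 2 for s in std_devs]
    total = sum(variances)
    proportions = [v / total for v in variances]
    cumulative = []
    running = 0.0
    for p in proportions:
        running += p
        cumulative.append(running)
    return proportions, cumulative


# Standard deviations copied from the two summaries above.
self_compassion_sd = [1.8365, 0.8957, 0.7763, 0.72901, 0.6317, 0.54021]
gratitude_sd = [1.7786, 0.9132, 0.8335, 0.73805, 0.66660, 0.56481]

for name, sds in [("self-compassion", self_compassion_sd), ("gratitude", gratitude_sd)]:
    proportions, cumulative = variance_summary(sds)
    print(name, [round(p, 4) for p in proportions], round(cumulative[1], 4))
```

The second printed number for each scale is the share of variance retained by keeping only the first two components, which is the quantity behind the decision to stop at two.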

The final analytic data sets used in the analysis come from the four principal components retained from these two data sets. There are four final analytic data sets, one for each principal component.

Data Analysis

The ANOVA and ANCOVA models are used to determine which principal component is best explained by education level and religion. Each type of model is fit to each of the four principal components obtained above.

Exploratory Data Analysis

The following are frequency tables for each variable within the demographic questionnaire.

## gender
## Female   Male  Other 
##     94      8      2

## race
## Euro-American/White               Other 
##                  95                   9

## marit.stat
## Married/Civil Partner                 Other                Single 
##                    29                    23                    52

## disability
##  No Yes 
##  81  23

## religion
##  Christian/Jewish/Buddhist Higher Power/WitchCraft/NA 
##                         33                         25 
##                No Religion 
##                         46

## sexuality
## Gay/Lesbian/Bisexual         Heterosexual                Other 
##                   16                   78                   10

## poli.affil
##   Democrats Independant          NA  Republican 
##          61          31           3           9

## grp.Edu
## BSW MSW 
##  41  63

Histograms of the four final analytic data sets show that the first, second, and fourth distributions appear approximately normal, while the third is strongly skewed.

Regression Models

ANOVA Models

Here, I fit an ANOVA model to each principal component, with education level as the predictor, and check each model for homoscedasticity by looking at its residual plots. Of the four models, the second looks best: it has the lowest p-value (0.0362) and the best-looking residual plots. The p-value is the probability of obtaining an F value at least this large if education level had no effect on the response, so a low p-value indicates an association between the second self-compassion principal component and level of education. In a residual plot, the smoothed residual line (the red line) should stay close to zero, which suggests there is no systematic pattern and there are no extreme outliers that might bias the fit.

##              Df Sum Sq Mean Sq F value Pr(>F)
## grp.Edu       1    5.7   5.707   1.703  0.195
## Residuals   102  341.7   3.350

##              Df Sum Sq Mean Sq F value Pr(>F)  
## grp.Edu       1   3.50   3.496   4.506 0.0362 *
## Residuals   102  79.14   0.776                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

##              Df Sum Sq Mean Sq F value Pr(>F)
## grp.Edu       1    1.9   1.881   0.592  0.443
## Residuals   102  323.9   3.176

##              Df Sum Sq Mean Sq F value Pr(>F)
## grp.Edu       1   1.83  1.8348   2.227  0.139
## Residuals   102  84.05  0.8241
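To make the columns of these tables concrete, the sketch below computes a one-way ANOVA F value from scratch in Python and then reproduces the F value of the second table from its reported sum of squares and degrees of freedom. The function `one_way_anova` and the `toy` data are hypothetical; the original tables appear to come from R's `aov`, which this sketch does not call.

```python
def one_way_anova(groups):
    """Return (df_between, df_within, ss_between, ss_within, f_value) for a dict of groups."""
    all_values = [x for values in groups.values() for x in values]
    grand_mean = sum(all_values) / len(all_values)
    ss_between = 0.0
    ss_within = 0.0
    for values in groups.values():
        group_mean = sum(values) / len(values)
        # Between-group variation: how far each group mean sits from the grand mean.
        ss_between += len(values) * (group_mean - grand_mean) ** 2
        # Within-group variation: spread of observations around their own group mean.
        ss_within += sum((x - group_mean) ** 2 for x in values)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return df_between, df_within, ss_between, ss_within, f_value


# Hypothetical scores for two education groups, for illustration only.
toy = {"BSW": [1.0, 2.0, 3.0], "MSW": [2.0, 4.0, 6.0]}
print(one_way_anova(toy))

# F value of the second table, rebuilt from its reported values:
# mean square for grp.Edu divided by the residual mean square.
f_second = (3.496 / 1) / (79.14 / 102)
print(round(f_second, 2))
```

The residual degrees of freedom of 102 in every table are consistent with 104 observations split into two education groups (41 BSW and 63 MSW).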

ANCOVA Models

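Here, I fit each principal component in an ANCOVA model, with education level and religion as predictors, to see whether these models would yield better results. Again, a low p-value is what I look for first when reading the model results. Based on the p-values, the 2nd, 3rd, and 4th models show a significant relationship between education level, religion, and the principal components. Thinking back on the assumption checks for the ANOVA models, the second model still performed best, making it the best of these four ANCOVA models. Note that the output below reports Shapiro-Wilk normality tests on each model's residuals. In these tests, a low p-value indicates a departure from normality rather than a significant predictor effect, so only the first model's residuals (p = 0.8242) are consistent with normality.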

## 
##  Shapiro-Wilk normality test
## 
## data:  resid(aov(sc.idx1 ~ grp.Edu + religion, data = final.analytic.data))
## W = 0.99229, p-value = 0.8242

## 
##  Shapiro-Wilk normality test
## 
## data:  resid(aov(sc.idx2 ~ grp.Edu + religion, data = final.analytic.data))
## W = 0.9432, p-value = 0.0002255

## 
##  Shapiro-Wilk normality test
## 
## data:  resid(aov(g.idx2.1 ~ grp.Edu + religion, data = final.analytic.data))
## W = 0.81586, p-value = 4.586e-10

## 
##  Shapiro-Wilk normality test
## 
## data:  resid(aov(g.idx2.2 ~ grp.Edu + religion, data = final.analytic.data))
## W = 0.90616, p-value = 1.877e-06
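Because a Shapiro-Wilk p-value is read in the opposite direction from an ANOVA p-value, it can help to spell out the decision rule. The Python sketch below applies it to the four p-values reported above; the 0.05 threshold is a conventional choice assumed here, not one stated in the report, and the helper name is illustrative.

```python
ALPHA = 0.05  # conventional significance level; an assumption, not stated in the report

# p-values copied from the four Shapiro-Wilk outputs above, keyed by response variable.
shapiro_p = {
    "sc.idx1": 0.8242,
    "sc.idx2": 0.0002255,
    "g.idx2.1": 4.586e-10,
    "g.idx2.2": 1.877e-06,
}


def normality_rejected(p_value, alpha=ALPHA):
    """The null hypothesis is that residuals are normal; a small p-value rejects it."""
    return p_value < alpha


for name, p in shapiro_p.items():
    verdict = "normality rejected" if normality_rejected(p) else "no evidence against normality"
    print(f"{name}: p = {p:.4g} -> {verdict}")
```

Under this rule, a model whose residuals fail the test may still be usable, but its F tests should be read with more caution or supported by a transformation or a nonparametric alternative.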

Conclusion and Discussion

After these steps of analysis, it can be concluded that there is some association between some of the principal components and the variables religion and education level. Some of the associations are stronger than others, but there is not enough information to conclude direct causation. With this in mind, further analysis could glean more information from the data set. If a rigorous analysis were performed to compare all of the demographic variables, identifying the most significant ones and discarding the less significant ones, the researchers might be able to start a new study focusing only on those demographic variables. Furthermore, I would suggest collecting more data: with so many variables and so few observations, the models may lose some of their power and, with it, their usefulness for research purposes. This study might also benefit from more than just linear models. Because only linear models were fit, there is a chance that a model with a quadratic term would be a better fit for some of the principal components.

Mikaela Taylor

11/20/2022