1 Loading Libraries

library(expss) # for the cross_cases() command
## Loading required package: maditr
## 
## To get total summary skip 'by' argument: take_all(mtcars, mean)
library(psych) # for the describe() command
library(car) # for the leveneTest() command
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:psych':
## 
##     logit
## The following object is masked from 'package:expss':
## 
##     recode
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
## 
##     recode
## The following objects are masked from 'package:expss':
## 
##     compute, contains, na_if, recode, vars, where
## The following objects are masked from 'package:maditr':
## 
##     between, coalesce, first, last
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(effsize) # for the cohen.d() command
## 
## Attaching package: 'effsize'
## The following object is masked from 'package:psych':
## 
##     cohen.d

2 Importing Data

# import the dataset you cleaned previously
# this will be the dataset you'll use throughout the rest of the semester
d <- read.csv(file="~/Desktop/P421.R.PROJ/EAMMi2 Project/current and future HW/EAMMi2clean.csv", header=TRUE)

3 Chi Square: Hypothesis

There will be sex differences in political affiliation, with women more likely than men to identify as liberal-leaning. Specifically, women will be overrepresented in the political categories associated with liberal ideologies, so participants will be unevenly distributed across the political affiliation categories by sex. The sample includes men, women, and non-binary individuals.

4 Chi Square: Check Your Variables

# you only need to check the variables you're using in the current analysis
# although you checked them previously, it's always a good idea to look them over again and be sure that everything is correct
str(d)
## 'data.frame':    3182 obs. of  7 variables:
##  $ ResponseId: chr  "R_BJN3bQqi1zUMid3" "R_2TGbiBXmAtxywsD" "R_12G7bIqN2wB2N65" "R_39pldNoon8CePfP" ...
##  $ politics  : int  2 1 2 8 1 8 4 2 8 4 ...
##  $ sex       : int  2 1 1 2 1 2 2 2 2 2 ...
##  $ moa       : num  3.2 3.1 3.05 2.3 3.1 3.35 3.65 3.7 3.55 2.95 ...
##  $ swb       : num  4.33 4.17 1.83 5.17 3.67 ...
##  $ stress    : num  3.3 3.6 3.3 3.2 3.5 2.9 3.2 3 2.9 3.2 ...
##  $ mindful   : num  2.4 1.8 2.2 2.2 3.2 ...
# we can see in the str() command that our categorical variables (politics and sex) are being read as integer variables
# to correct this, we'll use the as.factor() command
d$politics <- as.factor(d$politics)
d$sex <- as.factor(d$sex)

table(d$politics, useNA = "always")
## 
##    1    2    3    4    5    6    7    8 <NA> 
##  235  772  373  568  308  332   57  532    5
table(d$sex, useNA = "always")
## 
##    1    2    3 <NA> 
##  792 2332   54    4
cross_cases(d, sex, politics)
                 politics
 sex                1    2    3    4    5    6    7    8
   1               44  149  100  154  109   94   23  118
   2              172  598  271  409  198  238   34  408
   3               18   24    2    5    1              4
   #Total cases   234  771  373  568  308  332   57  530

5 Chi Square: Assumptions

5.1 Chi-square Test Assumptions

  • Data should be frequencies or counts
  • Variables and levels should be independent
  • There are two variables (focusing on categorical)
  • At least 5 or more participants per cell

5.2 Issues with My Data

While my data meet the first three assumptions, I don’t have at least five participants in every cell. The non-binary group does not reach the threshold of five (5) participants per cell at every political affiliation level. Further, there is a large difference between the number of men (coded 1; n = 791) and women (coded 2; n = 2328) in the sample.
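The cell-count problem described above can be checked directly from the cross-tabulation. The block below is a minimal sketch, not part of the original analysis: it re-enters the counts from the sex-by-politics table by hand and assumes the two blank cells in the sex = 3 row are zeros. The object names (counts, expected, rows_with_small_cells) are illustrative. It looks at both observed and expected counts, since the five-per-cell guideline is usually stated in terms of expected frequencies.

```r
# Observed counts copied from the sex-by-politics table above
# (rows = sex codes 1-3, columns = politics codes 1-8).
# Assumption: the two blank cells in the sex = 3 row are zeros.
counts <- matrix(
  c( 44, 149, 100, 154, 109,  94, 23, 118,
    172, 598, 271, 409, 198, 238, 34, 408,
     18,  24,   2,   5,   1,   0,  0,   4),
  nrow = 3, byrow = TRUE,
  dimnames = list(sex = c("1", "2", "3"), politics = as.character(1:8))
)

# Expected counts under independence: (row total * column total) / N
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)

# Which sex groups have at least one observed cell below 5?
rows_with_small_cells <- rownames(counts)[rowSums(counts < 5) > 0]
rows_with_small_cells

# Smallest expected count within each sex group
round(apply(expected, 1, min), 2)
```

Only the non-binary row falls below the guideline, which is why that group is the one affected by the decision that follows.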

To proceed with this analysis, I will drop the non-binary participants from my sample. Dropping underrepresented participants is always a difficult choice and has the potential to further marginalize the group, but it is a necessary compromise to complete my analysis.

I will note these issues in the Method and Discussion write-up as a limitation of the study. The decision to drop underrepresented participants limits how far the findings can be generalized and raises concerns about potential marginalization. Further research should strive for more diverse and inclusive participant representation. Additionally, the sample contains far more women (2) than men (1), so estimates for women are more precise than those for men, and comparisons between the two groups should be interpreted with caution.

# we'll use the subset command to drop our non-binary participants
d <- subset(d, sex != "3") #using the '!=' sign here tells R to filter out the indicated criteria
# once we've dropped a level from our factor, we need to use the droplevels() command to remove it, or it will still show as 0
d$sex <- droplevels(d$sex)

table(d$sex, useNA = "always")
## 
##    1    2 <NA> 
##  792 2332    0
# since I made changes to my variables, I am going to re-run the cross_cases() command
cross_cases(d, sex, politics)
                 politics
 sex                1    2    3    4    5    6    7    8
   1               44  149  100  154  109   94   23  118
   2              172  598  271  409  198  238   34  408
   #Total cases   216  747  371  563  307  332   57  526

6 Run a Chi-square Test

# we use the chisq.test() command to run our chi-square test
# the only arguments we need to specify are the variables we're using for the chi-square test
# we are saving the output from our chi-square test to the chi_output object so we can view it again later
chi_output <- chisq.test(d$sex, d$politics)

7 Chi Square: View Test Output

# to view the results of our chi-square test, we just have to call up the output we saved
chi_output
## 
##  Pearson's Chi-squared test
## 
## data:  d$sex and d$politics
## X-squared = 43.455, df = 7, p-value = 2.725e-07
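As a cross-check, the same test can be rebuilt from the two-row count table instead of the raw data. This is a sketch under the assumption that the counts were re-entered correctly from the table above. The names counts, chi_check, and n_analyzed are illustrative. It also shows how many participants actually enter the test, because rows with a missing politics value are not counted.

```r
# Counts for men (1) and women (2) across politics codes 1-8,
# copied from the cross_cases() table after dropping sex = 3.
counts <- matrix(
  c( 44, 149, 100, 154, 109,  94, 23, 118,
    172, 598, 271, 409, 198, 238, 34, 408),
  nrow = 2, byrow = TRUE,
  dimnames = list(sex = c("1", "2"), politics = as.character(1:8))
)

# Pearson chi-square test on the table itself
chi_check <- chisq.test(counts)

# Number of participants contributing to the test
n_analyzed <- sum(counts)

chi_check$statistic   # should agree with the X-squared value above
chi_check$parameter   # degrees of freedom: (2 - 1) * (8 - 1)
n_analyzed
```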

8 Chi Square: View Standardized Residuals

# to view the standardized residuals, we use the $ operator to access the stdres element of the chi_output object that we created
chi_output$stdres
##      d$politics
## d$sex         1         2         3         4         5         6         7
##     1 -1.747331 -3.900172  0.751571  1.200534  4.302515  1.308105  2.625338
##     2  1.747331  3.900172 -0.751571 -1.200534 -4.302515 -1.308105 -2.625338
##      d$politics
## d$sex         8
##     1 -1.692358
##     2  1.692358
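To make the standardized residuals less of a black box, the sketch below recomputes them by hand from the count table and compares them with what chisq.test() returns. It assumes the counts were re-entered correctly from the table above. The variable names are illustrative. Each residual is (observed - expected) divided by the square root of expected * (1 - row proportion) * (1 - column proportion).

```r
# Counts for men (1) and women (2) across politics codes 1-8
counts <- matrix(
  c( 44, 149, 100, 154, 109,  94, 23, 118,
    172, 598, 271, 409, 198, 238, 34, 408),
  nrow = 2, byrow = TRUE,
  dimnames = list(sex = c("1", "2"), politics = as.character(1:8))
)

n        <- sum(counts)
row_prop <- rowSums(counts) / n   # share of the sample in each sex group
col_prop <- colSums(counts) / n   # share of the sample in each politics group

# Expected counts under independence
expected <- outer(rowSums(counts), colSums(counts)) / n

# Standardized (adjusted) residuals computed by hand
stdres_by_hand <- (counts - expected) /
  sqrt(expected * outer(1 - row_prop, 1 - col_prop))

# The same quantity as returned by chisq.test()
stdres_built_in <- chisq.test(counts)$stdres

round(stdres_by_hand, 2)
```

Values beyond roughly plus or minus 2 are the cells that depart most from independence, which is how the residuals are read in the write-up.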

9 Chi Square: Write Up Results

To test our hypothesis that there would be sex differences in political affiliation, we ran a chi-square test of independence. Our variables met most of the criteria for a chi-square test (the data were frequencies, the variables were independent, and there were two categorical variables). However, the non-binary group was too small to meet the criterion of at least five (5) participants per cell. To proceed with this analysis, we dropped the non-binary participants from our sample. The final sample for analysis can be seen in Table 1:

                 politics
 sex                1    2    3    4    5    6    7    8
   1               44  149  100  154  109   94   23  118
   2              172  598  271  409  198  238   34  408

There is variation in the counts across the political affiliation categories for both sexes.

As predicted, we found a sex difference across the political affiliation categories, χ²(7, N = 3119) = 43.46, p < .001.

–alternative–

As predicted, we found that women (2) were more likely than men (1) to identify with liberal-leaning political affiliation categories, χ²(7, N = 3119) = 43.46, p < .001.

The extremely small p-value suggests strong evidence against the null hypothesis of independence, indicating that there is a significant association between sex and politics in the studied sample.
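R prints very small p-values in scientific notation, while write-ups commonly report them as "p < .001" or to three decimals without a leading zero. The helper below is a small sketch of that convention. The function name format_p is hypothetical and not from any package.

```r
# Format a p-value for a results write-up:
# values below .001 become "p < .001"; otherwise three decimals, no leading zero.
format_p <- function(p) {
  if (p < .001) {
    return("p < .001")
  }
  paste0("p = ", sub("^0", "", formatC(p, format = "f", digits = 3)))
}

format_p(2.725e-07)   # the chi-square p-value reported above
format_p(0.001526)    # a p-value just above the .001 cutoff
```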

We observed that women (2) are overrepresented in the liberal political category (2) relative to the expected frequencies, while they are underrepresented in the slightly conservative political category (5).

The standardized residual for women (2) in the liberal category (2) is 3.90. This positive value indicates a substantially higher observed frequency than expected. Conversely, the standardized residual for women (2) in the slightly conservative category (5) is -4.30. This negative value indicates a substantially lower observed frequency than expected. These findings suggest that women (2), relative to men (1), are more likely to identify with liberal-leaning than with conservative-leaning political affiliations. This supports our hypothesis.

10 T-test: Hypothesis

Individuals holding liberal-leaning political beliefs will assign less importance to milestones in achieving adulthood than individuals holding conservative-leaning political beliefs. This prediction is based on the analysis of the MOA survey reports within the EAMMi2 data set.

11 T-test: Check Your Variables

# you only need to check the variables you're using in the current analysis
# although you checked them previously, it's always a good idea to look them over again and be sure that everything is correct
str(d)
## 'data.frame':    3124 obs. of  7 variables:
##  $ ResponseId: chr  "R_BJN3bQqi1zUMid3" "R_2TGbiBXmAtxywsD" "R_12G7bIqN2wB2N65" "R_39pldNoon8CePfP" ...
##  $ politics  : Factor w/ 8 levels "1","2","3","4",..: 2 1 2 8 1 8 4 2 8 4 ...
##  $ sex       : Factor w/ 2 levels "1","2": 2 1 1 2 1 2 2 2 2 2 ...
##  $ moa       : num  3.2 3.1 3.05 2.3 3.1 3.35 3.65 3.7 3.55 2.95 ...
##  $ swb       : num  4.33 4.17 1.83 5.17 3.67 ...
##  $ stress    : num  3.3 3.6 3.3 3.2 3.5 2.9 3.2 3 2.9 3.2 ...
##  $ mindful   : num  2.4 1.8 2.2 2.2 3.2 ...
d$politics <- as.factor(d$politics)

table(d$politics, useNA = "always")
## 
##    1    2    3    4    5    6    7    8 <NA> 
##  216  747  371  563  307  332   57  526    5
# you can use the describe() command on an entire dataframe (d) or just on a single variable (d$moa)
describe(d$moa)
##    vars    n mean   sd median trimmed  mad min max range  skew kurtosis   se
## X1    1 2951 3.28 0.45   3.35    3.32 0.44 1.3   4   2.7 -0.69     0.13 0.01
# also use a histogram to examine your continuous variable
hist(d$moa)

# can use the describeBy() command to view the means and standard deviations by group
# it's very similar to the describe() command but splits the dataframe according to the 'group' variable
describeBy(d$moa, group=d$politics)
## 
##  Descriptive statistics by group 
## group: 1
##    vars   n mean   sd median trimmed  mad  min max range  skew kurtosis   se
## X1    1 206  3.1 0.46    3.1    3.11 0.48 1.95   4  2.05 -0.13    -0.68 0.03
## ------------------------------------------------------------ 
## group: 2
##    vars   n mean   sd median trimmed  mad  min max range  skew kurtosis   se
## X1    1 708 3.19 0.44   3.25    3.21 0.52 1.65   4  2.35 -0.43    -0.32 0.02
## ------------------------------------------------------------ 
## group: 3
##    vars   n mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 352 3.23 0.43    3.3    3.26 0.44   2   4     2 -0.5    -0.24 0.02
## ------------------------------------------------------------ 
## group: 4
##    vars   n mean   sd median trimmed  mad  min max range  skew kurtosis   se
## X1    1 522 3.35 0.44   3.45    3.39 0.44 1.95   4  2.05 -0.82     0.15 0.02
## ------------------------------------------------------------ 
## group: 5
##    vars   n mean   sd median trimmed  mad  min max range  skew kurtosis   se
## X1    1 300 3.32 0.41    3.4    3.35 0.41 2.05   4  1.95 -0.53     -0.3 0.02
## ------------------------------------------------------------ 
## group: 6
##    vars   n mean   sd median trimmed  mad  min max range  skew kurtosis   se
## X1    1 316 3.43 0.42    3.5    3.47 0.37 1.75   4  2.25 -1.12     1.46 0.02
## ------------------------------------------------------------ 
## group: 7
##    vars  n mean   sd median trimmed  mad  min max range  skew kurtosis   se
## X1    1 51 3.43 0.42    3.5    3.47 0.44 2.05   4  1.95 -1.03     1.06 0.06
## ------------------------------------------------------------ 
## group: 8
##    vars   n mean   sd median trimmed  mad min max range  skew kurtosis   se
## X1    1 492 3.32 0.46    3.4    3.37 0.37 1.3   4   2.7 -1.16     1.69 0.02
# last, use a boxplot to examine your continuous and categorical variables together
boxplot(d$moa~d$politics)

12 T-test: Assumptions

12.1 T-test Assumptions

  • IV must have two levels
  • Data values must be independent (independent t-test only)
  • Data obtained via a random sample
  • Dependent variable must be normally distributed
  • Variances of the two groups are approximately equal

12.2 Testing Homogeneity of Variance with Levene’s Test

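# a t-test needs an IV with two levels, so we first drop the levels that are neither liberal nor conservative (8 here, then 4 below)
# note that subset() also drops rows where politics is NA
d <- subset(d, politics != "8")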
table(d$politics, useNA = "always")
## 
##    1    2    3    4    5    6    7    8 <NA> 
##  216  747  371  563  307  332   57    0    0
d$politics <- droplevels(d$politics) # using droplevels() to drop the empty factor

d <- subset(d, politics != "4")
table(d$politics, useNA = "always")
## 
##    1    2    3    4    5    6    7 <NA> 
##  216  747  371    0  307  332   57    0
d$politics <- droplevels(d$politics)

# Assuming the original political affiliation variable is named "politics"

# Modify the levels of the "politics" variable
d <- d %>%
  mutate(politics = case_when(
    politics %in% c(1, 2, 3) ~ "liberal",
    politics %in% c(5, 6, 7) ~ "conservative",
    TRUE ~ as.character(politics)  # Keep other values as they are
  ))

table(d$politics, useNA = "always")
## 
## conservative      liberal         <NA> 
##          696         1334            0
# Check the updated data (the dataframe is named d; calling head(data) would print R's built-in data() function instead)
head(d)
# use the leveneTest() command from the car package to test homogeneity of variance
# uses the same 'formula' setup that we'll use for our t-test: formula is y~x, where y is our DV and x is our IV
leveneTest(moa~politics, data = d)
## Warning in leveneTest.default(y = y, group = group, ...): group coerced to
## factor.
## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value   Pr(>F)   
## group    1  10.076 0.001526 **
##       1931                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
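For readers wondering what Levene's test with center = median does, it amounts to a one-way ANOVA on each score's absolute deviation from its own group median. The sketch below shows that logic in base R on simulated scores. The data are made up for illustration and are not the EAMMi2 values, and the names moa_sim, group_sim, and levene_sketch are hypothetical.

```r
# Simulated scores for two groups (illustration only, not the real data)
set.seed(421)
moa_sim   <- c(rnorm(60, mean = 3.2, sd = 0.45),
               rnorm(40, mean = 3.4, sd = 0.40))
group_sim <- factor(rep(c("liberal", "conservative"), times = c(60, 40)))

# Step 1: absolute deviation of each score from its own group's median
abs_dev <- abs(moa_sim - ave(moa_sim, group_sim, FUN = median))

# Step 2: one-way ANOVA on those deviations; the F test is the Levene statistic
levene_sketch <- anova(lm(abs_dev ~ group_sim))
levene_sketch
```

A significant F here means the groups differ in spread, which is the situation the output above reports for the real data.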

12.3 T-test: Issues with My Data

My independent variable ‘politics’ has more than two levels. To proceed with this analysis, I dropped the moderate and non-/a-political participants (levels 4 and 8) and combined the remaining levels into two super categories: ‘liberal’ (levels 1, 2, and 3) and ‘conservative’ (levels 5, 6, and 7). In addition, Levene’s test was significant (p = .002), so the two groups do not have equal variances; I will use Welch’s t-test, which does not assume equal variances. I will discuss these issues in my Method write-up and, as limitations of my study, in my Discussion.
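The recoding described above can also be done so that the result is stored as a two-level factor, which avoids the "group coerced to factor" warning from leveneTest(). The block below is a base-R sketch on a small made-up vector, not the real data. The names politics_raw, politics_kept, and politics_group are illustrative.

```r
# Made-up politics codes, including the levels to be dropped (4, 8) and an NA
politics_raw <- c(1, 2, 3, 4, 5, 6, 7, 8, NA, 2)

# Keep only the liberal (1-3) and conservative (5-7) codes; NA is dropped too
keep          <- politics_raw %in% c(1, 2, 3, 5, 6, 7)
politics_kept <- politics_raw[keep]

# Collapse into two super categories and store the result as a factor
politics_group <- factor(
  ifelse(politics_kept %in% c(1, 2, 3), "liberal", "conservative"),
  levels = c("conservative", "liberal")
)

table(politics_group, useNA = "always")
```

Listing the levels explicitly keeps the same conservative-then-liberal order that appears in the t-test output, so the sign of the difference stays the same.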

13 Run a T-test

# very simple! we specify the dataframe alongside the variables instead of having a separate argument for the dataframe like we did for leveneTest()
t_output <- t.test(d$moa~d$politics)

14 T-test: View Test Output

t_output
## 
##  Welch Two Sample t-test
## 
## data:  d$moa by d$politics
## t = 9.4737, df = 1429.7, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group conservative and group liberal is not equal to 0
## 95 percent confidence interval:
##  0.1532110 0.2332268
## sample estimates:
## mean in group conservative      mean in group liberal 
##                   3.380660                   3.187441
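The Welch statistic and its fractional degrees of freedom can be reproduced by hand, which helps explain why the df above is not a whole number. The sketch below uses simulated scores (not the real data) and a hypothetical helper named welch_sketch, then compares the result with t.test().

```r
# Welch's t-test by hand: unpooled standard error and
# Welch-Satterthwaite degrees of freedom
welch_sketch <- function(x, y) {
  x <- x[!is.na(x)]
  y <- y[!is.na(y)]
  vx <- var(x) / length(x)   # squared standard error of mean(x)
  vy <- var(y) / length(y)   # squared standard error of mean(y)
  t_stat <- (mean(x) - mean(y)) / sqrt(vx + vy)
  df <- (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))
  p  <- 2 * pt(-abs(t_stat), df)   # two-sided p-value
  c(t = t_stat, df = df, p = p)
}

# Simulated groups (illustration only)
set.seed(421)
cons <- rnorm(40, mean = 3.4, sd = 0.40)
lib  <- rnorm(60, mean = 3.2, sd = 0.45)

by_hand  <- welch_sketch(cons, lib)
built_in <- t.test(cons, lib)   # Welch is the default (var.equal = FALSE)

by_hand
```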

15 T-test: Calculate Cohen’s d

# once again, we use our formula to calculate cohen's d
d_output <- cohen.d(d$moa~d$politics)

16 T-test: View Effect Size

d_output
## 
## Cohen's d
## 
## d estimate: 0.4446365 (small)
## 95 percent confidence interval:
##     lower     upper 
## 0.3497610 0.5395121
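Cohen's d is the difference between the two group means divided by a standard deviation. The sketch below assumes the pooled-standard-deviation form and uses a hypothetical helper named cohens_d_sketch on a tiny made-up example. It is meant to show the arithmetic, not to replace the cohen.d() output above.

```r
# Cohen's d with a pooled standard deviation
cohens_d_sketch <- function(x, y) {
  x <- x[!is.na(x)]
  y <- y[!is.na(y)]
  nx <- length(x)
  ny <- length(y)
  pooled_sd <- sqrt(((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2))
  (mean(x) - mean(y)) / pooled_sd
}

# Tiny made-up example: means 4 and 3, both variances 4, so pooled SD = 2
d_example <- cohens_d_sketch(c(2, 4, 6), c(1, 3, 5))
d_example
```

By Cohen's (1988) benchmarks of 0.2, 0.5, and 0.8 for small, medium, and large effects, the estimate above of about 0.44 falls between small and medium.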

17 T-test: Write Up Results

To test our hypothesis that liberal-leaning individuals in our sample would rate markers of adulthood as less important than conservative-leaning individuals would, we used Welch’s two-sample (independent-samples) t-test. This required us to drop the moderate and non-/a-political participants from our sample and to combine the remaining degrees of liberal and conservative affiliation into two super categories. We tested homogeneity of variance with Levene’s test, which was significant, F(1, 1931) = 10.08, p = .002, indicating that the variances of the two groups differ and that the assumption of homogeneity of variance is violated. To correct for this, we used Welch’s t-test, which does not assume homogeneity of variance. Our data met all other assumptions of a t-test.

As predicted, liberal-leaning individuals (M = 3.19) rated markers of adulthood as significantly less important than conservative-leaning individuals did (M = 3.38), t(1429.7) = 9.47, p < .001 (see Figure 1). The effect size, Cohen’s d = 0.44, 95% CI [0.35, 0.54], was small (Cohen, 1988).

References

Cohen, J. (1988). Statistical power analysis for the behavioral sciences. New York, NY: Routledge Academic.