library(expss) # for the cross_cases() command
## Loading required package: maditr
##
## To get total summary skip 'by' argument: take_all(mtcars, mean)
library(psych) # for the describe() command
library(car) # for the leveneTest() command
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
## The following object is masked from 'package:expss':
##
## recode
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
##
## recode
## The following objects are masked from 'package:expss':
##
## compute, contains, na_if, recode, vars, where
## The following objects are masked from 'package:maditr':
##
## between, coalesce, first, last
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(effsize) # for the cohen.d() command
##
## Attaching package: 'effsize'
## The following object is masked from 'package:psych':
##
## cohen.d
# import the dataset you cleaned previously
# this will be the dataset you'll use throughout the rest of the semester
d <- read.csv(file="~/Desktop/P421.R.PROJ/EAMMi2 Project/current and future HW/EAMMi2clean.csv", header=TRUE)
There will be sex differences in political affiliation, with women being more likely than men to identify as liberal-leaning. Specifically, women will be overrepresented in the political categories associated with liberal ideologies, so participants of different sexes (men, women, and non-binary individuals) will be unevenly distributed across the political affiliation categories.
# you only need to check the variables you're using in the current analysis
# although you checked them previously, it's always a good idea to look them over again and be sure that everything is correct
str(d)
## 'data.frame': 3182 obs. of 7 variables:
## $ ResponseId: chr "R_BJN3bQqi1zUMid3" "R_2TGbiBXmAtxywsD" "R_12G7bIqN2wB2N65" "R_39pldNoon8CePfP" ...
## $ politics : int 2 1 2 8 1 8 4 2 8 4 ...
## $ sex : int 2 1 1 2 1 2 2 2 2 2 ...
## $ moa : num 3.2 3.1 3.05 2.3 3.1 3.35 3.65 3.7 3.55 2.95 ...
## $ swb : num 4.33 4.17 1.83 5.17 3.67 ...
## $ stress : num 3.3 3.6 3.3 3.2 3.5 2.9 3.2 3 2.9 3.2 ...
## $ mindful : num 2.4 1.8 2.2 2.2 3.2 ...
# we can see in the str() command that our categorical variables are being read as integer variables
# to correct this, we'll use the as.factor() command
d$politics <- as.factor(d$politics)
d$sex <- as.factor(d$sex)
table(d$politics, useNA = "always")
##
## 1 2 3 4 5 6 7 8 <NA>
## 235 772 373 568 308 332 57 532 5
table(d$sex, useNA = "always")
##
## 1 2 3 <NA>
## 792 2332 54 4
cross_cases(d, sex, politics)
|  politics | ||||||||
|---|---|---|---|---|---|---|---|---|
| Â 1Â | Â 2Â | Â 3Â | Â 4Â | Â 5Â | Â 6Â | Â 7Â | Â 8Â | |
|  sex | ||||||||
| Â Â Â 1Â | 44 | 149 | 100 | 154 | 109 | 94 | 23 | 118 |
| Â Â Â 2Â | 172 | 598 | 271 | 409 | 198 | 238 | 34 | 408 |
| Â Â Â 3Â | 18 | 24 | 2 | 5 | 1 | 4 | ||
|    #Total cases | 234 | 771 | 373 | 568 | 308 | 332 | 57 | 530 |
While my data meet the first three assumptions, I don’t have at least 5 participants in all cells: the non-binary group (sex = 3) falls below the five-participant threshold in five of the eight political affiliation levels. Further, there is a large difference between the number of men (coded 1; n = 791) and women (coded 2; n = 2328) in the sample.
To proceed with this analysis, I will drop the non-binary participants from my sample. Dropping underrepresented participants is always a difficult choice and has the potential to further marginalize the group, but it is a necessary compromise to complete my analysis.
I will make a note of these issues in the Method and Discussion write-up as a limitation of the study. The decision to drop underrepresented participants may limit the generalizability of the findings and raises concerns about potential marginalization. Further research should strive for more diverse and inclusive participant representation. Additionally, the sample contains roughly three times as many women (2) as men (1), so comparisons between these groups should be interpreted with caution.
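The five-per-cell rule is often stated in terms of expected rather than observed counts, so it can help to look at both. The sketch below rebuilds the cross-tabulation by hand from the counts printed above and computes the expected counts under independence. It assumes that the two blank cells for sex = 3 (politics 6 and 7) are zero counts, which is consistent with the column totals; the object names are my own.

```r
# Observed counts copied from the cross_cases() output above.
# Assumption: the two blank cells for sex = 3 (politics 6 and 7) are zeros.
observed <- matrix(
  c( 44, 149, 100, 154, 109,  94, 23, 118,
    172, 598, 271, 409, 198, 238, 34, 408,
     18,  24,   2,   5,   1,   0,  0,   4),
  nrow = 3, byrow = TRUE,
  dimnames = list(sex = c("1", "2", "3"), politics = as.character(1:8))
)

# Expected counts under independence: (row total * column total) / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Which cells fall below five, by observed count and by expected count?
low_observed <- which(observed < 5, arr.ind = TRUE)
low_expected <- which(expected < 5, arr.ind = TRUE)

# Expected counts for the non-binary row
round(expected["3", ], 2)
```

Under these assumptions, only the non-binary row contains low cells by either criterion, which is in line with the decision to drop that group before running the test.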
# we'll use the subset command to drop our non-binary participants
d <- subset(d, sex != "3") #using the '!=' sign here tells R to filter out the indicated criteria
# once we've dropped a level from our factor, we need to use the droplevels() command to remove it, or it will still show as 0
d$sex <- droplevels(d$sex)
table(d$sex, useNA = "always")
##
## 1 2 <NA>
## 792 2332 0
# since I made changes to my variables, I am going to re-run the cross_cases() command
cross_cases(d, sex, politics)
|  politics | ||||||||
|---|---|---|---|---|---|---|---|---|
| Â 1Â | Â 2Â | Â 3Â | Â 4Â | Â 5Â | Â 6Â | Â 7Â | Â 8Â | |
|  sex | ||||||||
| Â Â Â 1Â | 44 | 149 | 100 | 154 | 109 | 94 | 23 | 118 |
| Â Â Â 2Â | 172 | 598 | 271 | 409 | 198 | 238 | 34 | 408 |
|    #Total cases | 216 | 747 | 371 | 563 | 307 | 332 | 57 | 526 |
# we use the chisq.test() command to run our chi-square test
# the only arguments we need to specify are the variables we're using for the chi-square test
# we are saving the output from our chi-square test to the chi_output object so we can view it again later
chi_output <- chisq.test(d$sex, d$politics)
# to view the results of our chi-square test, we just have to call up the output we saved
chi_output
##
## Pearson's Chi-squared test
##
## data: d$sex and d$politics
## X-squared = 43.455, df = 7, p-value = 2.725e-07
# to view the standardized residuals, we use the $ operator to access the stdres element of the chi_output object that we created
chi_output$stdres
## d$politics
## d$sex 1 2 3 4 5 6 7
## 1 -1.747331 -3.900172 0.751571 1.200534 4.302515 1.308105 2.625338
## 2 1.747331 3.900172 -0.751571 -1.200534 -4.302515 -1.308105 -2.625338
## d$politics
## d$sex 8
## 1 -1.692358
## 2 1.692358
To test our hypothesis that there would be sex differences in participation across the political affiliation categories, we ran a chi-square test of independence. Our variables met most of the criteria for a chi-square test (the data were frequencies, the observations were independent, and there were two categorical variables). However, the low number of non-binary participants meant that not every cell contained at least five (5) participants. To proceed with this analysis, we dropped the non-binary participants from our sample. The final sample for analysis can be seen in Table 1:
|  politics | ||||||||
|---|---|---|---|---|---|---|---|---|
| Â 1Â | Â 2Â | Â 3Â | Â 4Â | Â 5Â | Â 6Â | Â 7Â | Â 8Â | |
|  sex | ||||||||
| Â Â Â 1Â | 44 | 149 | 100 | 154 | 109 | 94 | 23 | 118 |
| Â Â Â 2Â | 172 | 598 | 271 | 409 | 198 | 238 | 34 | 408 |
The counts vary across the political affiliation categories for both sexes.
As predicted, we found a sex difference across the political affiliation categories, χ²(7, N = 3119) = 43.46, p < .001.
–alternative–
As predicted, we found that women (2) were more likely than men (1) to identify with liberal-leaning political affiliation categories, χ²(7, N = 3119) = 43.46, p < .001.
The extremely small p-value suggests strong evidence against the null hypothesis of independence, indicating that there is a significant association between sex and politics in the studied sample.
We observed that women (2) are overrepresented in the liberal category (2) relative to the expected frequencies, and underrepresented in the slightly conservative category (5).
The standardized residual for women (2) in the liberal category (2) is 3.90. This positive value indicates a substantially higher observed frequency than expected. Conversely, the standardized residual for women (2) in the slightly conservative category (5) is -4.30, indicating a substantially lower observed frequency than expected. These findings suggest that women (2), relative to men (1), are more likely to identify with liberal-leaning than conservative-leaning political affiliations. This supports our hypothesis.
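As a rough check on the residuals discussed above, the sketch below rebuilds the final 2 x 8 table from the printed counts, reruns the test, and flags cells whose standardized residual exceeds 1.96 in absolute value (a common cutoff, not corrected for multiple cells). It also computes Cramér's V as an effect size; this statistic is not part of the original analysis and is included only as an illustration, with object names of my own choosing.

```r
# Counts from the final cross-tabulation (men = 1, women = 2)
observed <- matrix(
  c( 44, 149, 100, 154, 109,  94, 23, 118,
    172, 598, 271, 409, 198, 238, 34, 408),
  nrow = 2, byrow = TRUE,
  dimnames = list(sex = c("1", "2"), politics = as.character(1:8))
)

# Chi-square test of independence on the table of counts
chi <- chisq.test(observed)

# Flag cells whose standardized residual exceeds |1.96|
flagged <- abs(chi$stdres) > 1.96
colnames(observed)[flagged["2", ]]

# Cramér's V: sqrt(chi-square / (N * (min(rows, cols) - 1)))
n <- sum(observed)
cramers_v <- sqrt(unname(chi$statistic) / (n * (min(dim(observed)) - 1)))
round(cramers_v, 3)
```

With these counts, the flagged cells for women are politics levels 2, 5, and 7, and V comes out at roughly 0.12, which is usually read as a small association.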
Individuals holding liberal-leaning political beliefs will assign less importance to the markers of adulthood than individuals with conservative-leaning political beliefs. This prediction is tested using the markers of adulthood (MOA) scores in the EAMMi2 data set.
# you only need to check the variables you're using in the current analysis
# although you checked them previously, it's always a good idea to look them over again and be sure that everything is correct
str(d)
## 'data.frame': 3124 obs. of 7 variables:
## $ ResponseId: chr "R_BJN3bQqi1zUMid3" "R_2TGbiBXmAtxywsD" "R_12G7bIqN2wB2N65" "R_39pldNoon8CePfP" ...
## $ politics : Factor w/ 8 levels "1","2","3","4",..: 2 1 2 8 1 8 4 2 8 4 ...
## $ sex : Factor w/ 2 levels "1","2": 2 1 1 2 1 2 2 2 2 2 ...
## $ moa : num 3.2 3.1 3.05 2.3 3.1 3.35 3.65 3.7 3.55 2.95 ...
## $ swb : num 4.33 4.17 1.83 5.17 3.67 ...
## $ stress : num 3.3 3.6 3.3 3.2 3.5 2.9 3.2 3 2.9 3.2 ...
## $ mindful : num 2.4 1.8 2.2 2.2 3.2 ...
d$politics <- as.factor(d$politics)
table(d$politics, useNA = "always")
##
## 1 2 3 4 5 6 7 8 <NA>
## 216 747 371 563 307 332 57 526 5
# you can use the describe() command on an entire dataframe (d) or just on a single variable (d$moa)
describe(d$moa)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 2951 3.28 0.45 3.35 3.32 0.44 1.3 4 2.7 -0.69 0.13 0.01
# also use a histogram to examine your continuous variable
hist(d$moa)
# can use the describeBy() command to view the means and standard deviations by group
# it's very similar to the describe() command but splits the dataframe according to the 'group' variable
describeBy(d$moa, group=d$politics)
##
## Descriptive statistics by group
## group: 1
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 206 3.1 0.46 3.1 3.11 0.48 1.95 4 2.05 -0.13 -0.68 0.03
## ------------------------------------------------------------
## group: 2
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 708 3.19 0.44 3.25 3.21 0.52 1.65 4 2.35 -0.43 -0.32 0.02
## ------------------------------------------------------------
## group: 3
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 352 3.23 0.43 3.3 3.26 0.44 2 4 2 -0.5 -0.24 0.02
## ------------------------------------------------------------
## group: 4
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 522 3.35 0.44 3.45 3.39 0.44 1.95 4 2.05 -0.82 0.15 0.02
## ------------------------------------------------------------
## group: 5
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 300 3.32 0.41 3.4 3.35 0.41 2.05 4 1.95 -0.53 -0.3 0.02
## ------------------------------------------------------------
## group: 6
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 316 3.43 0.42 3.5 3.47 0.37 1.75 4 2.25 -1.12 1.46 0.02
## ------------------------------------------------------------
## group: 7
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 51 3.43 0.42 3.5 3.47 0.44 2.05 4 1.95 -1.03 1.06 0.06
## ------------------------------------------------------------
## group: 8
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 492 3.32 0.46 3.4 3.37 0.37 1.3 4 2.7 -1.16 1.69 0.02
# last, use a boxplot to examine your continuous and categorical variables together
boxplot(d$moa~d$politics)
d <- subset(d, politics != "8")
table(d$politics, useNA = "always")
##
## 1 2 3 4 5 6 7 8 <NA>
## 216 747 371 563 307 332 57 0 0
d$politics <- droplevels(d$politics) # using droplevels() to drop the empty factor
d <- subset(d, politics != "4")
table(d$politics, useNA = "always")
##
## 1 2 3 4 5 6 7 <NA>
## 216 747 371 0 307 332 57 0
d$politics <- droplevels(d$politics)
# Assuming the original political affiliation variable is named "politics"
# Modify the levels of the "politics" variable
d <- d %>%
  mutate(politics = case_when(
    politics %in% c(1, 2, 3) ~ "liberal",
    politics %in% c(5, 6, 7) ~ "conservative",
    TRUE ~ as.character(politics) # Keep other values as they are
  ))
table(d$politics, useNA = "always")
##
## conservative liberal <NA>
## 696 1334 0
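Note that case_when() returns a character vector here, which is why leveneTest() later warns that the group was coerced to a factor. A minimal base-R sketch of an alternative is shown below; it collapses the levels while keeping a factor with an explicit level order. The toy vector and object names are hypothetical stand-ins for d$politics after levels 4 and 8 have been dropped.

```r
# Hypothetical toy vector standing in for d$politics (levels 4 and 8 already dropped)
politics <- factor(c(1, 2, 3, 5, 6, 7, 2, 6))

# Convert factor labels to numbers via as.character() first;
# as.numeric() on a factor alone would return the internal level codes
politics_num <- as.numeric(as.character(politics))

# Collapse into two groups and keep the result as a factor with a fixed level order
politics2 <- factor(
  ifelse(politics_num <= 3, "liberal", "conservative"),
  levels = c("liberal", "conservative")
)
table(politics2)
```

Fixing the level order also controls which group comes first in later output, such as the direction of the difference reported by a t-test.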
# Check the updated data
head(d) # our dataframe is named 'd'; 'data' is a base R function, so head(data) would print its source code
# use the leveneTest() command from the car package to test homogeneity of variance
# uses the same 'formula' setup that we'll use for our t-test: formula is y~x, where y is our DV and x is our IV
leveneTest(moa~politics, data = d)
## Warning in leveneTest.default(y = y, group = group, ...): group coerced to
## factor.
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 10.076 0.001526 **
## 1931
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
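To make the Levene output above less of a black box: the median-centered version of the test amounts to a one-way ANOVA on each score's absolute deviation from its own group median. The sketch below reproduces that logic in base R on simulated data; the data, seed, and object names are hypothetical and only illustrate the mechanics.

```r
set.seed(421)  # arbitrary seed so the simulated example is reproducible

# Hypothetical data: two groups with different spreads
toy <- data.frame(
  score = c(rnorm(60, mean = 3.2, sd = 0.3), rnorm(60, mean = 3.4, sd = 0.6)),
  group = factor(rep(c("a", "b"), each = 60))
)

# Absolute deviation of each score from its own group median
toy$abs_dev <- abs(toy$score - ave(toy$score, toy$group, FUN = median))

# One-way ANOVA on those deviations gives the median-centered Levene F test
levene_fit <- anova(lm(abs_dev ~ group, data = toy))
levene_fit
```

A significant F in this table suggests that the groups differ in spread, which is the situation Welch's t-test is designed to handle.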
My independent variable ‘politics’ has more than two levels, but a t-test compares exactly two groups. To proceed with this analysis, I dropped levels 4 and 8 (the moderate and non-/a-political participants) and combined the remaining levels into two super categories: ‘liberal’ (levels 1, 2, 3) and ‘conservative’ (levels 5, 6, 7). I will discuss this in my Method write-up and, as a limitation of my study, in my Discussion.
# very simple! we specify the dataframe alongside the variables instead of having a separate argument for the dataframe like we did for leveneTest()
t_output <- t.test(d$moa~d$politics)
t_output
##
## Welch Two Sample t-test
##
## data: d$moa by d$politics
## t = 9.4737, df = 1429.7, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group conservative and group liberal is not equal to 0
## 95 percent confidence interval:
## 0.1532110 0.2332268
## sample estimates:
## mean in group conservative mean in group liberal
## 3.380660 3.187441
# once again, we use our formula to calculate cohen's d
d_output <- cohen.d(d$moa~d$politics)
d_output
##
## Cohen's d
##
## d estimate: 0.4446365 (small)
## 95 percent confidence interval:
## lower upper
## 0.3497610 0.5395121
To test our hypothesis that liberal-leaning individuals in our sample would rate markers of adulthood as less important than conservative-leaning individuals, we used Welch’s two-sample (independent-samples) t-test. This required us to drop the moderate and non-/a-political participants from our sample and to combine the degrees of liberal and conservative affiliation into two super categories. We tested the homogeneity of variance with Levene’s test, which was significant, F(1, 1931) = 10.08, p = .002, indicating that the two groups have unequal variances and that the assumption of homogeneity of variance is violated. To correct for this, we used Welch’s t-test, which does not assume homogeneity of variance. Our data met all other assumptions of a t-test.
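For readers who want to see what Welch's correction does, the sketch below computes the Welch t statistic and the Welch-Satterthwaite degrees of freedom by hand and compares them with t.test(), whose default is var.equal = FALSE. The scores are simulated and the group labels are hypothetical; they are not the EAMMi2 data.

```r
set.seed(421)  # arbitrary seed so the simulated example is reproducible

x <- rnorm(40, mean = 3.4, sd = 0.4)  # hypothetical "conservative" scores
y <- rnorm(80, mean = 3.2, sd = 0.5)  # hypothetical "liberal" scores

# Squared standard error of each group mean
se2_x <- var(x) / length(x)
se2_y <- var(y) / length(y)

# Welch t statistic and Welch-Satterthwaite degrees of freedom
t_manual  <- (mean(x) - mean(y)) / sqrt(se2_x + se2_y)
df_manual <- (se2_x + se2_y)^2 /
  (se2_x^2 / (length(x) - 1) + se2_y^2 / (length(y) - 1))

# t.test() defaults to var.equal = FALSE, i.e. the Welch version
welch <- t.test(x, y)
c(manual_t = t_manual, t_test = unname(welch$statistic))
c(manual_df = df_manual, t_test = unname(welch$parameter))
```

The fractional degrees of freedom (such as the 1429.7 reported above) come from this approximation, which shrinks the df when the group variances and sizes differ.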
As predicted, we found that liberal-leaning individuals (M = 3.19) rated markers of adulthood as significantly less important than conservative-leaning individuals (M = 3.38), t(1429.7) = 9.47, p < .001 (see Figure 1). The effect size, calculated using Cohen’s d, was 0.44 (small effect; Cohen, 1988).
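As a reminder of what the effect size measures, the sketch below computes Cohen's d by hand as the mean difference divided by the pooled standard deviation. The function name and toy scores are hypothetical; this is the textbook pooled-SD definition and is meant to illustrate the idea rather than to replicate every option of the cohen.d() command.

```r
# Cohen's d with a pooled standard deviation (classic two-group definition)
cohens_d <- function(x, y) {
  # Drop missing values within each group
  x <- x[!is.na(x)]
  y <- y[!is.na(y)]
  # Pooled SD weights each group's variance by its degrees of freedom
  pooled_sd <- sqrt(((length(x) - 1) * var(x) + (length(y) - 1) * var(y)) /
                      (length(x) + length(y) - 2))
  (mean(x) - mean(y)) / pooled_sd
}

# Hypothetical toy scores
group_a <- c(1, 2, 3, 4, 5)
group_b <- c(2, 3, 4, 5, 6)
cohens_d(group_a, group_b)
```

For these toy vectors the function returns about -0.63: a mean difference of -1 divided by a pooled SD of about 1.58. The sign only reflects which group is entered first.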
References
Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences. New York, NY: Routledge Academic.