library(expss) # for the cross_cases() command
## Loading required package: maditr
##
## To drop variable use NULL: let(mtcars, am = NULL) %>% head()
library(psych) # for the describe() command
library(car) # for the leveneTest() command
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
## The following object is masked from 'package:expss':
##
## recode
library(effsize) # for the cohen.d() command
##
## Attaching package: 'effsize'
## The following object is masked from 'package:psych':
##
## cohen.d
# import the dataset you cleaned previously
# this will be the dataset you'll use throughout the rest of the semester
d <- read.csv(file="Data/final.csv", header=TRUE)
There will be differences in participants’ income across the sex categories (in other words, participants’ income levels will not be evenly distributed across the sex categories).
# you only need to check the variables you're using in the current analysis
# although you checked them previously, it's always a good idea to look them over again and be sure that everything is correct
str(d)
## 'data.frame': 3182 obs. of 6 variables:
## $ sex : int 2 1 1 2 1 2 2 2 2 2 ...
## $ income : int 3 3 1 1 6 1 2 3 7 1 ...
## $ belong : num 2.6 4.2 3.8 4.2 3.4 4.2 4.3 3.8 2.9 2.5 ...
## $ stress : num 3.1 3.8 4.3 3 3.3 3.7 3.4 2.2 2.9 2.6 ...
## $ swb : num 4.33 4.17 1.83 5.17 3.67 ...
## $ SocMedia: num 4.27 2.09 3.09 3.18 3.36 ...
# we can see in the str() command that our categorical variables are being read as integers
# to correct this, we'll use the as.factor() command
d$sex <- as.factor(d$sex)
d$income <- as.factor(d$income)
table(d$sex, useNA = "always")
##
## 1 2 3 <NA>
## 792 2332 54 4
table(d$income, useNA = "always")
##
## 1 2 3 4 5 6 7 8 9 <NA>
## 860 518 361 344 302 236 389 140 7 25
cross_cases(d, sex, income)
|  income | |||||||||
|---|---|---|---|---|---|---|---|---|---|
| Â 1Â | Â 2Â | Â 3Â | Â 4Â | Â 5Â | Â 6Â | Â 7Â | Â 8Â | Â 9Â | |
|  sex | |||||||||
| Â Â Â 1Â | 210 | 118 | 100 | 79 | 65 | 61 | 117 | 37 | 2 |
| Â Â Â 2Â | 630 | 387 | 257 | 260 | 234 | 170 | 271 | 100 | 5 |
| Â Â Â 3Â | 20 | 13 | 4 | 5 | 3 | 5 | 1 | 3 | |
|    #Total cases | 860 | 518 | 361 | 344 | 302 | 236 | 389 | 140 | 7 |
While my data meet the first three assumptions, I don’t have at least 5 participants in every cell. Few participants fall in the ‘other’ sex category (3), and few fall in income category 9, so several cells contain fewer than 5 participants.
To proceed with this analysis, I will drop the other sex participants from my sample and add the 9 income (1,000,000+) participants to the 8 category (200,000-1,000,000). I will make a note to discuss this issue in my Method write-up and in my Discussion as a limitation of my study.
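The cell-size requirement is often checked against expected counts rather than observed counts. As a minimal sketch, assuming the observed counts from the cross_cases() table above are typed in by hand (the object names `obs`, `expected`, and `low_cells` are illustrative and not part of the original analysis), the expected counts can be computed in base R:

```r
# Observed counts typed in from the cross_cases() output (rows = sex, columns = income)
# The blank sex 3 / income 9 cell is assumed to be 0
obs <- matrix(
  c(210, 118, 100,  79,  65,  61, 117,  37, 2,
    630, 387, 257, 260, 234, 170, 271, 100, 5,
     20,  13,   4,   5,   3,   5,   1,   3, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(sex = c("1", "2", "3"), income = as.character(1:9))
)

# Expected count for each cell under independence:
# (row total * column total) / grand total
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# List the cells whose expected count is below 5
low_cells <- which(expected < 5, arr.ind = TRUE)
low_cells
```

Every flagged cell involves sex category 3 or income category 9, the same two categories this write-up identifies as too small.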
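# we'll use the subset() command to drop participants in the "other" sex category (3)
# note that subset() also drops rows where sex is NA
d <- subset(d, sex != "3")
# using the '!=' operator here tells R to keep only the rows that do not match the indicated value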
# once we've dropped a level from our factor, we need to use the droplevels() command to remove it, or it will still show as 0
table(d$sex, useNA = "always")
##
## 1 2 3 <NA>
## 792 2332 0 0
d$sex<- droplevels(d$sex)
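The two steps above (filtering rows, then removing the empty factor level) can be seen in isolation on a toy data frame. This is only a sketch: `toy` is a made-up object, not the real dataset.

```r
# Toy factor with an "other" category (3) and a missing value; not the real data
toy <- data.frame(sex = factor(c("1", "2", "3", NA, "2")))

# subset() keeps only rows where the condition is TRUE,
# so the NA row is dropped along with the "3" row
toy <- subset(toy, sex != "3")
table(toy$sex, useNA = "always")   # level 3 is still listed, with a count of 0

# droplevels() removes the now-empty level from the factor
toy$sex <- droplevels(toy$sex)
levels(toy$sex)
```

This is why the `<NA>` count for sex falls from 4 to 0 after the subset() call in the real data.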
# we'll recode our income variable to combine the category 9 participants with the category 8 participants
# create a new variable (income2) identical to the current variable (income)
d$income2 <- d$income
d$income2[d$income == "9"] <- "8"
table(d$income2, useNA = "always")
##
## 1 2 3 4 5 6 7 8 9 <NA>
## 840 505 357 339 299 231 388 144 0 21
# we will use droplevels() again to remove the now-empty level 9
d$income2 <- droplevels(d$income2)
table(d$income2, useNA = "always")
##
## 1 2 3 4 5 6 7 8 <NA>
## 840 505 357 339 299 231 388 144 21
# since I made changes to my variables, I am going to re-run the cross_cases() command
cross_cases(d, sex, income2)
| sex / income2 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| 1 | 210 | 118 | 100 | 79 | 65 | 61 | 117 | 39 |
| 2 | 630 | 387 | 257 | 260 | 234 | 170 | 271 | 105 |
| #Total cases | 840 | 505 | 357 | 339 | 299 | 231 | 388 | 144 |
# we use the chisq.test() command to run our chi-square test
# the only arguments we need to specify are the variables we're using for the chi-square test
# we are saving the output from our chi-square test to the chi_output object so we can view it again later
chi_output <- chisq.test(d$sex, d$income2)
# to view the results of our chi-square test, we just have to call up the output we saved
chi_output
##
## Pearson's Chi-squared test
##
## data: d$sex and d$income2
## X-squared = 10.318, df = 7, p-value = 0.1712
# to view the standardized residuals, we use the $ operator to access the stdres element of the chi_output object that we created
chi_output$stdres
## d$income2
## d$sex 1 2 3 4 5 6
## 1 -0.3328005 -1.1622162 1.1919613 -0.9511924 -1.5405484 0.3555142
## 2 0.3328005 1.1622162 -1.1919613 0.9511924 1.5405484 -0.3555142
## d$income2
## d$sex 7 8
## 1 2.2862683 0.4674201
## 2 -2.2862683 -0.4674201
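Standardized residuals are commonly compared against ±1.96 to see which cells depart from what independence would predict. As a minimal sketch, assuming the counts from the second cross_cases() table are typed in by hand (`obs2` and `chi_check` are illustrative names), the test can be re-run on the table of counts and the large residuals flagged:

```r
# Observed counts typed in from the second cross_cases() table (rows = sex, columns = income2)
obs2 <- matrix(
  c(210, 118, 100,  79,  65,  61, 117,  39,
    630, 387, 257, 260, 234, 170, 271, 105),
  nrow = 2, byrow = TRUE,
  dimnames = list(sex = c("1", "2"), income2 = as.character(1:8))
)

# chisq.test() also accepts a matrix of counts
chi_check <- chisq.test(obs2)
chi_check$statistic   # test statistic
chi_check$parameter   # degrees of freedom

# TRUE marks cells whose standardized residual is beyond +/- 1.96
abs(chi_check$stdres) > 1.96
```

Only the two cells in income category 7 are flagged, which matches the stdres output above.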
To test my hypothesis that there will be differences in participants’ income across the sex categories, I ran a chi-square test of independence. My variables met most of the criteria for running a chi-square test. However, there were too few participants in the ‘other’ sex category and in the highest income category (level 9) to meet the criterion of at least five participants per cell. I dropped the other sex participants from my sample and combined the highest income category (level 9) with the second-highest income category (level 8). The final sample for analysis can be seen in Table 1:
| sex / income2 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| 1 | 210 | 118 | 100 | 79 | 65 | 61 | 117 | 39 |
| 2 | 630 | 387 | 257 | 260 | 234 | 170 | 271 | 105 |
Contrary to my prediction, I did not find a significant difference in income across participants’ sex categories, χ2(7, N = 3103) = 10.32, p = .171.
The standardized residuals suggest that men (sex 1) were over-represented in income category 7 (100,000-199,999; standardized residual = 2.29). However, because the overall test was not significant, this single cell is not enough to support my hypothesis of an association between sex and income.
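The relationship between the test statistic, its degrees of freedom, and the p-value can be checked directly. A minimal sketch, using the values reported in the chi-square output and assuming the conventional alpha of .05 (the object names are illustrative):

```r
# Values taken from the chisq.test() output
chi_sq   <- 10.318
deg_free <- 7

# p-value: probability of a statistic at least this large if sex and income are independent
p_value <- pchisq(chi_sq, df = deg_free, lower.tail = FALSE)

# Critical value at alpha = .05; the test is significant only if chi_sq exceeds it
crit <- qchisq(0.95, df = deg_free)

p_value
crit
chi_sq > crit
```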
I predict that women will report more stress than men, as measured by the perceived stress questionnaire.
# you only need to check the variables you're using in the current analysis
# although you checked them previously, it's always a good idea to look them over again and be sure that everything is correct
str(d)
## 'data.frame': 3124 obs. of 7 variables:
## $ sex : Factor w/ 2 levels "1","2": 2 1 1 2 1 2 2 2 2 2 ...
## $ income : Factor w/ 9 levels "1","2","3","4",..: 3 3 1 1 6 1 2 3 7 1 ...
## $ belong : num 2.6 4.2 3.8 4.2 3.4 4.2 4.3 3.8 2.9 2.5 ...
## $ stress : num 3.1 3.8 4.3 3 3.3 3.7 3.4 2.2 2.9 2.6 ...
## $ swb : num 4.33 4.17 1.83 5.17 3.67 ...
## $ SocMedia: num 4.27 2.09 3.09 3.18 3.36 ...
## $ income2 : Factor w/ 8 levels "1","2","3","4",..: 3 3 1 1 6 1 2 3 7 1 ...
d$stress <- as.numeric(d$stress)
# you can use the describe() command on an entire dataframe (d) or just on a single variable (d$stress)
describe(d$stress)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 3117 3.06 0.66 3.1 3.06 0.59 1 5 4 0.03 -0.04 0.01
# also use a histogram to examine your continuous variable
hist(d$stress)
# can use the describeBy() command to view the means and standard deviations by group
# it's very similar to the describe() command but splits the dataframe according to the 'group' variable
describeBy(d$stress, group=d$sex)
##
## Descriptive statistics by group
## group: 1
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 791 2.89 0.67 2.9 2.88 0.59 1.2 4.8 3.6 0.11 -0.13 0.02
## ------------------------------------------------------------
## group: 2
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 2326 3.12 0.65 3.1 3.11 0.59 1 5 4 0.03 0 0.01
# last, use a boxplot to examine your continuous and categorical variables together
boxplot(d$stress~d$sex)
We can test whether the variances of our two groups are equal using Levene’s test. The null hypothesis is that the variance between the two groups is equal, which is the result we want. So when running Levene’s test we’re hoping for a non-significant result!
# use the leveneTest() command from the car package to test homogeneity of variance
# uses the same 'formula' setup that we'll use for our t-test: formula is y~x, where y is our DV and x is our IV
leveneTest(stress~sex, data = d)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 1.8866 0.1697
## 3115
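With `center = median`, Levene’s test amounts to a one-way ANOVA on each score’s absolute deviation from its own group median. The sketch below shows that idea in base R on simulated scores; the data frame `sim` is made up for illustration and is not the real dataset.

```r
set.seed(1)  # simulated scores, for illustration only (not the real data)
sim <- data.frame(
  group = factor(rep(c("1", "2"), times = c(50, 150))),
  score = c(rnorm(50, mean = 2.9, sd = 0.67), rnorm(150, mean = 3.1, sd = 0.65))
)

# Absolute deviation of each score from its own group's median
sim$abs_dev <- abs(sim$score - ave(sim$score, sim$group, FUN = median))

# One-way ANOVA on the absolute deviations
levene_by_hand <- anova(lm(abs_dev ~ group, data = sim))
levene_by_hand
```

A non-significant F in this table means the groups’ spreads are similar, which is the result we hope for.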
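Levene’s test was not significant, F(1, 3115) = 1.89, p = .170, so there is no evidence that the variances of the two groups differ. I will still use Welch’s t-test, which is the default in t.test() and does not assume equal variances.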
My independent variable has more than two levels. To proceed with this analysis, I dropped the other sex participants from my sample. I will make a note to discuss this issue in my Method write-up and in my Discussion as a limitation of my study.
# very simple! we specify the dataframe alongside the variables instead of having a separate argument for the dataframe like we did for leveneTest()
t_output <- t.test(d$stress~d$sex)
t_output
##
## Welch Two Sample t-test
##
## data: d$stress by d$sex
## t = -8.5461, df = 1324.2, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group 1 and group 2 is not equal to 0
## 95 percent confidence interval:
## -0.2867211 -0.1796631
## sample estimates:
## mean in group 1 mean in group 2
## 2.885209 3.118401
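The Welch statistic and its degrees of freedom can be approximated by hand from the group summaries. This is a rough sketch rather than a replacement for t.test(): the means come from the t.test() output, but the standard deviations are the rounded values printed by describeBy(), so the results will differ slightly from the output above.

```r
# Summary statistics taken from the describeBy() and t.test() output
m1 <- 2.885209; s1 <- 0.67; n1 <- 791    # group 1
m2 <- 3.118401; s2 <- 0.65; n2 <- 2326   # group 2

# Squared standard error contributed by each group
v1 <- s1^2 / n1
v2 <- s2^2 / n2

# Welch t statistic: difference in means over the unpooled standard error
t_welch <- (m1 - m2) / sqrt(v1 + v2)

# Welch-Satterthwaite degrees of freedom
df_welch <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))

t_welch
df_welch
```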
# once again, we use our formula to calculate cohen's d
d_output <- cohen.d(d$stress~d$sex)
d_output
##
## Cohen's d
##
## d estimate: -0.3580285 (small)
## 95 percent confidence interval:
## lower upper
## -0.4392202 -0.2768367
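Cohen’s d is the difference in group means divided by the pooled standard deviation. A hand calculation from the group summaries gives a rough check on the estimate; as a caveat, the standard deviations here are the rounded values from describeBy(), so the result only approximates the cohen.d() output.

```r
# Summary statistics taken from the describeBy() and t.test() output
m1 <- 2.885209; s1 <- 0.67; n1 <- 791    # group 1
m2 <- 3.118401; s2 <- 0.65; n2 <- 2326   # group 2

# Pooled standard deviation across the two groups
s_pooled <- sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2))

# Cohen's d: difference in means expressed in pooled-SD units
d_by_hand <- (m1 - m2) / s_pooled
d_by_hand
```

The negative sign only reflects the order of subtraction (group 1 minus group 2).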
To test my hypothesis that women will report more stress than men, as measured by the perceived stress questionnaire, I used a two-sample (independent-samples) t-test. For this test, I dropped the ‘other’ response for the sex variable, as the t-test is limited to a two-group comparison. I tested the homogeneity of variance with Levene’s test and found no evidence of heterogeneity (p = .170). I nevertheless used Welch’s t-test, which does not assume homogeneity of variance. My data met all other assumptions of a t-test.
As predicted, I found that women (M = 3.12, SD = 0.65) reported significantly higher stress than men (M = 2.89, SD = 0.67); t(1324.2) = -8.55, p < .001 (see Figure 1). The effect size was calculated using Cohen’s d, with a value of -0.36 (small effect; Cohen, 1988).
References
Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences. New York, NY: Routledge Academic.