1 Loading Libraries

library(expss) # for the cross_cases() command

## Loading required package: maditr

## 
## To drop variable use NULL: let(mtcars, am = NULL) %>% head()

library(psych) # for the describe() command
library(car) # for the leveneTest() command

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:psych':
## 
##     logit

## The following object is masked from 'package:expss':
## 
##     recode

library(effsize) # for the cohen.d() command

## 
## Attaching package: 'effsize'

## The following object is masked from 'package:psych':
## 
##     cohen.d

2 Importing Data

# import the dataset you cleaned previously
# this will be the dataset you'll use throughout the rest of the semester
d <- read.csv(file = "Data/final.csv", header = TRUE)

3 Chi-square: State Your Hypothesis

There will be differences in participants’ income across the sex categories (in other words, participants’ income levels will not be evenly distributed across the sex categories).

4 Chi-square: Check Your Variables

# you only need to check the variables you're using in the current analysis
# although you checked them previously, it's always a good idea to look them over again and be sure that everything is correct
str(d)

## 'data.frame':    3182 obs. of  6 variables:
##  $ sex     : int  2 1 1 2 1 2 2 2 2 2 ...
##  $ income  : int  3 3 1 1 6 1 2 3 7 1 ...
##  $ belong  : num  2.6 4.2 3.8 4.2 3.4 4.2 4.3 3.8 2.9 2.5 ...
##  $ stress  : num  3.1 3.8 4.3 3 3.3 3.7 3.4 2.2 2.9 2.6 ...
##  $ swb     : num  4.33 4.17 1.83 5.17 3.67 ...
##  $ SocMedia: num  4.27 2.09 3.09 3.18 3.36 ...

# we can see in the str() command that our categorical variables (sex and income) are being read as integer variables
# to correct this, we'll use the as.factor() command


d$sex <- as.factor(d$sex)

d$income <- as.factor(d$income)

table(d$sex, useNA = "always")

## 
##    1    2    3 <NA> 
##  792 2332   54    4

table(d$income, useNA = "always")

## 
##    1    2    3    4    5    6    7    8    9 <NA> 
##  860  518  361  344  302  236  389  140    7   25

cross_cases(d, sex, income)

	income
	1	2	3	4	5	6	7	8	9
sex
1	210	118	100	79	65	61	117	37	2
2	630	387	257	260	234	170	271	100	5
3	20	13	4	5	3	5	1	3
#Total cases	860	518	361	344	302	236	389	140	7

5 Chi-square: Check Your Assumptions

5.1 Chi-square Test Assumptions

Data should be frequencies or counts
Variables and levels should be independent
There are two variables
At least 5 participants per cell

5.2 Issues with My Data

While my data meet the first three assumptions, I don’t have at least 5 participants in all cells. Both the other sex category (3) and the highest income category (9) contain very few participants, which leaves several cells with fewer than 5 participants.
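
One way to see exactly which cells fall short is sketched below. It is only an illustration: the counts are typed in by hand from the cross_cases() table above, and the blank cell (sex 3, income 9) is assumed to be a count of 0.

```r
# counts copied by hand from the sex-by-income table above
# assumption: the blank cell (sex 3, income 9) is a count of 0
counts <- matrix(
  c(210, 118, 100,  79,  65,  61, 117,  37, 2,
    630, 387, 257, 260, 234, 170, 271, 100, 5,
     20,  13,   4,   5,   3,   5,   1,   3, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(sex = c("1", "2", "3"), income = as.character(1:9))
)

# TRUE marks a cell with fewer than 5 participants
small <- counts < 5
print(small)

# how many cells fail the rule, and which rows/columns they sit in
n_small <- sum(small)
print(n_small)
print(which(small, arr.ind = TRUE))
```

Every flagged cell sits either in the sex 3 row or in the income 9 column, which is why those two categories are the ones that need attention.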

To proceed with this analysis, I will drop the other sex participants from my sample and add the 9 income (1,000,000+) participants to the 8 category (200,000-1,000,000). I will make a note to discuss this issue in my Method write-up and in my Discussion as a limitation of my study.

# we'll use the subset command to drop our non-binary participants
d <- subset(d, sex != "3") 
#using the '!=' sign here tells R to filter out the indicated criteria
# once we've dropped a level from our factor, we need to use the droplevels() command to remove it, or it will still show as 0

table(d$sex, useNA = "always")

## 
##    1    2    3 <NA> 
##  792 2332    0    0

d$sex<- droplevels(d$sex)

# we'll recode our income variable to combine our level 9 participants with our level 8 participants
d$income2 <- d$income 
# create a new variable (income2) identical to the current variable (income)

d$income2[d$income == "9"] <- "8" 
table(d$income2, useNA = "always")

## 
##    1    2    3    4    5    6    7    8    9 <NA> 
##  840  505  357  339  299  231  388  144    0   21

# we will use the droplevels() command again to remove the now-empty level 9
d$income2 <- droplevels(d$income2)
table(d$income2, useNA = "always")

## 
##    1    2    3    4    5    6    7    8 <NA> 
##  840  505  357  339  299  231  388  144   21
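
The recode-then-droplevels pattern can be tried out on a small made-up factor. The sketch below uses hypothetical toy values (not the homework data) and also shows one alternative, renaming the level directly, which merges the two levels in a single step.

```r
# hypothetical toy factor, not the real income variable
income <- factor(c("1", "8", "9", "8", "9", "2"))

# approach used above: copy, overwrite the values, then drop the empty level
income2 <- income
income2[income == "9"] <- "8"
print(table(income2))             # level 9 is still listed, with a count of 0
income2 <- droplevels(income2)
print(table(income2))             # level 9 is gone

# alternative: rename level "9" to "8"; R merges the duplicate levels
income3 <- income
levels(income3)[levels(income3) == "9"] <- "8"
print(table(income3))
```

Both approaches should give the same result; the first keeps each step visible, which makes it easier to check the counts along the way.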

# since I made changes to my variables, I am going to re-run the cross_cases() command
cross_cases(d, sex, income2)

	income2
	1	2	3	4	5	6	7	8
sex
1	210	118	100	79	65	61	117	39
2	630	387	257	260	234	170	271	105
#Total cases	840	505	357	339	299	231	388	144

6 Chi-square: Run a Chi-square Test

# we use the chisq.test() command to run our chi-square test
# the only arguments we need to specify are the variables we're using for the chi-square test
# we are saving the output from our chi-square test to the chi_output object so we can view it again later
chi_output <- chisq.test(d$sex, d$income2)

7 Chi-square: View Test Output

# to view the results of our chi-square test, we just have to call up the output we saved
chi_output

## 
##  Pearson's Chi-squared test
## 
## data:  d$sex and d$income2
## X-squared = 10.318, df = 7, p-value = 0.1712
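
As a rough cross-check, the same test can be run directly on the table of counts instead of the raw data. This sketch assumes the counts typed in below match the cross_cases() output after recoding; the object names (tab, chi_check) are made up for the example.

```r
# counts copied by hand from the recoded sex-by-income2 table
tab <- matrix(
  c(210, 118, 100,  79,  65,  61, 117,  39,
    630, 387, 257, 260, 234, 170, 271, 105),
  nrow = 2, byrow = TRUE,
  dimnames = list(sex = c("1", "2"), income2 = as.character(1:8))
)

# chisq.test() also accepts a matrix of counts
chi_check <- chisq.test(tab)
print(chi_check)

# N actually used in the test (cases with missing income are not in the table)
n_total <- sum(tab)
print(n_total)

# smallest expected count, to confirm no cell is too thin
min_expected <- min(chi_check$expected)
print(min_expected)
```

The sum of the table is the N to report alongside the chi-square statistic, since participants with missing income are excluded from the test.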

8 Chi-square: View Standardized Residuals

# to view the standardized residuals, we use the $ operator to access the stdres element of the chi_output object that we created
chi_output$stdres

##      d$income2
## d$sex          1          2          3          4          5          6
##     1 -0.3328005 -1.1622162  1.1919613 -0.9511924 -1.5405484  0.3555142
##     2  0.3328005  1.1622162 -1.1919613  0.9511924  1.5405484 -0.3555142
##      d$income2
## d$sex          7          8
##     1  2.2862683  0.4674201
##     2 -2.2862683 -0.4674201
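
To pick out the cells that stand out, the residuals can be compared against a cutoff. The sketch below assumes the common ±1.96 convention (that cutoff is my assumption, not something set earlier in this homework) and rebuilds the table by hand so it runs on its own.

```r
# counts copied by hand from the recoded sex-by-income2 table
tab <- matrix(
  c(210, 118, 100,  79,  65,  61, 117,  39,
    630, 387, 257, 260, 234, 170, 271, 105),
  nrow = 2, byrow = TRUE,
  dimnames = list(sex = c("1", "2"), income2 = as.character(1:8))
)

# standardized residuals, same element we pulled from chi_output
std_res <- chisq.test(tab)$stdres
print(round(std_res, 2))

# assumption: cells beyond +/- 1.96 are treated as notable
flagged <- abs(std_res) > 1.96
print(which(flagged, arr.ind = TRUE))
```

Only the income 7 column is flagged. Since the overall test was not significant (p = .171), a single flagged cell like this is best read as a descriptive pattern rather than as evidence for the hypothesis.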

9 Chi-square: Write Up Results

To test my hypothesis that there will be differences in participants’ income across the sex categories, I ran a chi-square test of independence. My variables met most of the criteria for running a chi-square test. However, there were too few participants in the other sex category and in the highest income category (level 9) to meet the criterion of at least five participants per cell. I therefore dropped the other sex participants from my sample and combined the highest income level (9) with the second-highest income level (8). The final sample for analysis can be seen in Table 1:

	income2
	1	2	3	4	5	6	7	8
sex
1	210	118	100	79	65	61	117	39
2	630	387	257	260	234	170	271	105

Contrary to my prediction, I did not find a significant difference in income across participants’ sex categories after excluding 21 participants with missing income data, χ²(7, N = 3103) = 10.32, p = .171.

Although the overall test was not significant, the standardized residuals show that sex 1 (male) participants were over-represented in income category 7 (100,000-199,999), with a standardized residual of 2.29. Because the overall test did not reach significance, this single cell should be interpreted with caution, and the results do not support my hypothesis that income differs across the sex categories.

10 T-test: State Your Hypothesis

I predict that women will report more stress than men, as measured by the perceived stress questionnaire.

11 T-test: Check Your Variables

# you only need to check the variables you're using in the current analysis
# although you checked them previously, it's always a good idea to look them over again and be sure that everything is correct
str(d)

## 'data.frame':    3124 obs. of  7 variables:
##  $ sex     : Factor w/ 2 levels "1","2": 2 1 1 2 1 2 2 2 2 2 ...
##  $ income  : Factor w/ 9 levels "1","2","3","4",..: 3 3 1 1 6 1 2 3 7 1 ...
##  $ belong  : num  2.6 4.2 3.8 4.2 3.4 4.2 4.3 3.8 2.9 2.5 ...
##  $ stress  : num  3.1 3.8 4.3 3 3.3 3.7 3.4 2.2 2.9 2.6 ...
##  $ swb     : num  4.33 4.17 1.83 5.17 3.67 ...
##  $ SocMedia: num  4.27 2.09 3.09 3.18 3.36 ...
##  $ income2 : Factor w/ 8 levels "1","2","3","4",..: 3 3 1 1 6 1 2 3 7 1 ...

d$stress <- as.numeric(d$stress)


# you can use the describe() command on an entire dataframe (d) or just on a single variable (d$stress)
describe(d$stress)

##    vars    n mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 3117 3.06 0.66    3.1    3.06 0.59   1   5     4 0.03    -0.04 0.01

# also use a histogram to examine your continuous variable
hist(d$stress)

# can use the describeBy() command to view the means and standard deviations by group
# it's very similar to the describe() command but splits the dataframe according to the 'group' variable
describeBy(d$stress, group=d$sex)

## 
##  Descriptive statistics by group 
## group: 1
##    vars   n mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 791 2.89 0.67    2.9    2.88 0.59 1.2 4.8   3.6 0.11    -0.13 0.02
## ------------------------------------------------------------ 
## group: 2
##    vars    n mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 2326 3.12 0.65    3.1    3.11 0.59   1   5     4 0.03        0 0.01

# last, use a boxplot to examine your continuous and categorical variables together
boxplot(d$stress~d$sex)

12 T-test: Check Your Assumptions

12.1 T-test Assumptions

IV must have two levels
Data values must be independent (independent t-test only)
Data obtained via a random sample
Dependent variable must be normally distributed
Variances of the two groups are approximately equal

12.2 Testing Homogeneity of Variance with Levene’s Test

We can test whether the variances of our two groups are equal using Levene’s test. The null hypothesis is that the variance between the two groups is equal, which is the result we want. So when running Levene’s test we’re hoping for a non-significant result!

# use the leveneTest() command from the car package to test homogeneity of variance
# uses the same 'formula' setup that we'll use for our t-test: formula is y~x, where y is our DV and x is our IV
leveneTest(stress~sex, data = d)

## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value Pr(>F)
## group    1  1.8866 0.1697
##       3115
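
To show what the test is doing under the hood, the median-centered version of Levene's test can be sketched in base R: take each score's absolute distance from its group median, then run a one-way ANOVA on those distances. The data below are simulated and purely hypothetical (the group sizes, means, and SDs are made up), so the numbers will not match the output above.

```r
# hypothetical simulated data, NOT the homework dataset
set.seed(1)
toy <- data.frame(
  stress = c(rnorm(40, mean = 2.9, sd = 0.67),
             rnorm(60, mean = 3.1, sd = 0.65)),
  sex = factor(rep(c("1", "2"), times = c(40, 60)))
)

# absolute deviation of each score from its own group's median
group_median <- ave(toy$stress, toy$sex, FUN = median)
toy$abs_dev <- abs(toy$stress - group_median)

# a one-way ANOVA on the absolute deviations is the Levene-type test
levene_fit <- anova(lm(abs_dev ~ sex, data = toy))
print(levene_fit)

# p-value for the group effect; non-significant means similar variances
levene_p <- levene_fit[["Pr(>F)"]][1]
print(levene_p)
```

If the groups have similar spread, their average absolute deviations will be similar and the F value will be small, which is the non-significant result we are hoping for.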

12.3 Issues with My Data

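Levene’s test was not significant, F(1, 3115) = 1.89, p = .170, so my data meet the assumption of homogeneity of variance. Because R’s t.test() command runs Welch’s t-test by default, which does not assume equal variances, I will report the Welch results.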

My independent variable originally had more than two levels. To proceed with this analysis, I dropped the other sex participants from my sample. I will make a note to discuss this issue in my Method write-up and in my Discussion as a limitation of my study.

13 T-test: Run a T-test

# very simple! we specify the dataframe alongside the variables instead of having a separate argument for the dataframe like we did for leveneTest()
t_output <- t.test(d$stress~d$sex)

14 T-test: View Test Output

t_output

## 
##  Welch Two Sample t-test
## 
## data:  d$stress by d$sex
## t = -8.5461, df = 1324.2, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group 1 and group 2 is not equal to 0
## 95 percent confidence interval:
##  -0.2867211 -0.1796631
## sample estimates:
## mean in group 1 mean in group 2 
##        2.885209        3.118401
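
The Welch t statistic and its degrees of freedom can be approximated by hand from the group summaries. This is a sketch only: the means come from the output above, but the SDs are the rounded values from describeBy(), so the results will be close to, not identical to, what t.test() reports.

```r
# group summaries (means from t_output, rounded SDs and n from describeBy)
m1 <- 2.885209; s1 <- 0.67; n1 <- 791    # group 1
m2 <- 3.118401; s2 <- 0.65; n2 <- 2326   # group 2

# squared standard error of each group mean
v1 <- s1^2 / n1
v2 <- s2^2 / n2

# Welch t: mean difference divided by the unpooled standard error
t_stat <- (m1 - m2) / sqrt(v1 + v2)

# Welch-Satterthwaite approximation for the degrees of freedom
df_welch <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))

# two-sided p-value
p_value <- 2 * pt(-abs(t_stat), df = df_welch)

print(c(t = t_stat, df = df_welch, p = p_value))
```

The negative sign just reflects the order of the groups: group 1's mean is subtracted from nothing larger, so a lower mean in group 1 gives a negative t.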

15 T-test: Calculate Cohen’s d

# once again, we use our formula to calculate cohen's d
d_output <- cohen.d(d$stress~d$sex)

16 T-test: View Effect Size

d_output

## 
## Cohen's d
## 
## d estimate: -0.3580285 (small)
## 95 percent confidence interval:
##      lower      upper 
## -0.4392202 -0.2768367
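
For reference, Cohen's d is the mean difference divided by the pooled standard deviation. The sketch below recomputes it from the group summaries, again using the rounded SDs from describeBy(), so it should land near the estimate above without matching it exactly.

```r
# group summaries (means from t_output, rounded SDs and n from describeBy)
m1 <- 2.885209; s1 <- 0.67; n1 <- 791    # group 1
m2 <- 3.118401; s2 <- 0.65; n2 <- 2326   # group 2

# pooled standard deviation, weighting each variance by its df
sd_pooled <- sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2))

# Cohen's d: standardized difference between the two group means
d_est <- (m1 - m2) / sd_pooled

print(round(c(sd_pooled = sd_pooled, d = d_est), 3))
```

Reading the result in plain terms, the two group means are a little over a third of a standard deviation apart.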

17 T-test: Write Up Results

To test my hypothesis that women will report more stress than men, as measured by the perceived stress questionnaire, I used a two-sample (independent) t-test. For this test, I dropped the ‘other’ response for the sex variable, as the t-test is limited to a two-group comparison. I tested the homogeneity of variance with Levene’s test and found no evidence of heterogeneity (p = .170). I nevertheless used Welch’s t-test, the default in R, which does not assume homogeneity of variance. My data met all other assumptions of a t-test.

As predicted, I found that women (M = 3.12, SD = 0.65) reported significantly higher stress than men (M = 2.89, SD = 0.67), t(1324.2) = -8.55, p < .001 (see Figure 1). The effect size was calculated using Cohen’s d, with a value of -0.36 (small effect; Cohen, 1988).

References

Cohen, J. (1988). Statistical power analysis for the behavioral sciences. New York, NY: Routledge Academic.

Chi-square and T-test Homework

Sophia Freeland

2023-05-30