This guide was created so you can easily look up basic R code, as well as common functions used in psychology research!
Here is a list of packages that may be useful to you:
library(haven): This package lets you read data files from other statistics programs (SPSS, Stata, and SAS) into R when you do not have a .csv file. This is especially helpful with SPSS files.
library(psych): This package has basic psychology functions such as describe, which will be demonstrated later on.
library(car): This package lets you run Levene’s test, which tests the assumption of homogeneity of variance (useful for t-tests, ANOVA, and linear regression).
Here is how you install packages, and bring them into the script to use.
Helpful Hint: When you install a package, you always want to make sure that the name of the package is in quotation marks. When you load the package into the script using the library command, you do not need quotation marks. You should only need to install a package once, but you will need to use the library command to load the package every time you start a new R session.
options(repos = c(CRAN = "https://cran.r-project.org/"))
install.packages("psych")
##
## The downloaded binary packages are in
## /var/folders/l_/2dd6by550hg9d_bws929z9x40000gn/T//RtmpQ2so5D/downloaded_packages
library(psych)
Note: The first line of code (options(repos = ...)) is not necessary in your own script; it was only needed to create this guide.
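If you share a script with others, it can help to check which packages are already installed before installing anything. Here is a minimal sketch using base R's requireNamespace function. The object names needed, is_installed, and not_installed are placeholders chosen for this example, and the install line is left commented out so that nothing is downloaded by accident.

```r
# Packages this guide uses
needed <- c("psych", "car", "haven")

# requireNamespace() returns TRUE if a package is installed, FALSE otherwise
is_installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)

# Names of the packages that still need to be installed (may be empty)
not_installed <- needed[!is_installed]
print(not_installed)

# Uncomment the next line to install only what is missing:
# if (length(not_installed) > 0) install.packages(not_installed)
```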
Let’s bring in some data!
We’ll be using the IRSJ Lab Guide data file in this guide. When you bring in a real data set, the first thing you want to do is name your data frame. Typically, I just call mine data, but when you are working with multiple files it is helpful to give each one a descriptive name, such as “school.data”.
Here is the code for reading in a data file (csv).
data <- read.csv("~/Desktop/IRSJ Lab Guide Data.csv")
Helpful Hint: The text inside the parentheses of read.csv is the file path, which tells R where to find your data file. The easiest way to get this file path is to locate the file in your Finder (Mac), copy it, and then paste it into R. This pastes the file path without you having to do any guesswork. However, if you want to write it out by hand, on a Mac the pattern is “~/location/folder name/file name”.
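If R says it cannot find your file, the path is usually the problem. The sketch below shows one way to build a path and check it before reading. The file name "example data.csv" and the tiny data set are made up for this example, and the file is written to a temporary folder so the code runs on any computer.

```r
# Build a path from its parts; file.path() inserts the separators for you
csv_path <- file.path(tempdir(), "example data.csv")

# Write a tiny made-up data set so the example has a file to read
write.csv(data.frame(Sub_num = 1:3, Mathquiz = c(43, 49, 26)),
          csv_path, row.names = FALSE)

# Check that the file exists before trying to read it
if (file.exists(csv_path)) {
  example_data <- read.csv(csv_path)
  print(example_data)
} else {
  message("File not found: ", csv_path)
}
```

In an interactive session, read.csv(file.choose()) is another option: it opens a file picker so you do not have to type the path at all.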
Let’s examine our data!
The View function opens our data file in another window.
View(data)
The head function shows us the first few rows of our data (six by default).
head(data)
## Sub_num Gender Major Reason Exp_cond Coffee Num_cups
## 1 1 female psychology advisor recommendation easy yes 0
## 2 2 female psychology personal interest easy no 0
## 3 3 female psychology program requirement easy no 0
## 4 4 female psychology program requirement easy no 0
## 5 5 female psychology program requirement easy no 1
## 6 6 female psychology program requirement moderate yes 1
## Phobia Prevmath Mathquiz Statquiz Exp_sqz Hr_base Hr_pre Hr_post Anx_base
## 1 1 3 43 6 7 71 68 65 17
## 2 1 4 49 9 11 73 75 68 17
## 3 4 1 26 8 8 69 76 72 19
## 4 4 0 29 7 8 72 73 78 19
## 5 10 1 31 6 6 71 83 74 26
## 6 4 1 20 7 6 70 71 76 12
## Anx_pre Anx_post
## 1 22 20
## 2 19 16
## 3 14 15
## 4 13 16
## 5 30 25
## 6 15 19
When you are planning to do statistical analysis, it’s important to know what kind of variables you are working with.
Character variables are typically categorical, text-based variables such as Gender or Major.
Integer variables are numeric variables that hold whole numbers only (for example, a quiz score of 43). Numbers with decimals are stored in a separate class called numeric.
Let’s look at what variable types we have in this data set.
class(data$Gender)
## [1] "character"
class(data$Major)
## [1] "character"
class(data$Mathquiz)
## [1] "integer"
class(data$Phobia)
## [1] "integer"
In this data set, Gender and Major are character variables, and Math Quiz scores and Phobia levels are integer variables.
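Instead of checking one column at a time, you can look at the class of every column at once. This is a small sketch on a made-up data frame called toy; the Hours column is hypothetical and is only there to show how a decimal variable is stored.

```r
# A tiny made-up data frame with three kinds of columns
toy <- data.frame(
  Gender   = c("female", "male", "female"),
  Mathquiz = c(43L, 49L, 26L),   # the L suffix stores whole numbers as integer
  Hours    = c(1.5, 2.0, 0.75),  # decimals are stored as numeric
  stringsAsFactors = FALSE
)

# Class of every column in one call
col_classes <- sapply(toy, class)
print(col_classes)

# str() gives a compact overview: class plus the first few values
str(toy)
```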
To use character variables as grouping variables in our analyses, we should change them into factor variables.
data$Gender <- as.factor(data$Gender)
data$Major <- as.factor(data$Major)
Now we can check the class of Gender and Major to see if it changed.
class(data$Gender)
## [1] "factor"
class(data$Major)
## [1] "factor"
Let’s look at the levels of our factor variables.
levels(data$Gender)
## [1] "female" "male"
levels(data$Major)
## [1] "biology" "economics" "pre-med" "psychology" "sociology"
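Once a variable is a factor, it is often useful to count how many cases fall in each level and, if needed, to change which level comes first. The sketch below uses a made-up vector called gender rather than the guide's data.

```r
# A made-up factor with two levels
gender <- factor(c("female", "male", "female", "female", "male"))

# Count the number of cases in each level
gender_counts <- table(gender)
print(gender_counts)

# Levels are alphabetical by default; relevel() moves one level to the front
gender_male_first <- relevel(gender, ref = "male")
print(levels(gender_male_first))
```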
Often, it is helpful to subset data for analysis.
Anytime you are trying to look at specific pieces of data from a dataset, you use brackets. The first part of the brackets indicates the rows you want to look at (example: for Quiz Scores Male, the first part of the brackets shows that we want all the rows where the column Gender equals male). The second part of the brackets indicates the columns you want to look at (example: for Quiz Scores Male, the second part of the brackets shows that we want all of the Mathquiz column).
name.of.data.set[rows,columns]
Quiz.Scores.Male <- data[data$Gender == "male", "Mathquiz"]
Quiz.Scores.Male <- data.matrix(Quiz.Scores.Male)
Quiz.Scores.Female <- data[data$Gender == "female", "Mathquiz"]
Quiz.Scores.Female <- data.matrix(Quiz.Scores.Female)
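To see the [rows, columns] pattern on a data set small enough to check by eye, here is a sketch with a made-up data frame called toy. The column names mirror the guide's data, but the values are invented.

```r
# A tiny made-up data frame
toy <- data.frame(
  Gender   = c("female", "male", "female", "male"),
  Mathquiz = c(43, 31, 26, 20),
  Statquiz = c(6, 9, 8, 7)
)

# Rows where Gender is male, Mathquiz column only
male_quiz <- toy[toy$Gender == "male", "Mathquiz"]
print(male_quiz)

# Leaving the column part empty keeps all columns
female_rows <- toy[toy$Gender == "female", ]
print(female_rows)

# Any logical condition works for rows; several columns go inside c()
high_scores <- toy[toy$Mathquiz > 30, c("Gender", "Mathquiz")]
print(high_scores)
```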
Now that we have everything set up, we can plan our t-test, comparing math quiz scores by gender.
Let’s start by looking at descriptive statistics of our outcome variable, math quiz scores.
describe(data$Mathquiz)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 85 29.07 9.48 30 29.26 10.38 9 49 40 -0.19 -0.58 1.03
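Since we are about to compare groups, it can also help to look at descriptive statistics separately for each group. The sketch below does this with base R's tapply function on a made-up data frame called toy, with one missing value included to show why na.rm = TRUE is needed.

```r
# Made-up scores, including one missing value
toy <- data.frame(
  Gender   = c("female", "female", "female", "male", "male", "male"),
  Mathquiz = c(43, 49, 26, 31, NA, 20)
)

# Mean and standard deviation of Mathquiz within each Gender group
group_means <- tapply(toy$Mathquiz, toy$Gender, mean, na.rm = TRUE)
group_sds   <- tapply(toy$Mathquiz, toy$Gender, sd, na.rm = TRUE)

print(group_means)
print(group_sds)
```

The psych package also has a describeBy function that produces the full describe output for each group.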
Here are some ways that we can plot the data visually.
When you are using the boxplot function with a formula, your numeric outcome variable goes to the left of the ~, and your categorical (factor) predictor variable goes to the right. To split the plot by more than one predictor, join the predictors with a +.
Histograms can only be drawn for numeric variables, including the subsetted groups such as Quiz.Scores.Male (shown later).
Boxplots
boxplot(data$Mathquiz ~ data$Gender)
boxplot(data$Mathquiz ~ data$Major)
boxplot(data$Mathquiz ~ data$Gender + data$Major)
Histograms
hist(data$Mathquiz)
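Plots are easier to read with a title and axis labels. The sketch below uses a made-up data frame called toy and passes it through the data argument, so the column names can be written without the $ sign. The label text is only an example.

```r
# Made-up data for plotting
toy <- data.frame(
  Gender   = factor(c("female", "female", "female", "male", "male", "male")),
  Mathquiz = c(43, 49, 26, 31, 29, 20)
)

# Boxplot with a title and axis labels; the result also stores the plotted numbers
box_stats <- boxplot(Mathquiz ~ Gender, data = toy,
                     main = "Math quiz scores by gender",
                     xlab = "Gender", ylab = "Math quiz score")

# Number of observations in each box
print(box_stats$n)

# Histogram with the same kind of labels
hist(toy$Mathquiz, main = "Math quiz scores", xlab = "Math quiz score")
```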
After we have examined the data with boxplots and histograms, we can begin to test the assumptions of a t-test.
The three assumptions of a t-test are…
Normality
Homogeneity of Variance
Independence
Normality means that the data in each group follow a normal distribution shape. It can be checked in three ways:
1. QQ plots
par(mfrow=c(1,2))
qqnorm(Quiz.Scores.Female)
qqnorm(Quiz.Scores.Male)
2. Histograms
par(mfrow=c(1,2))
hist(Quiz.Scores.Female)
hist(Quiz.Scores.Male)
3. Shapiro-Wilk Test
shapiro.test(Quiz.Scores.Female)
##
## Shapiro-Wilk normality test
##
## data: Quiz.Scores.Female
## W = 0.98534, p-value = 0.8141
shapiro.test(Quiz.Scores.Male)
##
## Shapiro-Wilk normality test
##
## data: Quiz.Scores.Male
## W = 0.94402, p-value = 0.05666
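For the Shapiro-Wilk test, the null hypothesis is that the data are normally distributed, so a p-value above .05 means we have no evidence that normality is violated. In the output above, both groups have p-values above .05 (0.814 for female students and 0.057 for male students). The sketch below shows how to pull the p-value out of the result and add a reference line to a QQ plot. It uses simulated scores, so the numbers will not match the guide's data.

```r
# Simulated scores (not the guide's data); set.seed makes the example repeatable
set.seed(123)
sim_scores <- rnorm(40, mean = 29, sd = 9)

# QQ plot with a reference line; points close to the line suggest normality
qqnorm(sim_scores)
qqline(sim_scores)

# Store the test result so the p-value can be used directly
sw <- shapiro.test(sim_scores)
print(sw$p.value)

# Decision at the .05 level
if (sw$p.value < 0.05) {
  message("Evidence that the data depart from normality")
} else {
  message("No evidence that the data depart from normality")
}
```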
Homogeneity of variance tests whether the variances of the groups we are comparing are similar to one another, enough that it’s reasonable to compare the two groups. This can be tested using Levene’s Test.
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
leveneTest(Mathquiz~Gender, data = data)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 0.3196 0.5734
## 83
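In the output above, the p-value is 0.5734, which is larger than .05, so we have no evidence that the variances differ between the groups. To show what the test is doing, here is a sketch in base R: with center = median, Levene's test amounts to a one-way ANOVA on how far each score is from its own group's median. The data frame toy and its values are made up, and this is meant as an illustration rather than a replacement for leveneTest.

```r
# Made-up scores for two groups
toy <- data.frame(
  score = c(43, 49, 26, 29, 31, 35, 20, 28, 30, 25, 27, 41),
  group = factor(rep(c("female", "male"), each = 6))
)

# Median of each score's own group, repeated for every row
group_medians <- ave(toy$score, toy$group, FUN = median)

# Absolute distance of each score from its group median
toy$abs_dev <- abs(toy$score - group_medians)

# One-way ANOVA on those distances: the F test here is the Levene-type test
levene_table <- anova(lm(abs_dev ~ group, data = toy))
print(levene_table)
```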
Independence does not have a particular test in R, but it is the statistical principle that the outcome of one event does not affect the outcome of another, meaning one observation is unrelated to all others. In the example of male versus female students’ math quiz grades, it is reasonable to assume independence is met, because the math quiz grade of one student is not likely to affect the grade of another student.
When running a t-test you must also assume independence within groups. In this example, we can assume that a male student’s quiz grade is independent of another male student’s grade, as well as of the female students’ grades.
Let’s run a t-test!
A t-test lets us compare how some numeric outcome may differ depending on groups we have in our sample. When conducting a t-test, we create two hypotheses.
The null hypothesis typically states that there are no differences in an outcome depending on groups in a sample. We test whether the data give us enough evidence to reject this hypothesis.
The alternative hypothesis typically states that there are differences in an outcome depending on groups in a sample. These can be in a specific direction (for example: female students score higher than male students on a math quiz), or non-specific in direction (for example: female and male students score differently on a math quiz).
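There are two ways to write the code for a t-test. The first uses a formula (outcome ~ group) and, by default, runs Welch’s t-test, which does not assume equal variances. The second takes the two subsetted groups directly; with var.equal = TRUE it runs the classic Student’s t-test, which assumes equal variances. This is why the degrees of freedom differ between the two outputs.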
t.test(Mathquiz~Gender, data = data)
##
## Welch Two Sample t-test
##
## data: Mathquiz by Gender
## t = 0.93624, df = 79.56, p-value = 0.352
## alternative hypothesis: true difference in means between group female and group male is not equal to 0
## 95 percent confidence interval:
## -2.179679 6.052019
## sample estimates:
## mean in group female mean in group male
## 29.93617 28.00000
t.test(Quiz.Scores.Female, Quiz.Scores.Male, var.equal = TRUE, paired = FALSE, alternative = "two.sided")
##
## Two Sample t-test
##
## data: Quiz.Scores.Female and Quiz.Scores.Male
## t = 0.93547, df = 83, p-value = 0.3523
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2.180438 6.052778
## sample estimates:
## mean of x mean of y
## 29.93617 28.00000
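The result of t.test can be saved to an object, which makes it easy to pull out single values such as the degrees of freedom or the p-value. The sketch below uses two made-up vectors, female and male, and also shows a directional (one-sided) alternative hypothesis. The object names welch, student, and one_sided are placeholders.

```r
# Made-up scores for two groups
female <- c(43, 49, 26, 29, 31, 35)
male   <- c(20, 28, 30, 25, 27, 41)

# Welch's t-test (the default): does not assume equal variances
welch <- t.test(female, male)

# Student's t-test: assumes equal variances
student <- t.test(female, male, var.equal = TRUE)

# Degrees of freedom: Welch's are adjusted, Student's are n1 + n2 - 2
print(welch$parameter)
print(student$parameter)

# Two-sided p-value
print(student$p.value)

# Directional alternative: female scores are higher than male scores
one_sided <- t.test(female, male, var.equal = TRUE, alternative = "greater")
print(one_sided$p.value)
```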
How to Interpret Results
The t-test gives us a t value (t), degrees of freedom (df), and a p-value. What we want to pay attention to is the p-value.
If the p-value is less than .05, we reject the null hypothesis that there are no differences in the outcome between the two groups being tested. In this example, we are testing whether there are differences in math quiz scores between male and female students.
Our p-value is 0.352, which is much larger than .05. We thus fail to reject the null hypothesis that there are no differences in math quiz scores between male and female students. Note that we say “fail to reject” rather than “accept” the null hypothesis: a non-significant result only means we did not find enough evidence of a difference, not that we have shown the groups are the same.
Let’s run an ANOVA!
An ANOVA lets us test how some numeric outcome differs based on 1) the groups in our sample (like a t-test) as well as on 2) a combination of group memberships. For example, the ANOVA we will be running tests how math quiz scores differ based on A) your gender, B) your major, and C) the combination of your major and gender.
aov <- aov(Mathquiz ~ Gender*Major, data = data)
summary(aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## Gender 1 79 78.77 0.950 0.3329
## Major 4 937 234.24 2.824 0.0306 *
## Gender:Major 4 314 78.50 0.947 0.4420
## Residuals 75 6220 82.93
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 15 observations deleted due to missingness
How to Interpret Results
For each effect (Gender, Major, and the Gender:Major interaction), the ANOVA table gives us degrees of freedom (Df), sums of squares (Sum Sq), mean squares (Mean Sq), an F value, and a p-value (Pr(>F)). What we want to pay attention to is the p-value.
There are three questions we want to answer by looking at the ANOVA table
1. Is there a main effect of gender on math quiz scores?
If the p-value is less than .05, we reject the null hypothesis that there are no differences in math quiz scores between male and female students, concluding that there are significant differences between genders.
Our p-value is 0.333, which is much larger than .05. We thus fail to reject the null hypothesis that there are no differences in math quiz scores between male and female students.
There are NOT significant gender differences in math quiz scores.
2. Is there a main effect of major on math quiz scores?
If the p-value is less than .05, we reject the null hypothesis that there are no differences in math quiz scores between biology, economics, pre-med, psychology, and sociology students, concluding that there are significant differences between majors.
Our p-value is 0.031, which is smaller than .05. We thus reject the null hypothesis that there are no differences in math quiz scores between biology, economics, pre-med, psychology, and sociology students.
There ARE significant differences in math quiz scores for different majors.
3. Is there an interaction between major and gender for math quiz scores?
If the p-value is less than .05, we reject the null hypothesis that there is no interaction, concluding that the difference in math quiz scores between male and female students depends on their major.
Our p-value is 0.442, which is much larger than .05. We thus fail to reject the null hypothesis that there is no interaction between gender and major.
There is NOT a significant interaction between gender and major in math quiz scores.
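The three p-values can also be pulled out of the ANOVA table with code instead of being read off by eye. The sketch below uses simulated data in a made-up data frame called toy, so the results will not match the guide's output. The object names fit, anova_table, and p_values are placeholders.

```r
# Simulated data: 2 genders x 3 majors, 4 students per combination
set.seed(42)
toy <- expand.grid(
  student_id = 1:4,
  Gender     = c("female", "male"),
  Major      = c("biology", "economics", "psychology")
)
toy$Mathquiz <- round(rnorm(nrow(toy), mean = 29, sd = 9))

# Fit the ANOVA; saving it as fit avoids reusing the function name aov
fit <- aov(Mathquiz ~ Gender * Major, data = toy)

# summary() returns a list; its first element is the ANOVA table
anova_table <- summary(fit)[[1]]

# Row names are padded with spaces, so trim them before using them
rownames(anova_table) <- trimws(rownames(anova_table))

# p-values for the two main effects and the interaction
effects  <- c("Gender", "Major", "Gender:Major")
p_values <- anova_table[effects, "Pr(>F)"]
names(p_values) <- effects
print(round(p_values, 3))

# TRUE where an effect is significant at the .05 level
print(p_values < 0.05)
```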
After an ANOVA, we can run post hoc pairwise comparisons (here, Tukey’s HSD test) to see where the differences are in our significant main effects.
TukeyHSD(aov)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Mathquiz ~ Gender * Major, data = data)
##
## $Gender
## diff lwr upr p adj
## male-female -1.93617 -5.893848 2.021508 0.3329047
##
## $Major
## diff lwr upr p adj
## economics-biology 11.457447 0.7289011 22.185993 0.0304225
## pre-med-biology 7.026342 -1.0334585 15.086143 0.1168628
## psychology-biology 5.169362 -2.5781154 12.916839 0.3450248
## sociology-biology 3.356383 -6.0299118 12.742678 0.8547629
## pre-med-economics -4.431104 -15.0071941 6.144985 0.7676734
## psychology-economics -6.288085 -16.6281375 4.051967 0.4402758
## sociology-economics -8.101064 -19.7198337 3.517706 0.3011213
## psychology-pre-med -1.856981 -9.3919248 5.677963 0.9583057
## sociology-pre-med -3.669959 -12.8816103 5.541691 0.7988697
## sociology-psychology -1.812979 -10.7526387 7.126681 0.9794418
##
## $`Gender:Major`
## diff lwr upr p adj
## male:biology-female:biology -9.4777778 -23.12164556 4.166090 0.4234401
## female:economics-female:biology 5.9666667 -13.58091360 25.514247 0.9917011
## male:economics-female:biology 7.1000000 -9.16456992 23.364570 0.9159535
## female:pre-med-female:biology 2.1888889 -11.45497889 15.832757 0.9999499
## male:pre-med-female:biology 2.4666667 -10.24793251 15.181266 0.9997549
## female:psychology-female:biology 1.1750000 -10.79539934 13.145399 0.9999993
## male:psychology-female:biology 0.4111111 -13.23275667 14.054979 1.0000000
## female:sociology-female:biology 0.1888889 -13.45497889 13.832757 1.0000000
## male:sociology-female:biology -3.3666667 -22.91424693 16.180914 0.9999088
## female:economics-male:biology 15.4444444 -4.35215962 35.241049 0.2630374
## male:economics-male:biology 16.5777778 0.01475049 33.140805 0.0496099
## female:pre-med-male:biology 11.6666667 -2.33164631 25.664980 0.1856965
## male:pre-med-male:biology 11.9444444 -1.14977835 25.038667 0.1038884
## female:psychology-male:biology 10.6527778 -1.72009977 23.025655 0.1528756
## male:psychology-male:biology 9.8888889 -4.10942409 23.887202 0.3990754
## female:sociology-male:biology 9.6666667 -4.33164631 23.664980 0.4320443
## male:sociology-male:biology 6.1111111 -13.68549296 25.907715 0.9910037
## male:economics-female:economics 1.1333333 -20.55275989 22.819427 1.0000000
## female:pre-med-female:economics -3.7777778 -23.57438185 16.018826 0.9997861
## male:pre-med-female:economics -3.5000000 -22.66797947 15.667979 0.9998515
## female:psychology-female:economics -4.7916667 -23.47430205 13.890969 0.9977289
## male:psychology-female:economics -5.5555556 -25.35215962 14.241049 0.9955129
## female:sociology-female:economics -5.7777778 -25.57438185 14.018826 0.9940055
## male:sociology-female:economics -9.3333333 -33.57912264 14.912456 0.9602528
## female:pre-med-male:economics -4.9111111 -21.47413840 11.651916 0.9932715
## male:pre-med-male:economics -4.6333333 -20.43965413 11.172987 0.9938112
## female:psychology-male:economics -5.9250000 -21.13911204 9.289112 0.9572678
## male:psychology-male:economics -6.6888889 -23.25191617 9.874138 0.9466255
## female:sociology-male:economics -6.9111111 -23.47413840 9.651916 0.9351649
## male:sociology-male:economics -10.4666667 -32.15275989 11.219427 0.8563221
## male:pre-med-female:pre-med 0.2777778 -12.81644501 13.372001 1.0000000
## female:psychology-female:pre-med -1.0138889 -13.38676643 11.358989 0.9999999
## male:psychology-female:pre-med -1.7777778 -15.77609076 12.220535 0.9999932
## female:sociology-female:pre-med -2.0000000 -15.99831298 11.998313 0.9999813
## male:sociology-female:pre-med -5.5555556 -25.35215962 14.241049 0.9955129
## female:psychology-male:pre-med -1.2916667 -12.63159625 10.048263 0.9999974
## male:psychology-male:pre-med -2.0555556 -15.14977835 11.038667 0.9999583
## female:sociology-male:pre-med -2.2777778 -15.37200057 10.816445 0.9999009
## male:sociology-male:pre-med -5.8333333 -25.00131280 13.334646 0.9918775
## male:psychology-female:psychology -0.7638889 -13.13676643 11.608989 1.0000000
## female:sociology-female:psychology -0.9861111 -13.35898865 11.386766 0.9999999
## male:sociology-female:psychology -4.5416667 -23.22430205 14.140969 0.9985004
## female:sociology-male:psychology -0.2222222 -14.22053520 13.776091 1.0000000
## male:sociology-male:psychology -3.7777778 -23.57438185 16.018826 0.9997861
## male:sociology-female:sociology -3.5555556 -23.35215962 16.241049 0.9998706
How to Interpret Results
Tukey’s output gives us the difference in means between each pair of groups (diff), the lower (lwr) and upper (upr) bounds of the 95% confidence interval for that difference, and the adjusted p-value (p adj). We mainly need to pay attention to the p-value: where it is less than .05, the two groups in that row differ significantly from one another.
First we have the main effect of gender, which we know is not significant between male and female students (p = 0.333).
Then we have the main effect of major, which we know is significant. Looking through the p adj column, the only pair that differs significantly is economics and biology (p adj = 0.030), with economics students scoring about 11.5 points higher than biology students.
Then we have the interaction comparisons. Because the overall interaction was not significant, we do not interpret these pairwise differences, even though one pair (male:economics-male:biology, p adj = 0.0496) falls just under .05.
This wraps up the R guide!!
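With many rows of output, it is easy to miss a significant pair. The sketch below shows one way to keep only the rows where p adj is below .05. It uses simulated data in a made-up data frame called toy, so the results will not match the guide's output. The object names fit, tukey_major, and significant are placeholders.

```r
# Simulated data: 3 majors with 8 students each and different true means
set.seed(42)
toy <- data.frame(
  Major    = factor(rep(c("biology", "economics", "psychology"), each = 8)),
  Mathquiz = c(rnorm(8, mean = 24, sd = 5),
               rnorm(8, mean = 36, sd = 5),
               rnorm(8, mean = 29, sd = 5))
)

# One-way ANOVA followed by Tukey's HSD for the Major factor
fit <- aov(Mathquiz ~ Major, data = toy)
tukey_major <- TukeyHSD(fit, "Major")$Major
print(tukey_major)

# Keep only the pairs with an adjusted p-value below .05
# drop = FALSE keeps the table shape even if only one row is left
significant <- tukey_major[tukey_major[, "p adj"] < 0.05, , drop = FALSE]
print(significant)
```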