library(tidyverse)
library(magrittr)
library(haven)
library(corrplot)Statistics in R
Make sure you have these packages installed, and then . . .
Upload your SPSS file into Posit Cloud
Viewer Window . . . Upload . . . Then follow instructions
I installed a file called Gloria Thesis Data.sav
Next, get the working directory that you’ve been using:
getwd()[1] "/cloud/project"
Then use the result (in quotes) to set your directory so you can read in this file. You’d think R would already know, but it doesn’t. The directory will probably be the same as mine if you’re using Posit Cloud, but you might have to add a subfolder after project if you’ve got more than one project:
setwd("/cloud/project")Then use the haven package command to read the .sav file and give it a short name
GTD <- read_sav("Gloria Thesis Data.sav")To prove to yourself that GTD is a dataframe, type this in the console
is.data.frame(GTD)[1] TRUE
Okay, good!
Now, there is a lot of missing data in this file that really should be zeros. Here’s how to replace them.
GTD[is.na(GTD)] <- 0Let’s check to see that it worked.
view(GTD)1 Summary Information
Let’s look at a summary of every variable in GTD. It’s not pretty, but it can be helpful as a first step. This study measured attitudes and judgments of risk concerning gun violence. There are 3 DVs, once concerning “risk to”overall” risk, one for risk to “self” and one for risk to someone you know, aka, “other.”
summary(GTD) IPAddress Progress Duration__in_seconds_
Length:374 Min. : 21.00 Min. : 259.0
Class :character 1st Qu.:100.00 1st Qu.: 410.0
Mode :character Median :100.00 Median : 494.0
Mean : 99.66 Mean : 1113.4
3rd Qu.:100.00 3rd Qu.: 668.8
Max. :100.00 Max. :69553.0
RecordedDate LocationLatitude LocationLongitude
Min. :2022-08-02 09:55:30.00 Length:374 Length:374
1st Qu.:2022-11-10 00:16:52.75 Class :character Class :character
Median :2022-11-14 15:23:21.00 Mode :character Mode :character
Mean :2022-11-14 06:09:05.07
3rd Qu.:2022-11-17 12:16:12.00
Max. :2022-11-22 12:09:18.00
MCAngry MCFear DVoverall DVself
Min. :0.00 Min. :0.000 Min. : 0.00 Min. : 0.00
1st Qu.:1.00 1st Qu.:1.000 1st Qu.: 60.00 1st Qu.: 0.00
Median :3.00 Median :3.000 Median : 80.00 Median : 10.00
Mean :3.42 Mean :3.297 Mean : 73.18 Mean : 12.91
3rd Qu.:5.00 3rd Qu.:5.000 3rd Qu.:100.00 3rd Qu.: 20.00
Max. :7.00 Max. :7.000 Max. :100.00 Max. :100.00
DVother GPS1 GPS2 GPS3
Min. : 0.00 Min. :0.000 Min. :0.000 Min. :0.000
1st Qu.: 0.00 1st Qu.:3.000 1st Qu.:3.000 1st Qu.:4.000
Median : 10.00 Median :4.000 Median :4.000 Median :4.000
Mean : 18.85 Mean :3.513 Mean :3.203 Mean :3.663
3rd Qu.: 30.00 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.000
Max. :100.00 Max. :4.000 Max. :4.000 Max. :4.000
GPS4 GPS5 GPS6 GPS7
Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.000
1st Qu.:3.000 1st Qu.:1.000 1st Qu.:2.000 1st Qu.:3.000
Median :4.000 Median :2.000 Median :3.000 Median :4.000
Mean :3.233 Mean :2.273 Mean :2.703 Mean :3.324
3rd Qu.:4.000 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.:4.000
Max. :4.000 Max. :4.000 Max. :4.000 Max. :4.000
GPS8 Atts1 Atts2 Atts3
Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.000
1st Qu.:3.000 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:1.000
Median :4.000 Median :3.000 Median :2.000 Median :2.000
Mean :3.578 Mean :2.505 Mean :2.393 Mean :1.762
3rd Qu.:4.000 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.:2.000
Max. :4.000 Max. :4.000 Max. :4.000 Max. :4.000
Atts4 Atts5 Atts6 HM1
Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.000
1st Qu.:2.000 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:5.000
Median :3.000 Median :2.000 Median :1.000 Median :7.000
Mean :2.497 Mean :1.949 Mean :1.639 Mean :5.888
3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.:2.000 3rd Qu.:7.000
Max. :4.000 Max. :4.000 Max. :4.000 Max. :7.000
HM2 HM3 HM4 HM5 HM6
Min. :0.000 Min. :0.00 Min. :0.000 Min. :0.000 Min. :0.000
1st Qu.:3.250 1st Qu.:1.00 1st Qu.:2.000 1st Qu.:4.000 1st Qu.:3.000
Median :5.000 Median :2.00 Median :3.000 Median :5.000 Median :5.000
Mean :4.628 Mean :2.96 Mean :3.433 Mean :4.738 Mean :4.567
3rd Qu.:6.000 3rd Qu.:4.75 3rd Qu.:5.000 3rd Qu.:6.000 3rd Qu.:6.000
Max. :7.000 Max. :7.00 Max. :7.000 Max. :7.000 Max. :7.000
ExplA1_massshooting ExplA5_intrusion ExplA6_robbery ExplF1_massshootings
Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.000
1st Qu.:7.000 1st Qu.:5.000 1st Qu.:5.000 1st Qu.:6.000
Median :7.000 Median :6.000 Median :6.000 Median :7.000
Mean :6.401 Mean :5.602 Mean :5.513 Mean :6.184
3rd Qu.:7.000 3rd Qu.:7.000 3rd Qu.:7.000 3rd Qu.:7.000
Max. :7.000 Max. :7.000 Max. :7.000 Max. :7.000
ExplF5_intrusion ExplF6_robbery Race_Ethnicity Gender
Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.000
1st Qu.:5.000 1st Qu.:5.000 1st Qu.:3.000 1st Qu.:1.000
Median :6.000 Median :6.000 Median :5.000 Median :1.000
Mean :5.586 Mean :5.532 Mean :3.987 Mean :1.548
3rd Qu.:7.000 3rd Qu.:7.000 3rd Qu.:5.000 3rd Qu.:3.000
Max. :7.000 Max. :7.000 Max. :8.000 Max. :5.000
Education MaritalStatus Age PoliticalOrientation
Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.000
1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:2.000
Median :2.000 Median :1.000 Median :1.000 Median :2.000
Mean :2.278 Mean :1.061 Mean :1.043 Mean :2.316
3rd Qu.:3.000 3rd Qu.:1.000 3rd Qu.:1.000 3rd Qu.:3.000
Max. :8.000 Max. :4.000 Max. :4.000 Max. :4.000
Gunhome Gunused Condition
Min. :0.00 Min. :0.000 Min. :0.000
1st Qu.:2.00 1st Qu.:1.000 1st Qu.:0.000
Median :2.00 Median :2.000 Median :2.000
Mean :1.89 Mean :1.583 Mean :2.556
3rd Qu.:2.00 3rd Qu.:2.000 3rd Qu.:4.000
Max. :4.00 Max. :2.000 Max. :6.000
Since we don’t need a lot of these variables, let’s create a new smaller dataframe with just the variables we want:
GTD2 <- select(GTD, MCAngry,MCFear, DVoverall, DVother, DVself, Gunhome, Gunused)And so we don’t have to type GTD2$ in front of every variable, here’s how to “attach” a dataframe so your computer knows this is where to look:
attach(GTD2)2 Correlations
Now, let’s look at the correlation between ratings of anger and fear. If your data contain missing values, use the following R code to handle missing values by case-wise deletion. cor(x, method = “pearson”, use = “complete.obs”) BUT THESE MIGHT BE DEFAULT VALUES.
cor(x=GTD2$MCAngry, y=GTD2$MCFear, use = "pairwise.complete.obs", method = "pearson")[1] 0.4737702
Oh, wait, I didn’t have to do that, did I, because I already attached GTD2?
cor(MCAngry, MCFear, use = "pairwise.complete.obs", method = "pearson")[1] 0.4737702
And in fact, pairwise and pearson are the defaults (I THINK), so now it’s just this:
cor(MCAngry, MCFear)[1] 0.4737702
Even better is a matrix of all the variables in the whole (new):
cor(GTD2) MCAngry MCFear DVoverall DVother DVself Gunhome
MCAngry 1.00000000 0.47377019 0.2159950 0.175834206 0.120375641 0.070372255
MCFear 0.47377019 1.00000000 0.1435961 0.168307092 0.149707859 0.084748823
DVoverall 0.21599501 0.14359609 1.0000000 0.347850751 0.213863782 0.129168919
DVother 0.17583421 0.16830709 0.3478508 1.000000000 0.624384689 0.004974703
DVself 0.12037564 0.14970786 0.2138638 0.624384689 1.000000000 0.025131128
Gunhome 0.07037225 0.08474882 0.1291689 0.004974703 0.025131128 1.000000000
Gunused 0.08878952 0.18030260 0.1610172 -0.019367963 0.009816944 0.199272887
Gunused
MCAngry 0.088789516
MCFear 0.180302605
DVoverall 0.161017159
DVother -0.019367963
DVself 0.009816944
Gunhome 0.199272887
Gunused 1.000000000
Well, that’s pretty ugly. To clean it up, let’s give this correlation matrix a name, and then round all the values to 2 decimals
myCorMatrix = cor(GTD2)
round(myCorMatrix,2) MCAngry MCFear DVoverall DVother DVself Gunhome Gunused
MCAngry 1.00 0.47 0.22 0.18 0.12 0.07 0.09
MCFear 0.47 1.00 0.14 0.17 0.15 0.08 0.18
DVoverall 0.22 0.14 1.00 0.35 0.21 0.13 0.16
DVother 0.18 0.17 0.35 1.00 0.62 0.00 -0.02
DVself 0.12 0.15 0.21 0.62 1.00 0.03 0.01
Gunhome 0.07 0.08 0.13 0.00 0.03 1.00 0.20
Gunused 0.09 0.18 0.16 -0.02 0.01 0.20 1.00
# (Yay, vectorization!)You could also just type this:
round(cor(GTD2),2) MCAngry MCFear DVoverall DVother DVself Gunhome Gunused
MCAngry 1.00 0.47 0.22 0.18 0.12 0.07 0.09
MCFear 0.47 1.00 0.14 0.17 0.15 0.08 0.18
DVoverall 0.22 0.14 1.00 0.35 0.21 0.13 0.16
DVother 0.18 0.17 0.35 1.00 0.62 0.00 -0.02
DVself 0.12 0.15 0.21 0.62 1.00 0.03 0.01
Gunhome 0.07 0.08 0.13 0.00 0.03 1.00 0.20
Gunused 0.09 0.18 0.16 -0.02 0.01 0.20 1.00
Or if you prefer piping, this:
cor(GTD2) %>%
round(digits = 2) MCAngry MCFear DVoverall DVother DVself Gunhome Gunused
MCAngry 1.00 0.47 0.22 0.18 0.12 0.07 0.09
MCFear 0.47 1.00 0.14 0.17 0.15 0.08 0.18
DVoverall 0.22 0.14 1.00 0.35 0.21 0.13 0.16
DVother 0.18 0.17 0.35 1.00 0.62 0.00 -0.02
DVself 0.12 0.15 0.21 0.62 1.00 0.03 0.01
Gunhome 0.07 0.08 0.13 0.00 0.03 1.00 0.20
Gunused 0.09 0.18 0.16 -0.02 0.01 0.20 1.00
Now check this out:
# I had to install and load a package called "corrplot" to run this.
corrplot(myCorMatrix)I know, right?!! How about this?
palette = colorRampPalette(c("green", "white", "red")) (20)
heatmap(x = myCorMatrix, col = palette, symm = TRUE)Yeah, I took it too far! I”m not sure what’s going on here either. So, never mind.
3 Scatter Plots
You want to make sure your variables are related in a linear fashion. A scatterplot can confirm this, or show that you actually have a curvilinear relationship.
#Here's one way to write the syntax. I prefer this, because as you add more layers, it keeps the initial mapping separate. And you will, eventually start adding more layers as you get better at this.
ggplot(GTD2, aes(DVself,DVother)) + geom_point()#Here's another way.
ggplot(GTD2) + geom_point(aes(DVself,DVoverall))ggplot(GTD2) + geom_point(aes(DVother,DVoverall))Now, let’s make sure our DVs don’t have anything funky going on. Let’s do a summary just on them:
summary(DVoverall) Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 60.00 80.00 73.18 100.00 100.00
summary(DVself) Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 0.00 10.00 12.91 20.00 100.00
summary(DVother) Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 0.00 10.00 18.85 30.00 100.00
So, Overall has a median of 80, but the other two have a median of 10! We’d better check boxplots:
boxplot(DVoverall)boxplot(DVself)boxplot(DVother)Now, let’s go back to our original dataset which had a variable called “Condition”
To do that, I have to detach GTD2, and then attach GTD.
detach(GTD2)
attach(GTD)Condition is my IV, so let’s look at it closely and then do some box and violin plots Let’s see what we’ve got:
summary(Condition) Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 0.000 2.000 2.556 4.000 6.000
length(Condition)[1] 374
table(Condition, MCAngry) MCAngry
Condition 0 1 2 3 4 5 6 7
0 4 29 10 8 7 16 12 14
1 0 5 0 6 6 13 10 5
2 1 3 3 4 4 11 10 10
3 0 3 2 4 5 16 11 5
4 0 13 12 8 6 4 2 2
5 0 41 2 2 1 1 0 0
6 0 22 9 4 5 2 1 0
#I want to make sure it's treated as a factor so, . . .
COND <- as.factor(Condition)
#Then I use the new COND in ggplot . . .
ggplot(GTD) + geom_boxplot(aes(x = COND, y=MCAngry, color = COND))ggplot(GTD) + geom_violin(aes(x = COND, y=MCAngry, color = COND))ggplot(GTD, aes(x = COND, y=MCAngry, color = COND)) + geom_boxplot() + geom_jitter(color="black", size=0.4, alpha=0.9)ggplot(GTD, aes(x = COND, y=MCAngry, color = COND)) + geom_violin() + geom_jitter(color="black", size=0.4, alpha=0.9)Okay, I don’t think that COND 0 should even be in there. Let’s get rid of it by filtering:
GTD %>% filter(COND > ‘0’ ) %>% ggplot(aes(x = COND, y=MCAngry, color = COND)) + geom_boxplot() + geom_jitter(color=“black”, size=0.4, alpha=0.9) ***NOT YET WORKING***
4 Paired t-test
Let’s see if people are more angry or more fearful about guns. First we’ll ignore what condition they’re in. . .
t.test(MCAngry, MCFear, paired=TRUE)
Paired t-test
data: MCAngry and MCFear
t = 1.0943, df = 373, p-value = 0.2745
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
-0.09801664 0.34400595
sample estimates:
mean difference
0.1229947
summary(MCAngry) Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 1.00 3.00 3.42 5.00 7.00
summary(MCFear) Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 1.000 3.000 3.297 5.000 7.000
5 One-way ANOVA
ggplot(GTD, aes(x = COND, y=MCFear, color = COND)) + geom_boxplot() + geom_jitter(color="black", size=0.4, alpha=0.9)one.way <- aov(MCFear ~ COND, data = GTD)
summary(one.way) Df Sum Sq Mean Sq F value Pr(>F)
COND 6 404.4 67.41 20.9 <2e-16 ***
Residuals 367 1183.6 3.23
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
6 Two-way ANOVA
ggplot(GTD, aes(x = COND, y=MCAngry, color = as.factor(Gender))) + geom_boxplot() + geom_jitter(color="black", size=0.4, alpha=0.9)two.way <- aov(MCAngry ~ COND + Gender, data = GTD)
summary(two.way) Df Sum Sq Mean Sq F value Pr(>F)
COND 6 555.7 92.61 28.469 <2e-16 ***
Gender 1 10.7 10.74 3.302 0.07 .
Residuals 366 1190.7 3.25
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
My god! Let’s clean this up by dropping the jittered points. . .
ggplot(GTD, aes(x = COND, y=MCAngry, color = as.factor(Gender))) + geom_boxplot()And let’s try to filter out unusual gender values . . .
BROKEN: bill <- filter(GTD, (Gender == 1 | Gender == 3))
ggplot(GTD, aes(x = COND, y=MCAngry, color = as.factor(bill))) + geom_boxplot()