Statistics in R

Author

Mark Pezzo

Published

June 12, 2023


Make sure you have these packages installed, and then . . .

library(tidyverse)
library(magrittr)
library(haven)
library(corrplot)

Upload your SPSS file into Posit Cloud

Viewer Window . . . Upload . . . Then follow instructions

I installed a file called Gloria Thesis Data.sav

Next, get the working directory that you’ve been using:

getwd()
[1] "/cloud/project"

Then use the result (in quotes) to set your directory so you can read in this file. You’d think R would already know, but it doesn’t. The directory will probably be the same as mine if you’re using Posit Cloud, but you might have to add a subfolder after project if you’ve got more than one project:

setwd("/cloud/project")

Then use the haven package command to read the .sav file and give it a short name

GTD <- read_sav("Gloria Thesis Data.sav")

To prove to yourself that GTD is a dataframe, type this in the console

is.data.frame(GTD)
[1] TRUE

Okay, good!

Now, there is a lot of missing data in this file that really should be zeros. Here’s how to replace them.

GTD[is.na(GTD)] <- 0

Let’s check to see that it worked.

view(GTD)

1 Summary Information

Let’s look at a summary of every variable in GTD. It’s not pretty, but it can be helpful as a first step. This study measured attitudes and judgments of risk concerning gun violence. There are 3 DVs, once concerning “risk to”overall” risk, one for risk to “self” and one for risk to someone you know, aka, “other.”

summary(GTD)
  IPAddress            Progress      Duration__in_seconds_
 Length:374         Min.   : 21.00   Min.   :  259.0      
 Class :character   1st Qu.:100.00   1st Qu.:  410.0      
 Mode  :character   Median :100.00   Median :  494.0      
                    Mean   : 99.66   Mean   : 1113.4      
                    3rd Qu.:100.00   3rd Qu.:  668.8      
                    Max.   :100.00   Max.   :69553.0      
  RecordedDate                    LocationLatitude   LocationLongitude 
 Min.   :2022-08-02 09:55:30.00   Length:374         Length:374        
 1st Qu.:2022-11-10 00:16:52.75   Class :character   Class :character  
 Median :2022-11-14 15:23:21.00   Mode  :character   Mode  :character  
 Mean   :2022-11-14 06:09:05.07                                        
 3rd Qu.:2022-11-17 12:16:12.00                                        
 Max.   :2022-11-22 12:09:18.00                                        
    MCAngry         MCFear        DVoverall          DVself      
 Min.   :0.00   Min.   :0.000   Min.   :  0.00   Min.   :  0.00  
 1st Qu.:1.00   1st Qu.:1.000   1st Qu.: 60.00   1st Qu.:  0.00  
 Median :3.00   Median :3.000   Median : 80.00   Median : 10.00  
 Mean   :3.42   Mean   :3.297   Mean   : 73.18   Mean   : 12.91  
 3rd Qu.:5.00   3rd Qu.:5.000   3rd Qu.:100.00   3rd Qu.: 20.00  
 Max.   :7.00   Max.   :7.000   Max.   :100.00   Max.   :100.00  
    DVother            GPS1            GPS2            GPS3      
 Min.   :  0.00   Min.   :0.000   Min.   :0.000   Min.   :0.000  
 1st Qu.:  0.00   1st Qu.:3.000   1st Qu.:3.000   1st Qu.:4.000  
 Median : 10.00   Median :4.000   Median :4.000   Median :4.000  
 Mean   : 18.85   Mean   :3.513   Mean   :3.203   Mean   :3.663  
 3rd Qu.: 30.00   3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:4.000  
 Max.   :100.00   Max.   :4.000   Max.   :4.000   Max.   :4.000  
      GPS4            GPS5            GPS6            GPS7      
 Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :0.000  
 1st Qu.:3.000   1st Qu.:1.000   1st Qu.:2.000   1st Qu.:3.000  
 Median :4.000   Median :2.000   Median :3.000   Median :4.000  
 Mean   :3.233   Mean   :2.273   Mean   :2.703   Mean   :3.324  
 3rd Qu.:4.000   3rd Qu.:3.000   3rd Qu.:3.000   3rd Qu.:4.000  
 Max.   :4.000   Max.   :4.000   Max.   :4.000   Max.   :4.000  
      GPS8           Atts1           Atts2           Atts3      
 Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :0.000  
 1st Qu.:3.000   1st Qu.:2.000   1st Qu.:2.000   1st Qu.:1.000  
 Median :4.000   Median :3.000   Median :2.000   Median :2.000  
 Mean   :3.578   Mean   :2.505   Mean   :2.393   Mean   :1.762  
 3rd Qu.:4.000   3rd Qu.:3.000   3rd Qu.:3.000   3rd Qu.:2.000  
 Max.   :4.000   Max.   :4.000   Max.   :4.000   Max.   :4.000  
     Atts4           Atts5           Atts6            HM1       
 Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :0.000  
 1st Qu.:2.000   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:5.000  
 Median :3.000   Median :2.000   Median :1.000   Median :7.000  
 Mean   :2.497   Mean   :1.949   Mean   :1.639   Mean   :5.888  
 3rd Qu.:3.000   3rd Qu.:3.000   3rd Qu.:2.000   3rd Qu.:7.000  
 Max.   :4.000   Max.   :4.000   Max.   :4.000   Max.   :7.000  
      HM2             HM3            HM4             HM5             HM6       
 Min.   :0.000   Min.   :0.00   Min.   :0.000   Min.   :0.000   Min.   :0.000  
 1st Qu.:3.250   1st Qu.:1.00   1st Qu.:2.000   1st Qu.:4.000   1st Qu.:3.000  
 Median :5.000   Median :2.00   Median :3.000   Median :5.000   Median :5.000  
 Mean   :4.628   Mean   :2.96   Mean   :3.433   Mean   :4.738   Mean   :4.567  
 3rd Qu.:6.000   3rd Qu.:4.75   3rd Qu.:5.000   3rd Qu.:6.000   3rd Qu.:6.000  
 Max.   :7.000   Max.   :7.00   Max.   :7.000   Max.   :7.000   Max.   :7.000  
 ExplA1_massshooting ExplA5_intrusion ExplA6_robbery  ExplF1_massshootings
 Min.   :0.000       Min.   :0.000    Min.   :0.000   Min.   :0.000       
 1st Qu.:7.000       1st Qu.:5.000    1st Qu.:5.000   1st Qu.:6.000       
 Median :7.000       Median :6.000    Median :6.000   Median :7.000       
 Mean   :6.401       Mean   :5.602    Mean   :5.513   Mean   :6.184       
 3rd Qu.:7.000       3rd Qu.:7.000    3rd Qu.:7.000   3rd Qu.:7.000       
 Max.   :7.000       Max.   :7.000    Max.   :7.000   Max.   :7.000       
 ExplF5_intrusion ExplF6_robbery  Race_Ethnicity      Gender     
 Min.   :0.000    Min.   :0.000   Min.   :0.000   Min.   :0.000  
 1st Qu.:5.000    1st Qu.:5.000   1st Qu.:3.000   1st Qu.:1.000  
 Median :6.000    Median :6.000   Median :5.000   Median :1.000  
 Mean   :5.586    Mean   :5.532   Mean   :3.987   Mean   :1.548  
 3rd Qu.:7.000    3rd Qu.:7.000   3rd Qu.:5.000   3rd Qu.:3.000  
 Max.   :7.000    Max.   :7.000   Max.   :8.000   Max.   :5.000  
   Education     MaritalStatus        Age        PoliticalOrientation
 Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :0.000       
 1st Qu.:1.000   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:2.000       
 Median :2.000   Median :1.000   Median :1.000   Median :2.000       
 Mean   :2.278   Mean   :1.061   Mean   :1.043   Mean   :2.316       
 3rd Qu.:3.000   3rd Qu.:1.000   3rd Qu.:1.000   3rd Qu.:3.000       
 Max.   :8.000   Max.   :4.000   Max.   :4.000   Max.   :4.000       
    Gunhome        Gunused        Condition    
 Min.   :0.00   Min.   :0.000   Min.   :0.000  
 1st Qu.:2.00   1st Qu.:1.000   1st Qu.:0.000  
 Median :2.00   Median :2.000   Median :2.000  
 Mean   :1.89   Mean   :1.583   Mean   :2.556  
 3rd Qu.:2.00   3rd Qu.:2.000   3rd Qu.:4.000  
 Max.   :4.00   Max.   :2.000   Max.   :6.000  

Since we don’t need a lot of these variables, let’s create a new smaller dataframe with just the variables we want:

GTD2 <- select(GTD, MCAngry,MCFear, DVoverall, DVother, DVself, Gunhome, Gunused)

And so we don’t have to type GTD2$ in front of every variable, here’s how to “attach” a dataframe so your computer knows this is where to look:

attach(GTD2)

2 Correlations

Now, let’s look at the correlation between ratings of anger and fear. If your data contain missing values, use the following R code to handle missing values by case-wise deletion. cor(x, method = “pearson”, use = “complete.obs”) BUT THESE MIGHT BE DEFAULT VALUES.

cor(x=GTD2$MCAngry, y=GTD2$MCFear, use = "pairwise.complete.obs", method = "pearson")
[1] 0.4737702

Oh, wait, I didn’t have to do that, did I, because I already attached GTD2?

cor(MCAngry, MCFear, use = "pairwise.complete.obs", method = "pearson")
[1] 0.4737702

And in fact, pairwise and pearson are the defaults (I THINK), so now it’s just this:

cor(MCAngry, MCFear)
[1] 0.4737702

Even better is a matrix of all the variables in the whole (new):

cor(GTD2)
             MCAngry     MCFear DVoverall      DVother      DVself     Gunhome
MCAngry   1.00000000 0.47377019 0.2159950  0.175834206 0.120375641 0.070372255
MCFear    0.47377019 1.00000000 0.1435961  0.168307092 0.149707859 0.084748823
DVoverall 0.21599501 0.14359609 1.0000000  0.347850751 0.213863782 0.129168919
DVother   0.17583421 0.16830709 0.3478508  1.000000000 0.624384689 0.004974703
DVself    0.12037564 0.14970786 0.2138638  0.624384689 1.000000000 0.025131128
Gunhome   0.07037225 0.08474882 0.1291689  0.004974703 0.025131128 1.000000000
Gunused   0.08878952 0.18030260 0.1610172 -0.019367963 0.009816944 0.199272887
               Gunused
MCAngry    0.088789516
MCFear     0.180302605
DVoverall  0.161017159
DVother   -0.019367963
DVself     0.009816944
Gunhome    0.199272887
Gunused    1.000000000

Well, that’s pretty ugly. To clean it up, let’s give this correlation matrix a name, and then round all the values to 2 decimals

myCorMatrix = cor(GTD2)
round(myCorMatrix,2)
          MCAngry MCFear DVoverall DVother DVself Gunhome Gunused
MCAngry      1.00   0.47      0.22    0.18   0.12    0.07    0.09
MCFear       0.47   1.00      0.14    0.17   0.15    0.08    0.18
DVoverall    0.22   0.14      1.00    0.35   0.21    0.13    0.16
DVother      0.18   0.17      0.35    1.00   0.62    0.00   -0.02
DVself       0.12   0.15      0.21    0.62   1.00    0.03    0.01
Gunhome      0.07   0.08      0.13    0.00   0.03    1.00    0.20
Gunused      0.09   0.18      0.16   -0.02   0.01    0.20    1.00
# (Yay, vectorization!)

You could also just type this:

round(cor(GTD2),2)
          MCAngry MCFear DVoverall DVother DVself Gunhome Gunused
MCAngry      1.00   0.47      0.22    0.18   0.12    0.07    0.09
MCFear       0.47   1.00      0.14    0.17   0.15    0.08    0.18
DVoverall    0.22   0.14      1.00    0.35   0.21    0.13    0.16
DVother      0.18   0.17      0.35    1.00   0.62    0.00   -0.02
DVself       0.12   0.15      0.21    0.62   1.00    0.03    0.01
Gunhome      0.07   0.08      0.13    0.00   0.03    1.00    0.20
Gunused      0.09   0.18      0.16   -0.02   0.01    0.20    1.00

Or if you prefer piping, this:

cor(GTD2) %>%
  round(digits = 2)
          MCAngry MCFear DVoverall DVother DVself Gunhome Gunused
MCAngry      1.00   0.47      0.22    0.18   0.12    0.07    0.09
MCFear       0.47   1.00      0.14    0.17   0.15    0.08    0.18
DVoverall    0.22   0.14      1.00    0.35   0.21    0.13    0.16
DVother      0.18   0.17      0.35    1.00   0.62    0.00   -0.02
DVself       0.12   0.15      0.21    0.62   1.00    0.03    0.01
Gunhome      0.07   0.08      0.13    0.00   0.03    1.00    0.20
Gunused      0.09   0.18      0.16   -0.02   0.01    0.20    1.00

Now check this out:

# I had to install and load a package called "corrplot" to run this.
corrplot(myCorMatrix)

I know, right?!! How about this?

palette = colorRampPalette(c("green", "white", "red")) (20)
heatmap(x = myCorMatrix, col = palette, symm = TRUE)

Yeah, I took it too far! I”m not sure what’s going on here either. So, never mind.

3 Scatter Plots

You want to make sure your variables are related in a linear fashion. A scatterplot can confirm this, or show that you actually have a curvilinear relationship.

#Here's one way to write the syntax.  I prefer this, because as you add more layers, it keeps the initial mapping separate.  And you will, eventually start adding more layers as you get better at this.
ggplot(GTD2, aes(DVself,DVother)) + geom_point()

#Here's another way.   
ggplot(GTD2) + geom_point(aes(DVself,DVoverall))

ggplot(GTD2) + geom_point(aes(DVother,DVoverall))

Now, let’s make sure our DVs don’t have anything funky going on. Let’s do a summary just on them:

summary(DVoverall)   
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00   60.00   80.00   73.18  100.00  100.00 
summary(DVself)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    0.00   10.00   12.91   20.00  100.00 
summary(DVother)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    0.00   10.00   18.85   30.00  100.00 

So, Overall has a median of 80, but the other two have a median of 10! We’d better check boxplots:

boxplot(DVoverall)

boxplot(DVself)

boxplot(DVother)

Now, let’s go back to our original dataset which had a variable called “Condition”
To do that, I have to detach GTD2, and then attach GTD.

detach(GTD2)
attach(GTD)

Condition is my IV, so let’s look at it closely and then do some box and violin plots Let’s see what we’ve got:

summary(Condition)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   0.000   2.000   2.556   4.000   6.000 
length(Condition)
[1] 374
table(Condition, MCAngry)
         MCAngry
Condition  0  1  2  3  4  5  6  7
        0  4 29 10  8  7 16 12 14
        1  0  5  0  6  6 13 10  5
        2  1  3  3  4  4 11 10 10
        3  0  3  2  4  5 16 11  5
        4  0 13 12  8  6  4  2  2
        5  0 41  2  2  1  1  0  0
        6  0 22  9  4  5  2  1  0
#I want to make sure it's treated as a factor so, . . . 

COND <- as.factor(Condition)

#Then I use the new COND in ggplot . . .

ggplot(GTD) + geom_boxplot(aes(x = COND, y=MCAngry, color = COND))

ggplot(GTD) + geom_violin(aes(x = COND, y=MCAngry, color = COND))

ggplot(GTD, aes(x = COND, y=MCAngry, color = COND)) + geom_boxplot() + geom_jitter(color="black", size=0.4, alpha=0.9)

ggplot(GTD, aes(x = COND, y=MCAngry, color = COND)) + geom_violin() + geom_jitter(color="black", size=0.4, alpha=0.9)

Okay, I don’t think that COND 0 should even be in there. Let’s get rid of it by filtering:

GTD %>% filter(COND > ‘0’ ) %>% ggplot(aes(x = COND, y=MCAngry, color = COND)) + geom_boxplot() + geom_jitter(color=“black”, size=0.4, alpha=0.9) ***NOT YET WORKING***

4 Paired t-test

Let’s see if people are more angry or more fearful about guns. First we’ll ignore what condition they’re in. . .

t.test(MCAngry, MCFear, paired=TRUE)

    Paired t-test

data:  MCAngry and MCFear
t = 1.0943, df = 373, p-value = 0.2745
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 -0.09801664  0.34400595
sample estimates:
mean difference 
      0.1229947 
summary(MCAngry)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    1.00    3.00    3.42    5.00    7.00 
summary(MCFear)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   1.000   3.000   3.297   5.000   7.000 

5 One-way ANOVA

ggplot(GTD, aes(x = COND, y=MCFear, color = COND)) + geom_boxplot() + geom_jitter(color="black", size=0.4, alpha=0.9)

one.way <- aov(MCFear ~ COND, data = GTD)

summary(one.way)
             Df Sum Sq Mean Sq F value Pr(>F)    
COND          6  404.4   67.41    20.9 <2e-16 ***
Residuals   367 1183.6    3.23                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

6 Two-way ANOVA

ggplot(GTD, aes(x = COND, y=MCAngry, color = as.factor(Gender))) + geom_boxplot() + geom_jitter(color="black", size=0.4, alpha=0.9)

two.way <- aov(MCAngry ~ COND + Gender, data = GTD)

summary(two.way)
             Df Sum Sq Mean Sq F value Pr(>F)    
COND          6  555.7   92.61  28.469 <2e-16 ***
Gender        1   10.7   10.74   3.302   0.07 .  
Residuals   366 1190.7    3.25                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

My god! Let’s clean this up by dropping the jittered points. . .

ggplot(GTD, aes(x = COND, y=MCAngry, color = as.factor(Gender))) + geom_boxplot()

And let’s try to filter out unusual gender values . . .

BROKEN: bill <- filter(GTD, (Gender == 1 | Gender == 3))

ggplot(GTD, aes(x = COND, y=MCAngry, color = as.factor(bill))) + geom_boxplot()