The Basics

This cheat sheet collects commands that I either find myself googling frequently or that are generally good to know for a variety of analyses! I will likely update it over time.

Reading In & Outputting Data:

Read in a delimited text file

df <- read.table('file.txt')
#if you have a missing code, can do:
df <- read.table('file.txt', na.strings = "-99")
#if you have multiple missing codes, can do:
df <- read.table('file.txt', na.strings=c("-98","-99", "-999"))
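
To see what na.strings actually does, here is a small self-contained sketch. The file contents and the column names (id, score) are made up for illustration, and tempfile() is only used so nothing gets written to your working directory.

```r
# Write a tiny tab-delimited file to a temporary location (made-up data)
tmp <- tempfile(fileext = ".txt")
writeLines(c("id\tscore", "1\t10", "2\t-99", "3\t-999"), tmp)

# header = TRUE: the first line holds the column names
# sep = "\t": the file is tab-delimited
df <- read.table(tmp, header = TRUE, sep = "\t",
                 na.strings = c("-98", "-99", "-999"))

df$score         # the missing codes are now NA
sum(is.na(df))   # number of missing cells
```

The same na.strings argument should also work with read.csv(), since it passes its arguments on to read.table().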

Similarly, read in a csv file

df <- read.csv('file.csv')

Write OUT a delimited text file

write.table(df, 'file.txt') #saves this to your working directory

Write OUT a csv file

write.csv(df, 'file.csv')

Handling SPSS Files (Using the foreign package):

Import an SPSS data file into R:

library(foreign)
spss.dat<- read.spss(spss.filename, use.value.labels=TRUE, to.data.frame=TRUE,
          max.value.labels=Inf, trim.factor.names=FALSE)

Write an R dataset to an SPSS file using ‘foreign’:

library(foreign)
write.foreign(df, #dataframe name in R
              "mydata.txt", #plain-text data file that will be written
              "mydata1.sps", #SPSS syntax file that will be written (run it in SPSS to load the data)
              package="SPSS")

Checking Data Structure

Checking Structure/Dimensions of a DF:

Check the structure/dimensions of a df, or the class of one of its variables

str(dat) #check structure of data
dim(dat) #check the dimensions of data (number of rows, number of columns)
class(dat$vbl) #check what class a vbl is

Print First or Last 10 Rows of the Data

head(mydata, n=10) #first
tail(mydata, n=10) #last

Quick Sorting of Data:

#Sort dataset by ID
df[order(df$ID),]

#Sort dataset by Item1 and Item2 
df[order(df$Item1,df$Item2),]

#Sort dataset by Item1 (ascending) and Item2 (descending)
df[order(df$Item1, -df$Item2),]
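
The minus sign trick above only works for numeric columns. For a character (or factor) column, one option is to wrap it in xtfrm() first. A minimal sketch, with a made-up data frame and a made-up Name column:

```r
# Made-up data: Item1 is numeric, Name is character
df <- data.frame(ID = 1:4,
                 Item1 = c(2, 1, 2, 1),
                 Name = c("b", "a", "c", "d"),
                 stringsAsFactors = FALSE)

# -df$Name would fail because text cannot be negated;
# xtfrm() converts the column to sortable numbers first
sorted <- df[order(df$Item1, -xtfrm(df$Name)), ]
sorted$ID   # 4 2 3 1
```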

Other Helpful Data Commands:

Re-naming Variables in a Dataframe:

View Names in df

names(df) 
#or
dput(names(df))

Re-name all names in dataset, generally speaking

names(df)<- c("whatever1", "whatever2","whatever3")

Re-name only Row Names:

# For Row Names
rownames(df)<- c("v1","v2","v3")
#or more simply
rownames(df)<- paste0("V", 1:3)

Re-Name only Columns

colnames(df)<- c("var1","var2","var3")

Changing The Class of Variables in R:

Setting a Variable as Ordinal/Numeric/Categorical

#these return a converted copy; assign it back (e.g., df$x <- as.factor(df$x)) to keep the change
as.ordered(df$x) #x = some variable 
as.numeric(df$x)
as.factor(df$x)

Change the level names of a categorical variable

data$vbl.name<- factor(data$vbl.name, levels = c(0,1), labels = c("Male","Female"))

Change ALL variables to be all numeric:

new.df <- data.frame(lapply(old.df, function(x) as.numeric(as.character(x))))
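
The as.character() step matters when a column is a factor: calling as.numeric() directly on a factor returns the internal level codes, not the values you see printed. A quick made-up example:

```r
# Made-up factor whose labels look like numbers
f <- factor(c("10", "20", "10", "30"))

codes  <- as.numeric(f)                 # internal level codes: 1 2 1 3
values <- as.numeric(as.character(f))   # the actual values: 10 20 10 30

codes
values
```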

Descriptives & Frequencies

Base R - Descriptive/Frequency Commands:

# x = some variable

sd(df$x)                  # Calculate standard deviation
range(df$x)               # quickly get min and max of a variable
summary(df$x)            # Returns a summary of x: mean, min, max etc.
t.test(df$x)              # One-sample Student's t-test (tests the mean of x against 0 by default)
var(df$x)                 # Calculate variance
density(df$x)            # Compute kernel density estimates
table(df$Gender)        # counts for gender categories
table(df$`Marital Status`, df$Gender) # cross-classification counts for 'marital status' by 'gender'

Alternative Descriptives Option (using the Rmisc package):

library(Rmisc)
Descriptives<- summarySE(df, measurevar="vblofinterest", groupvars=c("gender","age"))

Descriptives Option #2 (using Psych package):

library(psych)
describeBy(df$vbl, group = df$groupvar)
describe(df$vbl) # no group

Correlation Test With 2 Variables:

cor.test(data$x, data$y, method=c("pearson")) #can be pearson, kendall, or spearman

Create Quick Correlation Matrix using ‘rcorr’ in the Hmisc package:

library(Hmisc)
# making matrix of values here
x <- c(-2, -1, 0, 1, 2)
y <- c(4,   1, 0, 1, 4)
z <- c(1,   2, 3, 4, 2)
v <- c(1,   2, 0, 4, 5)
#combine columns and create correlation matrix
rcorr(cbind(x,y,z,v))

##      x     y     z    v
## x 1.00  0.00  0.55 0.76
## y 0.00  1.00 -0.70 0.39
## z 0.55 -0.70  1.00 0.23
## v 0.76  0.39  0.23 1.00
## 
## n= 5 
## 
## 
## P
##   x      y      z      v     
## x        1.0000 0.3318 0.1339
## y 1.0000        0.1852 0.5203
## z 0.3318 0.1852        0.7065
## v 0.1339 0.5203 0.7065

Missing Data Checks

View & Delete Missingness - (this does listwise deletion):

sum(is.na(df)) #easy way to see how many 'NAs' are in dataset
sum(is.na(df$x)) #or similarly, check for a particular variable

# Omit any missing data entries
clean.df<-na.omit(df)

Function to change missing values to ‘NA’ (useful if you forgot to tell R what your missing values were when you read in your data)

fix_missing <- function(x) {
  x[x == -99] <- NA
  x
}
df[] <- lapply(df, fix_missing)
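
If you have more than one missing code, the same idea can be extended with %in%. This is just a sketch: fix_missing_codes is a hypothetical helper name, and the codes and data are made up.

```r
# Hypothetical helper: treat several codes as missing at once
fix_missing_codes <- function(x, codes = c(-98, -99, -999)) {
  x[x %in% codes] <- NA
  x
}

# Made-up data with two different missing codes
df <- data.frame(a = c(1, -99, 3), b = c(-999, 5, 6))

# df[] keeps df a data frame while replacing every column
df[] <- lapply(df, fix_missing_codes)
df
```

Assigning into df[] (rather than df) is what keeps the result a data frame instead of a plain list.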

Other Missing Data Descriptives:

Look at the proportion missing for each variable (multiply by 100 for a percent):

pm <- colMeans(is.na(missData))
pm

Compute the covariance coverage using the md.pairs() in the mice package

library(mice)
library(mitools)
cover <- md.pairs(missData)$rr / nrow(missData)
cover

hist(cover)
range(cover)

# Compute the unique response patterns:
pats <- md.pattern(missData)
pats

Subsetting Data in R

Typical Ways to Subset Data:

1.) my_df[1:3] (no comma) will subset my_df, returning the first three columns as a data frame.

2.) my_df[1:3, ] (with comma, numbers to left of the comma) will subset my_df and return the first three rows as a data frame.

3.) my_df[, 1:3] (with comma, numbers to right of the comma) will subset my_df and return the first three columns as a data frame, the same as my_df[1:3].

Easy Example to Physically See How Subsetting Works in R:

For subsetting rows:

# creating a fake df
df <- data.frame(x = 1:4, y = 4:1, z = letters[1:4]) 

# Subsetting by row '3'
df[df$x == 3, ]
##   x y z
## 3 3 2 c

# Calling Row 1 & 3
df[c(1, 3), ]
##   x y z
## 1 1 4 a
## 3 3 2 c

For subsetting columns:

# There are two ways to select columns from a data frame: 

  # 1. Like a list
new.df <- df[c("x", "z")] #based on vbl names
new.df <- df[c(1:3)] # based on vbl columns numbers, bracket indicates we're subsetting column 1 through 3.
  
  # 2. Like a matrix
new.df <-df[, c("x", "z")]
new.df <-df[, c(1:3)]


# There's an important difference if you select a SINGLE column...
# ...matrix subsetting simplifies by default, list subsetting does not.

str(df["x"])
## 'data.frame':    4 obs. of  1 variable:
##  $ x: int  1 2 3 4
#  VS
str(df[, "x"])
##  int [1:4] 1 2 3 4
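
If you like the matrix-style syntax but still want a data frame back for a single column, adding drop = FALSE should do it. A small sketch using the same fake df:

```r
df <- data.frame(x = 1:4, y = 4:1, z = letters[1:4])

# drop = FALSE keeps the data frame structure for a single column
one.col <- df[, "x", drop = FALSE]
class(one.col)   # "data.frame"
dim(one.col)     # 4 rows, 1 column
```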

Quick Subsetting of Rows & Columns Together:

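# create a 20 x 10 matrix of random draws from a N(10,1).
# Note: only 100 values are drawn, so R recycles them to fill all 200 cells
# (columns 6-10 repeat columns 1-5); use n=200 for 200 independent draws.
fake.dat<-matrix(rnorm(n=100,mean=10,sd=1),ncol=10, nrow=20)
#subset rows 1-6 and columns 1-4
new.df<-fake.dat[1:6, 1:4]
print(new.df)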

##           [,1]      [,2]      [,3]      [,4]
## [1,]  9.254785 12.614388  9.346585 10.090685
## [2,]  9.318042  8.092890 10.899995 10.440230
## [3,] 11.330883  9.718729 12.053058  9.108824
## [4,] 10.699723  9.391665 11.243303  7.814188
## [5,] 10.043442  8.910559 10.507499 10.974342
## [6,]  9.537012  9.758896 10.937656  9.163288

We can see that we successfully selected rows 1-6 and columns 1-4.
If you want to select rows or columns that are not right next to each other, you can do this:

fake.dat[c(1,4,9), c(1,7,8)] #this selects rows 1,4, & 9 and columns 1, 7, & 8.

##           [,1]      [,2]      [,3]
## [1,]  9.254785 12.614388  9.346585
## [2,] 10.699723  9.391665 11.243303
## [3,]  9.612798  8.997450 11.177066

Delete a Column From a Dataset:

3 ways to delete a column quickly:

new.data<- df[colnames(df)!="vblname.to.exclude"]

# Can also do this:
df$vbl.to.exclude<- NULL

#OR if you know column(s) you want to remove, can do this:
new.data<- df[-c(2:4)] #this removes columns 2:4

Subsetting Based on Specific Components of the Dataframe:

Subsetting based on a level of a factor - will subset all of the data and include only that specific ‘level’ of a categorical variable.

new.data<- subset(df, factor.vbl == 1) #factor.vbl = the name of your factor; 1 = whatever the level you're interested in is called.

Subsetting based on specific values of a variable

#Select cases only where Vbl1 >= 20 and Vbl2 > 10 + select a few other variables for a new dataset.
#(inside subset() you can use the column names directly, no df$ needed)
newdata <- subset(df, Vbl1 >= 20 & Vbl2 > 10,
                  select=c(Vbl1, Vbl2, gender, ID)) # These are all the variables you want to include in the new df

Another example for subsetting based on specific values of a variable

#Select only cases where gender is 0 (male), and 'depression' is greater than 10. Also select all variables from 'Anxiety' to 'PTSD'
newdata2 <- subset(df, gender=="0" & depression>10, select=c(Anxiety:PTSD))

Subsetting THIS condition OR that condition

#If we wanted all rows where either d == "A" or a > 0.5, use the OR operator (|):
my_df[which(my_df$d == "A" | my_df$a > 0.5),]
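
The which() wrapper is worth keeping when your data has missing values: a condition that evaluates to NA returns a row full of NAs with plain logical indexing, while which() keeps only the TRUE positions. A sketch with made-up data:

```r
# Made-up data with a missing value in column a
my_df <- data.frame(a = c(0.9, NA, 0.2), d = c("B", "B", "A"),
                    stringsAsFactors = FALSE)

cond <- my_df$d == "A" | my_df$a > 0.5   # TRUE NA TRUE

n.plain <- nrow(my_df[cond, ])          # 3: the NA condition yields an all-NA row
n.which <- nrow(my_df[which(cond), ])   # 2: which() keeps only the TRUE positions
c(n.plain, n.which)
```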

Alternative Methods for Quick Subsetting (using package ‘dplyr’):

dplyr is nice if you want to subset from your original dataset based on a set of variables that are similarly named. For example:

library(dplyr)
# Variable starts with...
df_pract<-select(data, starts_with("var")) #example would be "var1","var2","var3", etc.
# Variable ends with...
df_pract<-select(data, ends_with("14"))

Recoding & Re-Scaling Variables:

Quick Way to Manually Re-Code ALL Variables in a df (using the package ‘car’):

To demonstrate, we first make some fake data.

In the ‘survey’ data frame, ‘var1’ is 20 draws from the integers 1:5 with replacement.

library(car)
survey <- data.frame("var1" = sample(x = 1:5, size = 20, replace = TRUE),
                     "var2" = rep(1:5, each=4),
                     "age" =  sample(x = 10:45, size = 20, replace = TRUE))

Now, let’s just change var1 to have different values (currently has values 1-5)

Brackets allow subsetting of columns…could also do ‘survey$var1’
This will change all ‘5’ in column 1 (var1) to be 6, and all ‘1s’ to be ‘1.5.’

survey$var1<- car::recode(survey[[1]], "5=6;1=1.5") #recode() is vectorized; wrapping it in lapply() would turn var1 into a list

If we wanted to recode all values in all columns of the entire survey dataset, we could do so like this:

library(car)
recode.cols<- apply(survey, 2, function(x) {x<- recode(x, "5=6;1=1.5"); x})
head(recode.cols)

##      var1 var2 age
## [1,]    6  1.5  45
## [2,]    4  1.5  19
## [3,]    2  1.5  19
## [4,]    2  1.5  43
## [5,]    3  2.0  21
## [6,]    3  2.0  20

# to re-code row by row instead, just change the '2' to a '1' (note the result comes back transposed: one column per original row)
recode.rows<- apply(survey, 1, function(x) {x<- recode(x, "5=6;1=1.5"); x})
head(recode.rows)

##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
## [1,]  6.0  4.0  2.0  2.0    3    3    3    3    2     3   1.5     3   1.5
## [2,]  1.5  1.5  1.5  1.5    2    2    2    2    3     3   3.0     3   4.0
## [3,] 45.0 19.0 19.0 43.0   21   20   21   32   24    17  13.0    15  35.0
##      [,14] [,15] [,16] [,17] [,18] [,19] [,20]
## [1,]   1.5     6     2     3     3     2     2
## [2,]   4.0     4     4     6     6     6     6
## [3,]  45.0    11    27    13    40    40    41

Rescaling a Variable to be between 0-1 (using the ‘scales’ package):

Use same ‘survey’ dataset from above:

library(scales)
survey$new.vbl1 <- rescale(survey$var1) 
head(survey$new.vbl1)
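
For reference, with its default settings rescale() maps the smallest value to 0 and the largest to 1. The same min-max scaling can be sketched by hand in base R (made-up scores below), which is handy if you don't want the extra package:

```r
# Made-up scores
x <- c(1, 2, 3, 5)

# Min-max scaling by hand: (x - min) / (max - min)
x.min <- min(x, na.rm = TRUE)
x.max <- max(x, na.rm = TRUE)
x01 <- (x - x.min) / (x.max - x.min)
x01   # 0.00 0.25 0.50 1.00
```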

Creating Sum Scores:

# Sum score across columns 1-9 (one score per row)
# na.rm = TRUE ---> ignores missing values in computations
data$sum <- rowSums(data[c(1:9)], na.rm=TRUE)

#or create a sum score using the specific variable names
rowSums(data[,c('vbl1', 'vbl2', 'vbl3')], na.rm=TRUE)

# if you wanted the column sums, could also do 'colSums'
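
One thing to watch with na.rm = TRUE: a person who skipped every item gets a sum score of 0 rather than NA. A sketch with made-up item responses, plus one possible way to flag those rows:

```r
# Made-up item responses; the third person skipped every item
data <- data.frame(vbl1 = c(1, 2, NA), vbl2 = c(3, NA, NA))

rowSums(data, na.rm = TRUE)    # 4 2 0   (all-missing row becomes 0)
rowMeans(data, na.rm = TRUE)   # 2 2 NaN (all-missing row is flagged)

# One option: set the sum to NA when every item is missing
sum.score <- rowSums(data, na.rm = TRUE)
sum.score[rowSums(!is.na(data)) == 0] <- NA
sum.score                      # 4 2 NA
```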

Plots

Quick Base R Plots:

Using same dataset that was generated above (i.e., ‘survey’) to demonstrate plots!

Basic Histogram

hist(survey$age)

Scatterplot

scatter.smooth(survey$age, survey$var1, xlab="age", ylab="job satisfaction")

Default Plot

plot(survey$age, survey$var1, main="Plot Title Here", xlab="Age", ylab="Job Satisfaction") #basic plot with title and x/y labels

Boxplot(s)

Boxplot with 1 variable

boxplot(survey$age, col="purple")

Boxplot with 2 variables

survey$var2.fact<-as.factor(survey$var2) #changing to a factor to work w/ boxplot
boxplot(age ~ var2.fact, data=survey, col="blue")

Interaction Plot

Order of the arguments is: x.factor, trace.factor (usually the grouping vbl if there is one), response (y).

#interaction plot - interaction of variable of interest x age by gender
interaction.plot(main.data$age,main.data$gender,main.data$othervbl,
                 xlab="Age",ylab="vblname",trace.label = "Gender")

Stacking multiple plots within the same window

# for stacking plots - set the two numbers to match the layout of plots you want
par(mfrow=c(1,2)) # num of rows and num of columns
hist(x, xlab="x axis title", ylab="y axis title", main="graph title")
hist(y, xlab="x axis title", ylab="y axis title", main="graph title2")

Plots Using ggplot2:

Histogram with two groups (fill = group variable)

library(ggplot2)
ggplot(df, aes(x=weight, fill=group)) +
  geom_histogram(color="black", alpha=0.5, position="identity")+ #a fixed fill="white" here would override fill=group
  geom_vline(aes(xintercept=mean(weight)), color="blue",
             linetype="dashed")+
  labs(title="Weight histogram plot",x="Weight(kg)", y = "Count")+
  theme_classic()

Click here for a more in-depth ggplot2 tutorial

Combining plots from ggplot2 in one window (and saving these plots) using ‘cowplot’

library(cowplot)
plot2by2 <- plot_grid(plot.1, plot.2, plot.3, plot.4, #names of plots
                      labels=c("A", "B", "C", "D"), ncol = 2)
save_plot("plot2by2.png", plot2by2,
          ncol = 2, # we're saving a grid plot of 2 columns
          nrow = 2, # and 2 rows
          # each individual subplot should have an aspect ratio of 1.3
          base_asp = 1.3) # this argument was called base_aspect_ratio in cowplot versions before 1.0

Tables & Output

Frequency Table:

animal<-c("cat","cat","dog","dog","seal","cat","seal")
table(animal)

## animal
##  cat  dog seal 
##    3    2    2

2-Way Cross-Tabulation Using ‘gmodels’:

library(gmodels)
varx<- c(1,2,3,4,4,4,5,1,2,3) #create var x
vary<- c(2,1,1,4,5,4,2,1,2,2) #create var y

CrossTable(varx, vary) #CrossTable  has a decent amount of things you can specify too!

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  10 
## 
##  
##              | vary 
##         varx |         1 |         2 |         4 |         5 | Row Total | 
## -------------|-----------|-----------|-----------|-----------|-----------|
##            1 |         1 |         1 |         0 |         0 |         2 | 
##              |     0.267 |     0.050 |     0.400 |     0.200 |           | 
##              |     0.500 |     0.500 |     0.000 |     0.000 |     0.200 | 
##              |     0.333 |     0.250 |     0.000 |     0.000 |           | 
##              |     0.100 |     0.100 |     0.000 |     0.000 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|
##            2 |         1 |         1 |         0 |         0 |         2 | 
##              |     0.267 |     0.050 |     0.400 |     0.200 |           | 
##              |     0.500 |     0.500 |     0.000 |     0.000 |     0.200 | 
##              |     0.333 |     0.250 |     0.000 |     0.000 |           | 
##              |     0.100 |     0.100 |     0.000 |     0.000 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|
##            3 |         1 |         1 |         0 |         0 |         2 | 
##              |     0.267 |     0.050 |     0.400 |     0.200 |           | 
##              |     0.500 |     0.500 |     0.000 |     0.000 |     0.200 | 
##              |     0.333 |     0.250 |     0.000 |     0.000 |           | 
##              |     0.100 |     0.100 |     0.000 |     0.000 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|
##            4 |         0 |         0 |         2 |         1 |         3 | 
##              |     0.900 |     1.200 |     3.267 |     1.633 |           | 
##              |     0.000 |     0.000 |     0.667 |     0.333 |     0.300 | 
##              |     0.000 |     0.000 |     1.000 |     1.000 |           | 
##              |     0.000 |     0.000 |     0.200 |     0.100 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|
##            5 |         0 |         1 |         0 |         0 |         1 | 
##              |     0.300 |     0.900 |     0.200 |     0.100 |           | 
##              |     0.000 |     1.000 |     0.000 |     0.000 |     0.100 | 
##              |     0.000 |     0.250 |     0.000 |     0.000 |           | 
##              |     0.000 |     0.100 |     0.000 |     0.000 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|
## Column Total |         3 |         4 |         2 |         1 |        10 | 
##              |     0.300 |     0.400 |     0.200 |     0.100 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|
## 
##

Create a Table Manually:

outTab<- round(cbind(cf3.2, se3.2, t3.2, p3.2, fmi3.2), 3) # 3=is how many digits to round 
colnames(outTab) <- c("Estimate", "SE", "t-Stat", "p-Value", "FMI") #these correspond to the values above
rownames(outTab) <- c("rowname1", "rowname2", "rowname3", "rowname4")
outTab

Export Table:

write.csv(outTab, paste0(output.name, "resTable.csv"), row.names = TRUE)

Note: the ‘apaTables’ package also has some helpful functions to quickly create nicely formatted tables that export directly into Microsoft Word.

Helpful Tricks (but no treats, sigh)

Paste and Paste0:

The difference between paste() and paste0() is the separator: paste() has a ‘sep’ argument that defaults to " " (a single space), while paste0() always uses "" (no separator).

stats <- paste("Sta", "tist", "ics")
stats
## [1] "Sta tist ics"

# versus

stats1 <- paste0("Sta", "tist", "ics")
stats1
## [1] "Statistics"

paste0("variable", 1:10)

##  [1] "variable1"  "variable2"  "variable3"  "variable4"  "variable5" 
##  [6] "variable6"  "variable7"  "variable8"  "variable9"  "variable10"

# add third variable
paste0("variable", 1:10, "self.report")

##  [1] "variable1self.report"  "variable2self.report" 
##  [3] "variable3self.report"  "variable4self.report" 
##  [5] "variable5self.report"  "variable6self.report" 
##  [7] "variable7self.report"  "variable8self.report" 
##  [9] "variable9self.report"  "variable10self.report"

# can add extra arguments such as 'sep' (paste only) or 'collapse' depending on what you want to do.
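
A quick sketch of what those two arguments do (the strings are made up): 'sep' goes between the pieces of each element, while 'collapse' glues all the resulting elements into one string.

```r
# sep goes between the pieces of each element (paste only)
with.sep <- paste("variable", 1:3, sep = "_")
with.sep
# "variable_1" "variable_2" "variable_3"

# collapse joins all elements into one single string
one.string <- paste0("variable", 1:3, collapse = " + ")
one.string
# "variable1 + variable2 + variable3"
```

The collapsed version can be handy for building the right-hand side of a model formula as text.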

Fun Sounds for Long Analyses:

Add a fun sound at the end of your code so you know your model is done!

library(beepr)
beep(sound = 1) ## has 1-10 diff sounds to choose from - #8 is mario :)

Change Max Print Default:

Long output? Change default print settings

options(max.print = 10000)

Find Column Number w/ Vbl Name:

Finding column number of a particular variable

which(colnames(df)=="vbl.name" )

Add Quotation Marks Easily:

Call/work with variable names without needing quotation marks using Cs() from the Hmisc package

I find this particularly helpful when I need to subset something!

sub.dat<- main.data[,c("var1","var2","var3")] #subsetting normally

# is equivalent to:
library(Hmisc)
sub.dat<- main.data[,Cs(var1, var2, var3)]

noquote:

Conversely, get rid of quotations using ‘noquote’

example<- c("var1","var2","var3")
noquote(example)

## [1] var1 var2 var3

Positive Definite Check:

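Checking if a matrix is positive definite! (comes in handy for SEM models)

library(matrixcalc)
is.positive.definite(matrix.name) #returns TRUE if the (symmetric) matrix is positive definite
is.singular.matrix(matrix.name) #related check: returns TRUE if the matrix is singular/non-invertible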
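
Keep in mind that "not singular" and "positive definite" are not the same thing: a matrix can be invertible and still have a negative eigenvalue. If you'd rather not depend on a package, a check can be sketched in base R via the eigenvalues. The helper name is_pos_def, the tolerance, and the two matrices below are all made up for illustration.

```r
# Made-up symmetric matrices
pd.mat     <- matrix(c(2, 1, 1, 2), nrow = 2)   # eigenvalues 3 and 1
non.pd.mat <- matrix(c(1, 2, 2, 1), nrow = 2)   # eigenvalues 3 and -1

# Hypothetical helper: TRUE if every eigenvalue exceeds a small tolerance
# (assumes m is a symmetric numeric matrix)
is_pos_def <- function(m, tol = 1e-8) {
  ev <- eigen(m, symmetric = TRUE, only.values = TRUE)$values
  all(ev > tol)
}

is_pos_def(pd.mat)       # TRUE
is_pos_def(non.pd.mat)   # FALSE, even though this matrix is invertible
```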

dplyr tricks:

dplyr has a ton of helpful commands. There are some really good tutorials out there that explore the many useful aspects of this package, such as this one here.

purrr tricks:

Also a really handy package for various things. Check out this tutorial on purrr here.

tidyr tricks:

Again, there are many great features that tidyr has. See the tutorial here or also here. The latter link has dplyr functions included too!

One of my favorites in tidyr, though, is the ‘complete’ function. The complete() function allows you to fill in the gaps for all combinations of observations that had no data. Essentially, you define the variables whose combinations you want to complete, & then tell R what value to plug into the missing gaps.

data %>%
  tidyr::complete(GENDER, OCCUPATION, fill = list(COUNT = 0))
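
Here is a small runnable sketch of the same idea. The data are made up (the Female/Nurse combination is deliberately left out), and it assumes the tidyr package is installed.

```r
library(tidyr)

# Made-up counts; the Female/Nurse combination was never observed
data <- data.frame(GENDER = c("Male", "Male", "Female"),
                   OCCUPATION = c("Nurse", "Teacher", "Teacher"),
                   COUNT = c(3, 5, 2),
                   stringsAsFactors = FALSE)

# Add a row for every GENDER x OCCUPATION combination, filling COUNT with 0
full <- complete(data, GENDER, OCCUPATION, fill = list(COUNT = 0))
nrow(full)   # 4: all 2 x 2 combinations are now present
full$COUNT[full$GENDER == "Female" & full$OCCUPATION == "Nurse"]   # 0
```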

more tips soon to come!

Helpful R Cheat Sheet

Allie Choate

06/01/2019