This cheat sheet contains commands that I either find myself googling frequently or that are just generally good functions/commands to be familiar with for a variety of analyses! I will likely update this over time.
Read in a delimited text file
df <- read.table("file.txt")
# if you have a missing code, can do:
df <- read.table("file.txt", na.strings = "-99")
# if you have multiple missing codes, can do:
df <- read.table("file.txt", na.strings = c("-98", "-99", "-999"))
Similarly, read in a csv file
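A minimal sketch of the csv case, assuming the same missing codes as above. The temporary file and its contents are made up so the example runs on its own.

```r
# Write a small csv to a temporary file so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,score", "1,10", "2,-99", "3,7"), csv_path)

# read.csv() assumes a header row and comma separators by default
df_csv <- read.csv(csv_path, na.strings = c("-98", "-99", "-999"))
df_csv
```

The -99 in the second row comes back as NA, just like with read.table().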
Read OUT a delimited text file
Read OUT a csv file
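One way the two "read OUT" steps above might look, using write.table() and write.csv() from base R. The data frame, the temporary file paths, and the -99 missing code are assumptions for illustration only.

```r
# Hypothetical data frame used only for this illustration
out_df <- data.frame(id = 1:3, score = c(10, NA, 7))

txt_path <- tempfile(fileext = ".txt")
csv_path <- tempfile(fileext = ".csv")

# Tab-delimited text file; row.names = FALSE drops the 1, 2, 3 row labels
write.table(out_df, txt_path, sep = "\t", row.names = FALSE,
            quote = FALSE, na = "-99")

# csv file; missing values are written as the string given in 'na'
write.csv(out_df, csv_path, row.names = FALSE, na = "-99")

# Read both files back in to check the round trip
back_txt <- read.table(txt_path, header = TRUE, sep = "\t", na.strings = "-99")
back_csv <- read.csv(csv_path, na.strings = "-99")
```

Leaving out the 'na' argument writes missing values as the text NA instead.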
Import an SPSS data file into R:
library(foreign) # read.spss() comes from the 'foreign' package
spss.dat <- read.spss(spss.filename, use.value.labels=TRUE, to.data.frame=TRUE,
                      max.value.labels=Inf, trim.factor.names=FALSE)
Write an R-based dataset to SPSS file using ‘foreign’:
library(foreign)
write.foreign(data.frame, # dataframe name in R
              "mydata.txt", # text file that the raw data get written to
              "mydata1.sps", # SPSS syntax file that reads that data file into SPSS
              package="SPSS")
Check structure/dimension of a df or variable
str(dat) # check structure of data
dim(dat) # check the dimensions of data (number of rows, number of columns)
class(dat$vbl) # check what class a vbl is
Print First or Last 10 Rows of the Data
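A quick sketch with head() and tail(). The data frame is made up for illustration.

```r
dat <- data.frame(id = 1:25, score = seq(50, 98, by = 2))  # made-up data

head(dat, 10)  # first 10 rows
tail(dat, 10)  # last 10 rows
```

Both functions show 6 rows if you leave out the second argument.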
View Names in df
Re-name all names in dataset, generally speaking
Re-name only Row Names:
Re-Name only Columns
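A sketch covering the viewing and re-naming steps above with base R. The data frame and all of the new names are made up for illustration.

```r
df_names <- data.frame(a = 1:3, b = 4:6, c = 7:9)  # made-up data

names(df_names)     # view the column names; colnames(df_names) is equivalent
rownames(df_names)  # view the row names

# Re-name all columns at once (one new name per column, in order)
names(df_names) <- c("id", "pre", "post")

# Re-name only the row names
rownames(df_names) <- c("s1", "s2", "s3")

# Re-name a single column by matching its current name
colnames(df_names)[colnames(df_names) == "pre"] <- "baseline"
df_names
```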
Setting a Variable as Ordinal/Numeric/Categorical
Change the level names of a categorical variable
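A sketch of how the variable types and level names above might be set in base R. The data, the variable names, and the level labels are assumptions for illustration.

```r
dat <- data.frame(edu = c(1, 3, 2, 1), grp = c("a", "b", "a", "b"))  # made-up data

dat$edu_num <- as.numeric(dat$edu)                                   # numeric
dat$grp_fac <- factor(dat$grp)                                       # categorical
dat$edu_ord <- factor(dat$edu, levels = c(1, 2, 3), ordered = TRUE)  # ordinal

# Change the level names; the new names replace the old ones in level order
levels(dat$grp_fac) <- c("control", "treatment")
levels(dat$edu_ord) <- c("low", "medium", "high")
str(dat)
```

Check levels(dat$grp_fac) before renaming so the new labels line up with the right old ones.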
Change ALL variables to be all numeric:
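One way this might be done in base R, assuming every column really does hold numbers stored as text or as a factor. The data frame is made up for illustration.

```r
dat <- data.frame(a = c("1", "2", "3"),
                  b = factor(c("10", "20", "30")),
                  stringsAsFactors = FALSE)  # made-up data

# Go through as.character() first: as.numeric() on a factor returns the
# internal level codes (1, 2, 3), not the labels (10, 20, 30)
dat[] <- lapply(dat, function(col) as.numeric(as.character(col)))
str(dat)
```

Assigning into dat[] keeps the object a data frame instead of turning it into a list.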
# x = some variable
sd(df$x) # Calculate standard deviation
range(df$x) # quickly get min and max of a variable
summary(df$x) # Returns a summary of x: mean, min, max etc.
t.test(df$x) # One-sample Student's t-test (tests whether the mean of x differs from 0)
var(df$x) # Calculate variance
density(df$x) # Compute kernel density estimates
table(df$Gender) # counts for gender categories
table(df$`Marital Status`, df$Gender) # cross-classification counts for 'marital status' by 'gender'
### Correlation Tests Between Pairs of Variables:
library(Hmisc) # rcorr() comes from the Hmisc package
# making vectors of values here
x <- c(-2, -1, 0, 1, 2)
y <- c(4, 1, 0, 1, 4)
z <- c(1, 2, 3, 4, 2)
v <- c(1, 2, 0, 4, 5)
#combine columns and create correlation matrix
rcorr(cbind(x,y,z,v))
##      x     y     z    v
## x 1.00  0.00  0.55 0.76
## y 0.00  1.00 -0.70 0.39
## z 0.55 -0.70  1.00 0.23
## v 0.76  0.39  0.23 1.00
##
## n= 5
##
##
## P
##   x      y      z      v
## x        1.0000 0.3318 0.1339
## y 1.0000        0.1852 0.5203
## z 0.3318 0.1852        0.7065
## v 0.1339 0.5203 0.7065
sum(is.na(df)) #easy way to see how many 'NAs' are in dataset
sum(is.na(df$x)) #or similarly, check for a particular variable
# Omit any missing data entries
clean.df <- na.omit(df)
Function to change missing values to ‘NA’ (useful if you forgot to tell R what your missing values were when you read in your data)
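A minimal sketch of such a function. The name recode_missing and the default missing codes are hypothetical, not something from a package.

```r
# Hypothetical helper: replace every value listed in 'codes' with NA
recode_missing <- function(dat, codes = c(-98, -99, -999)) {
  dat[] <- lapply(dat, function(col) {
    col[col %in% codes] <- NA
    col
  })
  dat
}

raw <- data.frame(x = c(1, -99, 3), y = c(-999, 5, 6))  # made-up data
clean <- recode_missing(raw)
clean
```

For a single code, df[df == -99] <- NA does the same job in one line.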
Look at the percent missing:
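A quick way to get this in base R, since the mean of a TRUE/FALSE matrix is the proportion of TRUEs. The small missData data frame here is made up for illustration.

```r
missData <- data.frame(x = c(1, NA, 3, NA), y = c(NA, 2, 3, 4))  # made-up data

pct_by_var <- colMeans(is.na(missData)) * 100  # percent missing per variable
pct_total  <- mean(is.na(missData)) * 100      # percent missing over the whole dataset

pct_by_var
pct_total
```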
Compute the covariance coverage using the md.pairs() in the mice package
library(mice)
library(mitools)
cover <- md.pairs(missData)$rr / nrow(missData)
cover
hist(cover)
range(cover)
# Compute the unique response patterns:
pats <- md.pattern(missData)
pats
1.) my_df[1:3] (no comma) will subset my_df, returning the first three columns as a data frame.
2.) my_df[1:3, ] (with comma, numbers to left of the comma) will subset my_df and return the first three rows as a data frame.
3.) my_df[, 1:3] (with comma, numbers to right of the comma) will subset my_df and return the first three columns as a data frame, the same as my_df[1:3].
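The three rules above can be checked on a small example. The data frame is made up for illustration.

```r
my_df <- data.frame(a = 1:4, b = 5:8, c = 9:12, d = 13:16)  # made-up data

first_cols  <- my_df[1:3]    # rule 1: no comma, so columns 1 to 3
first_rows  <- my_df[1:3, ]  # rule 2: left of the comma, so rows 1 to 3
first_cols2 <- my_df[, 1:3]  # rule 3: right of the comma, so columns 1 to 3

dim(first_cols)                      # 4 rows, 3 columns
dim(first_rows)                      # 3 rows, 4 columns
all.equal(first_cols, first_cols2)   # rules 1 and 3 give the same result
```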
For subsetting rows:
# creating a fake df
df <- data.frame(x = 1:4, y = 4:1, z = letters[1:4])
# Subsetting the row where x equals 3
df[df$x == 3, ]
##   x y z
## 3 3 2 c
# Calling Row 1 & 3
df[c(1, 3), ]
##   x y z
## 1 1 4 a
## 3 3 2 c
For subsetting columns:
# There are two ways to select columns from a data frame:
# 1. Like a list
new.df <- df[c("x", "z")] #based on vbl names
new.df <- df[c(1:3)] # based on vbl columns numbers, bracket indicates we're subsetting column 1 through 3.
# 2. Like a matrix
new.df <-df[, c("x", "z")]
new.df <-df[, c(1:3)]
# There's an important difference if you select a SINGLE column...
# ...matrix subsetting simplifies by default, list subsetting does not.
str(df["x"])
## 'data.frame': 4 obs. of 1 variable:
## $ x: int 1 2 3 4
# VS
str(df[, "x"])
## int [1:4] 1 2 3 4
Quick Subsetting of Rows & Columns Together:
# create random data. For example, draw 100 random values from a N(10,1) and arrange them in a 10 x 10 matrix:
fake.dat<-matrix(rnorm(n=100,mean=10,sd=1),ncol=10, nrow=10)
# subset rows 1-6 and columns 1-4
new.df<-fake.dat[1:6, 1:4]
print(new.df)
## [,1] [,2] [,3] [,4]
## [1,] 9.254785 12.614388 9.346585 10.090685
## [2,] 9.318042 8.092890 10.899995 10.440230
## [3,] 11.330883 9.718729 12.053058 9.108824
## [4,] 10.699723 9.391665 11.243303 7.814188
## [5,] 10.043442 8.910559 10.507499 10.974342
## [6,] 9.537012 9.758896 10.937656 9.163288
Can see that we successfully selected rows 1-6 and columns 1-4.
If you wanted to select columns or rows that were not right next to each other, can do this:
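A sketch of what such a call might look like. The indices are illustrative, and a small made-up integer matrix stands in for fake.dat, so the values will not match the output shown after this block.

```r
# Small made-up matrix: entries run 1 to 30 down the columns
m <- matrix(1:30, nrow = 6, ncol = 5)

# c() lists the row and column numbers; they do not have to be adjacent
picked <- m[c(1, 4, 6), c(1, 3, 5)]
picked
```

The same pattern works on a data frame, e.g. df[c(1, 4), c(1, 3)].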
## [,1] [,2] [,3]
## [1,] 9.254785 12.614388 9.346585
## [2,] 10.699723 9.391665 11.243303
## [3,] 9.612798 8.997450 11.177066
3 ways to delete a column quickly:
new.data<- df[colnames(df)!="vblname.to.exclude"]
# Can also do this:
df$vbl.to.exclude<- NULL
#OR if you know column(s) you want to remove, can do this:
new.data<- df[-c(2:4)] #this removes columns 2:4
Subsetting based on a level of a factor: this will subset all of the data and include only the rows at that specific ‘level’ of a categorical variable.
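A minimal sketch of this in base R. The data frame, the factor name, and its levels are made up for illustration.

```r
dat <- data.frame(id = 1:6,
                  group = factor(c("ctrl", "tx", "ctrl", "tx", "tx", "ctrl")))  # made-up data

tx_only <- dat[dat$group == "tx", ]  # keep only the rows at the level "tx"
tx_only <- droplevels(tx_only)       # drop the now-empty "ctrl" level

# subset(dat, group == "tx") selects the same rows
table(tx_only$group)
```

Without droplevels(), the unused level sticks around and shows up as a zero count in tables and plots.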
Subsetting based on specific values of a variable
# Select cases only where Vbl1 >= 20 and Vbl2 > 10 + select a few other variables for a new dataset.
newdata <- subset(df, Vbl1 >= 20 & Vbl2 > 10,
                  select=c(Vbl1, Vbl2, gender, ID)) # These are all the variables you want to include in the new df
Another example for subsetting based on specific values of a variable
# Select only cases where gender is 0 (male), and 'depression' is greater than 10. Also select all variables from 'Anxiety' to 'PTSD'
newdata2 <- subset(df, gender == "0" & depression > 10, select=c(Anxiety:PTSD))
Subsetting THIS condition OR that condition
#If we wanted all rows where either d == "A" or a > 0.5, use the OR operator (|):
my_df[which(my_df$d == "A" | my_df$a > 0.5),]
dplyr is nice if you want to subset from your original dataset based on a set of variables that are similarly named. For example:
library(dplyr)
# Variable starts with...
df_pract<-select(data, starts_with("var")) #example would be "var1","var2","var3", etc.
# Variable ends with...
df_pract<-select(data, ends_with("14"))
To demonstrate, making fake data
library(car)
survey <- data.frame("var1" = sample(x = 1:5, size = 20, replace = TRUE),
                     "var2" = rep(1:5, each=4),
                     "age" = sample(x = 10:45, size = 20, replace = TRUE))
Now, let’s just change var1 to have different values (currently has values 1-5)
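A sketch of how this might look with recode() from car. The new values are made up, and a small stand-in data frame (survey_small, hypothetical) is used so the survey data above stays untouched.

```r
library(car)  # for recode(); written as car::recode() below in case dplyr is also loaded

# Small made-up stand-in with the same 1-5 coding as survey$var1
survey_small <- data.frame(var1 = c(1, 2, 3, 4, 5, 1))

# Each "old=new" pair is separated by a semicolon; values not listed stay as they are
survey_small$var1 <- car::recode(survey_small$var1, "1=10; 2=20; 5=50")
survey_small$var1
```

On the real data the same call would be survey$var1 <- car::recode(survey$var1, ...).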
If we wanted to recode values in all columns of the entire survey dataset, we could do so like this:
library(car)
recode.cols<- apply(survey, 2, function(x) {x<- recode(x, "5=6;1=1.5"); x})
head(recode.cols)
## var1 var2 age
## [1,] 6 1.5 45
## [2,] 4 1.5 19
## [3,] 2 1.5 19
## [4,] 2 1.5 43
## [5,] 3 2.0 21
## [6,] 3 2.0 20
# to recode row by row instead, just change the '2' to a '1'
# note that apply() then returns the result transposed: one column per original row
recode.rows<- apply(survey, 1, function(x) {x<- recode(x, "5=6;1=1.5"); x})
head(recode.rows)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
## [1,] 6.0 4.0 2.0 2.0 3 3 3 3 2 3 1.5 3 1.5
## [2,] 1.5 1.5 1.5 1.5 2 2 2 2 3 3 3.0 3 4.0
## [3,] 45.0 19.0 19.0 43.0 21 20 21 32 24 17 13.0 15 35.0
## [,14] [,15] [,16] [,17] [,18] [,19] [,20]
## [1,] 1.5 6 2 3 3 2 2
## [2,] 4.0 4 4 6 6 6 6
## [3,] 45.0 11 27 13 40 40 41
# Sum score across specific columns 1-9 (one total per row)
# na.rm = TRUE ---> ignores missing values in computations
data$sum <- rowSums(data[c(1:9)], na.rm=TRUE)
#or create a sum score using the specific variable names
rowSums(data[,c('vbl1', 'vbl2', 'vbl3')], na.rm=TRUE)
# if you wanted the column sums, could also do 'colSums'
Using same dataset that was generated above (i.e., ‘survey’) to demonstrate plots!
boxplot with 1 variable
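A minimal sketch for the one-variable case. A small made-up data frame (survey_demo, hypothetical) stands in for survey here so the block runs on its own.

```r
survey_demo <- data.frame(age = c(12, 18, 23, 31, 35, 44))  # made-up stand-in for 'survey'

# boxplot() draws the plot and invisibly returns the numbers behind it
bp <- boxplot(survey_demo$age, main = "Age", ylab = "Age (years)", col = "blue")

# lower whisker, lower hinge, median, upper hinge, upper whisker
bp$stats
```

With the real data this is just boxplot(survey$age).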
Boxplot with 2 variables
survey$var2.fact<-as.factor(survey$var2) #changing to a factor to work w/ boxplot
boxplot(age ~ var2.fact, data=survey, col="blue")
#interaction plot - interaction of variable of interest x age by gender
interaction.plot(main.data$age,main.data$gender,main.data$othervbl,
xlab="Age",ylab="vblname",trace.label = "Gender")
Stacking multiple plots within the same window
# for stacking plots - set the two numbers to the grid of plots you want
par(mfrow=c(1,2)) # num of rows and num of columns
# hist() takes one variable at a time (a second unnamed argument would be read as 'breaks')
hist(x, xlab="x axis title", ylab="y axis title", main="graph title")
hist(y, xlab="x axis title", ylab="y axis title", main="graph title2")
Histogram with two groups — fill = group variable
library(ggplot2)
ggplot(df, aes(x=weight, fill=group)) +
geom_histogram(color="black", alpha=0.5, position="identity")+ # no fixed fill here, so the fill=group mapping is used
geom_vline(aes(xintercept=mean(weight)), color="blue",
linetype="dashed")+
labs(title="Weight histogram plot",x="Weight(kg)", y = "Count")+
theme_classic()
Click here for a more in-depth ggplot2 tutorial
Combining plots from ggplot2 in one window (and saving these plots) using ‘cowplot’
library(cowplot)
plot2by2 <- plot_grid(plot.1, plot.2, plot.3, plot.4, #names of plots
labels=c("A", "B", "C", "D"), ncol = 2)
save_plot("plot2by2.png", plot2by2,
ncol = 2, # we're saving a grid plot of 2 columns
nrow = 2, # and 2 rows
          # each individual subplot should have an aspect ratio of 1.3
          base_asp = 1.3) # this argument was called base_aspect_ratio before cowplot 1.0
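Quick frequency counts come from table(). A sketch that gives counts like the ones printed next, with a made-up animal vector chosen to match them:

```r
# Made-up vector chosen so the counts come out as cat = 3, dog = 2, seal = 2
animal <- c("cat", "dog", "seal", "cat", "dog", "seal", "cat")

counts <- table(animal)  # one count per unique value
counts
```

prop.table(counts) turns the counts into proportions.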
## animal
## cat dog seal
## 3 2 2
library(gmodels)
varx<- c(1,2,3,4,4,4,5,1,2,3) #create var x
vary<- c(2,1,1,4,5,4,2,1,2,2) #create var y
CrossTable(varx, vary) #CrossTable has a decent amount of things you can specify too!
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 10
##
##
## | vary
## varx | 1 | 2 | 4 | 5 | Row Total |
## -------------|-----------|-----------|-----------|-----------|-----------|
## 1 | 1 | 1 | 0 | 0 | 2 |
## | 0.267 | 0.050 | 0.400 | 0.200 | |
## | 0.500 | 0.500 | 0.000 | 0.000 | 0.200 |
## | 0.333 | 0.250 | 0.000 | 0.000 | |
## | 0.100 | 0.100 | 0.000 | 0.000 | |
## -------------|-----------|-----------|-----------|-----------|-----------|
## 2 | 1 | 1 | 0 | 0 | 2 |
## | 0.267 | 0.050 | 0.400 | 0.200 | |
## | 0.500 | 0.500 | 0.000 | 0.000 | 0.200 |
## | 0.333 | 0.250 | 0.000 | 0.000 | |
## | 0.100 | 0.100 | 0.000 | 0.000 | |
## -------------|-----------|-----------|-----------|-----------|-----------|
## 3 | 1 | 1 | 0 | 0 | 2 |
## | 0.267 | 0.050 | 0.400 | 0.200 | |
## | 0.500 | 0.500 | 0.000 | 0.000 | 0.200 |
## | 0.333 | 0.250 | 0.000 | 0.000 | |
## | 0.100 | 0.100 | 0.000 | 0.000 | |
## -------------|-----------|-----------|-----------|-----------|-----------|
## 4 | 0 | 0 | 2 | 1 | 3 |
## | 0.900 | 1.200 | 3.267 | 1.633 | |
## | 0.000 | 0.000 | 0.667 | 0.333 | 0.300 |
## | 0.000 | 0.000 | 1.000 | 1.000 | |
## | 0.000 | 0.000 | 0.200 | 0.100 | |
## -------------|-----------|-----------|-----------|-----------|-----------|
## 5 | 0 | 1 | 0 | 0 | 1 |
## | 0.300 | 0.900 | 0.200 | 0.100 | |
## | 0.000 | 1.000 | 0.000 | 0.000 | 0.100 |
## | 0.000 | 0.250 | 0.000 | 0.000 | |
## | 0.000 | 0.100 | 0.000 | 0.000 | |
## -------------|-----------|-----------|-----------|-----------|-----------|
## Column Total | 3 | 4 | 2 | 1 | 10 |
## | 0.300 | 0.400 | 0.200 | 0.100 | |
## -------------|-----------|-----------|-----------|-----------|-----------|
##
##
outTab<- round(cbind(cf3.2, se3.2, t3.2, p3.2, fmi3.2), 3) # 3 = how many digits to round to
colnames(outTab) <- c("Estimate", "SE", "t-Stat", "p-Value", "FMI") #these correspond to the values above
rownames(outTab) <- c("rowname1", "rowname2", "rowname3", "rowname4")
outTab
Note: the package ‘apaTables’ also has some helpful functions to easily and quickly create nicely formatted tables that will export directly into Microsoft Word.
The difference between paste() and paste0() is the default of the ‘sep’ argument: paste() puts a space (" ") between the pieces, while paste0() puts nothing ("") between them.
stats <- paste("Sta", "tist", "ics")
stats
## [1] "Sta tist ics"
# versus
stats1 <- paste0("Sta", "tist", "ics")
stats1
## [1] "Statistics"
# paste0() is vectorized, so it can build a whole set of variable names at once
paste0("variable", 1:10)
## [1] "variable1" "variable2" "variable3" "variable4" "variable5"
## [6] "variable6" "variable7" "variable8" "variable9" "variable10"
# add a suffix to every name
paste0("variable", 1:10, "self.report")
## [1] "variable1self.report" "variable2self.report"
## [3] "variable3self.report" "variable4self.report"
## [5] "variable5self.report" "variable6self.report"
## [7] "variable7self.report" "variable8self.report"
## [9] "variable9self.report" "variable10self.report"
Add a fun sound at the end of your code so you know your model is done!
Finding column number of a particular variable
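A few base R ways to do this. The data frame and variable names are made up for illustration.

```r
dat <- data.frame(id = 1:3, age = c(20, 31, 45), score = c(7, 9, 4))  # made-up data

one_col   <- which(colnames(dat) == "age")         # position of one variable
many_cols <- match(c("score", "id"), colnames(dat)) # positions of several, in the order asked
by_prefix <- grep("^s", colnames(dat))              # positions of names starting with "s"

one_col
many_cols
by_prefix
```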
Call/work with variable names without needing quotation marks using Cs() from the Hmisc package
sub.dat<- main.data[,c("var1","var2","var3")] #subsetting normally
# is equivalent to:
library(Hmisc)
sub.dat<- main.data[,Cs(var1, var2, var3)]
Conversely, get rid of quotations using ‘noquote’
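A call along these lines, using noquote() from base R on a made-up character vector, prints the names without quotes:

```r
quoted <- c("var1", "var2", "var3")  # made-up variable names
print(quoted)                        # printed with quotation marks

unquoted <- noquote(quoted)          # same strings, printed without quotes
print(unquoted)
```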
## [1] var1 var2 var3
Checking if Matrix is Positive Definite! (comes in handy for SEM models)
library(matrixcalc)
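is.singular.matrix(matrix.name) # If matrix is singular/non-invertible, it returns TRUE.
is.positive.definite(matrix.name) # If the (symmetric) matrix is positive definite, it returns TRUE.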
dplyr has a ton of helpful commands. There are some really good tutorials out there that explore the many useful aspects of this package, such as this one here.
Again, there are many great features that tidyr has. See the tutorial here or also here. The latter link has dplyr functions included too!
One of my favorites though in tidyr is the ‘complete’ function. The complete() function allows you to fill in the gaps for all observations that had no data. Essentially, you can define the observations that you want to complete, and then tell R what value to plug into the missing gaps.
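A small sketch of complete() in action. The data, the column names, and the fill value of 0 are made up for illustration.

```r
library(tidyr)

# Made-up long-format data: id 2 has no row for time 2
obs <- data.frame(id = c(1, 1, 2), time = c(1, 2, 1), score = c(5, 7, 6))

# Build every id x time combination and fill the new gap in 'score' with 0
full <- complete(obs, id, time, fill = list(score = 0))
full
```

Leaving out the 'fill' argument adds the missing row with NA in score instead.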
more tips soon to come!