One of the big upsides to R is a feature we call “extensibility.” In other words, users all around the world can create their own R solutions and make them available to others, which is a big selling point. Almost any complex problem that has needed solving in R has been solved by someone out there in the world, and often they have published those solutions for the rest of us to use. Thanks, everyone.
First, it is important to note that being able to write functions yourself is a convenience, not a requirement, of working in R. However, I think a general understanding of how functions work will help you become a better R programmer.
Let’s start with an example: the modulus (a.k.a. “modulo”) function. The term may be unfamiliar, but it is an operation I’m sure you know very well. The modulus is basically the remainder in division. So, for example, 17 divided by 5 is 3 with a remainder of 2, which means 17 modulo 5 is 2.
In R, there is actually a modulo operator, %%, so you can test this yourself.
17 %% 5
[1] 2
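To make the “remainder in division” idea concrete, here is a small sketch that pairs %% with R’s integer-division operator, %/%. The variable names (y, x, quotient, remainder) are just illustrative choices, not anything from the tutorial’s data.

```r
# Split 17 divided by 5 into its whole-number quotient and its remainder
y <- 17
x <- 5
quotient  <- y %/% x   # integer division: how many whole times x fits into y
remainder <- y %% x    # modulo: what is left over

quotient
remainder

# The two pieces reassemble the original number
quotient * x + remainder == y
```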
Now, what if there hadn’t been a modulo operator?
Let’s make a function! The basic setup for making a function looks like the following.
myFunctionName <- function(argumentNames) {
[STUFF THE FUNCTION DOES TO THE ARGUMENTS]
return([WHATEVER THE FUNCTION GIVES BACK TO YOU])
}
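Before we make the modulo function, what steps do we have to do in order to calculate the modulo? Here’s my proposal to get Y modulo X, where Y and X are whole numbers: divide Y by X, round the result down to the nearest whole number, multiply that whole number by X, and subtract the product from Y.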
In equation terms:
$$Y \bmod X = Y - X \left\lfloor \frac{Y}{X} \right\rfloor$$
So let’s make a function!
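modulo <- function(y, x) {
  q <- y / x        # divide y by x
  q <- floor(q)     # round the quotient down to a whole number
  p <- q * x        # multiply the whole-number quotient by x
  mod <- y - p      # what is left over is the remainder
  return(mod)
}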
I have followed exactly those steps I laid out above. Let’s test it!
modulo(17, 5)
[1] 2
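One test case is encouraging, but it is easy to check many at once. The sketch below repeats the modulo definition so it runs on its own, then compares it against the built-in %% operator over a grid of whole numbers. The ranges 0 to 50 and 1 to 10 are arbitrary choices for illustration.

```r
# Same function as above, repeated so this block is self-contained
modulo <- function(y, x) {
  q <- floor(y / x)
  p <- q * x
  y - p
}

# outer() applies a function to every combination of the two vectors
ours    <- outer(0:50, 1:10, modulo)
builtin <- outer(0:50, 1:10, "%%")

# TRUE if our function agrees with %% for every combination tried
matches_builtin <- all(ours == builtin)
matches_builtin
```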
Note that R matches unnamed arguments to the function’s arguments by position, in the order they are entered: here, 17 goes to y and 5 goes to x. If I wanted to be more explicit, I would name each argument as follows.
modulo(y=17, x=5)
[1] 2
And, with them explicitly named, I can put them in any order I want.
modulo(x=5, y=17)
[1] 2
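To see why the order matters when arguments are not named, here is a short sketch (the function is repeated so the block runs on its own). Swapping unnamed arguments changes the question being asked, while named arguments are matched by name first and any unnamed ones fill the remaining slots in order.

```r
# Same function as above, repeated so this block is self-contained
modulo <- function(y, x) {
  q <- floor(y / x)
  p <- q * x
  y - p
}

by_position <- modulo(17, 5)      # y = 17, x = 5: remainder of 17 / 5
swapped     <- modulo(5, 17)      # y = 5, x = 17: remainder of 5 / 17
mixed       <- modulo(x = 5, 17)  # x is named, so 17 fills the remaining slot, y

by_position   # 2
swapped       # 5
mixed         # 2
```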
Now, it’s not absolutely essential for you to know how to create functions yourself, but it can come in handy if there’s some operation that you will be doing over and over again in your analysis. Instead of writing out all of the steps every time, you can just make a function that you can use again and again!
For example, I often find that I create the same type of summary statistics table in every paper that I do. I created a function called getvarsums, which I use in every analysis, that creates the formatted table.
# Get summary statistics for a set of variables
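getvarsums <- function(x, items) {
  # For each column of x, compute mean, sd, min, and max, rounded to 2 decimals
  res <- data.frame(t(data.frame(lapply(x,
    function(x) rbind(mean = round(mean(x, na.rm = TRUE), 2),
                      sd   = round(sd(x, na.rm = TRUE), 2),
                      min  = round(min(x, na.rm = TRUE), 2),
                      max  = round(max(x, na.rm = TRUE), 2))))))
  # Put the descriptive labels in front and give the columns readable names
  res <- cbind(items, res)
  names(res) <- c("", "Mean", "Std. Dev.", "Minimum", "Maximum")
  return(res)
}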
You don’t have to know exactly what that all is supposed to do, but you can see from the way I use it how it works. Copy and paste the above in your script editor and run it to set up the function. You should now see it in your environment window.
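If you would like to try the function before loading any real data, a tiny made-up data frame is enough to see what it does. The sketch below repeats the getvarsums definition so it runs on its own; the data frame toy, its columns a and b, and the labels are hypothetical and exist only for this illustration.

```r
# getvarsums as defined above, repeated so this block runs on its own
getvarsums <- function(x, items) {
  res <- data.frame(t(data.frame(lapply(x,
    function(x) rbind(mean = round(mean(x, na.rm = TRUE), 2),
                      sd   = round(sd(x, na.rm = TRUE), 2),
                      min  = round(min(x, na.rm = TRUE), 2),
                      max  = round(max(x, na.rm = TRUE), 2))))))
  res <- cbind(items, res)
  names(res) <- c("", "Mean", "Std. Dev.", "Minimum", "Maximum")
  return(res)
}

# Hypothetical toy data: two numeric columns, one with a missing value
toy <- data.frame(a = c(1, 2, 3, NA), b = c(10, 20, 30, 40))

# Supply one label per column, in the same order as the columns
toytable <- getvarsums(toy, c("Variable A", "Variable B"))
toytable
```

Because the function uses na.rm = TRUE, the missing value in column a is ignored rather than turning every statistic into NA.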
Let’s use it on all of those “cst” variables in the api dataset. Before we begin, let’s read in that API data.
# Sets working directory
setwd("C:/Users/Richard/Desktop/R Introduction")
# Reads in data file
api <- read.csv("apidata_2012_so.csv")
Now let’s use that function!
# Creates subset of data with the variables I want
apisubset <- api[c("cst28_engl", "cst28_math", "cst911_engl", "cst911_math")]
# Runs function
getvarsums(apisubset, c("2-8 English", "2-8 Math", "9-11 English", "9-11 Math"))
                            Mean Std. Dev. Minimum Maximum
cst28_engl   2-8 English 3015.41  10762.86       0  303153
cst28_math      2-8 Math 3012.04  10752.71       0  302887
cst911_engl 9-11 English 1277.79   4472.24       0  113705
cst911_math    9-11 Math 1205.57   4208.97       0  106209
Note that I could have output that table to an object and then copied and pasted the numbers into anything from Excel to Word.
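As a rough sketch of that workflow, the block below stores a table in an object and writes it to a CSV file that Excel can open. The data frame sumtable is a made-up stand-in for the function’s output, and the file name and the use of a temporary folder are assumptions for illustration; in practice you would pick your own location.

```r
# Made-up stand-in for a summary table (not real results)
sumtable <- data.frame(Variable = c("Item A", "Item B"),
                       Mean     = c(2, 25),
                       Maximum  = c(3, 40))

# Write the table to a CSV file in R's temporary folder
outfile <- file.path(tempdir(), "summary_table.csv")
write.csv(sumtable, outfile, row.names = FALSE)

# Read it back in to confirm what was written
readback <- read.csv(outfile)
readback
```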
People across the world have created hundreds of helpful functions for R that do everything from mining text data to creating custom graphics formats. R comes with a base set of functions, but often you will want to use one that isn’t included in base R.
People often release these functions in groups called “packages.” For example, Matt Shotwell of Vanderbilt University created a package called sas7bdat that contains a function that can read in SAS format files.
To download a package, use the install.packages() function.
install.packages("sas7bdat")
You will know that it finished successfully when you see something along the lines of “package ‘sas7bdat’ successfully unpacked and MD5 sums checked” at the end of the output. Sometimes, when installing one package, R will go ahead and install other packages on which that package depends. You can also install packages by going to the menu bar and selecting “Tools → Install Packages…”
Packages come with their own help files and documentation. You only have to install a package once; after that, it stays in your R library.
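Since a package only needs to be installed once, it can be handy to check what is already there before installing. The sketch below is one way to do that using installed.packages() from base R; the helper name find_missing is hypothetical, and the install line is left commented out because it needs an internet connection.

```r
# Hypothetical helper: which of these packages are NOT installed yet?
find_missing <- function(pkgs) {
  # Row names of installed.packages() are the names of installed packages
  installed <- rownames(installed.packages())
  setdiff(pkgs, installed)
}

# "stats" ships with R, so it should never show up as missing
find_missing(c("stats", "sas7bdat"))

# To install only what is missing, uncomment the next line:
# install.packages(find_missing(c("sas7bdat")))
```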
Before you can actually use any of the functions in the package, you have to “load” the package. For example, the following code loads the sas7bdat package.
library(sas7bdat)
(Why does installing the package require quotes, but loading it doesn’t? I didn’t make the rules, sorry.)
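For what it’s worth, library() is the flexible one here: it accepts the name with or without quotes, while install.packages() needs a quoted string. The sketch below uses the stats package, which ships with R, purely so that it runs without installing anything; the variable pkg is an illustrative name.

```r
library(stats)     # unquoted name
library("stats")   # quoted name works too

# If the package name is stored in a variable, tell library() to treat
# the argument as a character string
pkg <- "stats"
library(pkg, character.only = TRUE)

# Confirm the package is on the search path
"package:stats" %in% search()
```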
Yes, you have to do that every time you start R if you want to use the sas7bdat functions. The safest way to handle this is to load all of your packages in a list at the beginning of your R script. For example, given the functions I use the most, I quite frequently have something that looks like this at the top of my scripts.
library(ggplot2)
library(dplyr)
library(reshape2)
library(haven)
That way, I already know that I can use any of the functions from those four packages in my code. And if there’s another one I need, I go ahead and add it to that list.
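As an aside, if you only need a single function from an installed package, you can call it as package::function without loading the whole package first. The sketch below uses file_ext() from the tools package, which ships with R but is not loaded by default; the file name is the one used earlier in this tutorial.

```r
# Call file_ext() from the tools package without calling library(tools)
ext <- tools::file_ext("apidata_2012_so.csv")
ext
```

Loading packages at the top of the script is still the clearer habit when you use many functions from the same package.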
It’s particularly important that we get to this now, because some of the more advanced methods we will be using soon will require outside packages.
For fun, try to Google a package that will allow you to read in SPSS files and install it.
Run:
Now that we have that down, I think it’s about time that we return to actual data analysis: http://rpubs.com/rslbliss/r_logistic_ws