1 About R

R is used by many companies, including Google, Microsoft, Oracle and SAP. The R Consortium is a group of companies that actively supports the further development of the language. More about R can be found here.

2 Setting up your system

2.1 Programs

Download and install R and RStudio. If you are on Windows, also install Rtools.

R is the statistical programming language, RStudio is the editor in which you write your programs, and Rtools is the toolchain that Windows needs to build R packages from source.

2.2 Libraries

There are over 7000 packages (libraries) for R on CRAN. Go to the CRAN Task Views for a theme-oriented overview of the available packages.

Standard (base) R is not always easy or fast to work with. Luckily, there are many packages that make life easier.

Here is how you install them (make sure you have an internet connection):

mylibraries = c(
  "data.table",  # fast and compact data processing
  "stringr",     # work with character strings
  "ggplot2",     # comprehensive plotting environment
  "zoo",         # work with time series
  "lubridate",   # work with dates
  "DescTools",   # descriptive analytics
  "readxl"       # read Excel files
)

# install.packages(mylibraries, dependencies = TRUE)

To run the install command, remove the “#” (comment) sign in front of it. Alternatively, you can install packages in RStudio via the menu: Tools > Install Packages... (…type the names of the packages…).
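Running the install command again reinstalls packages that are already present. A minimal sketch of one way around that is to install only what is missing. The helper name missing_packages is hypothetical (it is not part of R), and the install call is left commented out, as above.

```r
# Hypothetical helper: which of the wanted packages are not installed yet?
missing_packages = function(wanted, installed = rownames(installed.packages())) {
  setdiff(wanted, installed)
}

mylibraries = c("data.table", "stringr", "ggplot2", "zoo",
                "lubridate", "DescTools", "readxl")

to_install = missing_packages(mylibraries)
print(to_install)   # names of the packages that still need installing

# Remove the "#" to actually install them:
# if (length(to_install) > 0) install.packages(to_install, dependencies = TRUE)
```

If to_install is empty, everything in the list is already on your system.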

Once the packages are installed, you have to load them in every new R session before you can work with them. Here is how you do that:

library(data.table)
library(stringr)
library(ggplot2)
library(zoo)
library(lubridate)
library(DescTools)

3 Starting with R

3.1 Opening R

Open RStudio. This also starts R.

Open a script file by clicking the “+” sign under “File” in the upper left corner.

You now have a complete R development environment:

  • upper left corner is where you write your programs (scripts)
  • lower left corner is the command line of R
  • right side is where you can find tabs for the environment (loaded variables), the history of commands, help, plot views and others.

3.2 Loading and examining a dataset

R comes with some datasets. We are going to use mtcars.

In the script window (upper left), type the following:

data(mtcars)         # loads mtcars dataset

View(mtcars)         # view the data in a separate window

dim(mtcars)          # dimensions: rows x columns

str(mtcars)          # structure: name and type of variable

names(mtcars)        # column (variable) names of the dataset

head(mtcars)         # first few records of the dataset

summary(mtcars)      # statistical summary

?mtcars              # help about a dataset or command

getwd()              # which directory are you in?

dir()                # what are the files in that directory?

ls()                 # list the loaded objects in your current R session

#                    # comment: everything after this sign is ignored by R

1:5                  # prints the numbers 1 to 5

sum(1:5)             # sums the numbers 1 to 5

c(1,2,3)             # c = "combine" = make a vector with the numbers 1, 2 and 3

x = c(1,2,3)         # assign the vector to the name x

You can run the whole script by clicking “Source” in the upper right corner of the script window. It is also possible to run the commands one by one: put your cursor anywhere on the line you want to run and press Ctrl+Enter (Cmd+Return on macOS).

Note that the View window is quite powerful: you can open it in its own window, and you can filter, search and sort the data.
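A few more commands in the same spirit, as a small sketch. These are additions to the list above and only use functions that ship with base R. The name total is just an example.

```r
data(mtcars)

nrow(mtcars)                 # number of rows (cars): 32
ncol(mtcars)                 # number of columns (variables): 11
sapply(mtcars, class)        # type of every column
tail(mtcars, 3)              # last 3 records of the dataset
rownames(mtcars)[1:3]        # the car names are stored as row names

total = sum(1:5)             # store a result under a name ...
total                        # ... and print it: 15
```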

4 Doing things with data.table

data.table is a fast and compact “mini language” for dealing with data (“data wrangling”) that resembles SQL.

It works like this:
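The general form of a data.table query is DT[i, j, by]: i selects rows, j selects or computes columns, and by defines groups. The sketch below illustrates this. The names cars and res are assumptions for illustration, and as.data.table is used to make a separate copy so that the mtcars object used in the rest of this text stays untouched.

```r
library(data.table)

# Work on a copy; the car names end up in a column called "rn"
cars = as.data.table(datasets::mtcars, keep.rownames = TRUE)

# DT[i, j, by]  ~  SQL: WHERE i, SELECT j, GROUP BY by
res = cars[hp > 100,                    # i : which rows
           .(avg_mpg = mean(mpg)),      # j : what to compute
           by = cyl]                    # by: per group
print(res)
```

Each of the three parts is optional, which is why the examples in the following sections often leave one or two of them empty.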

4.1 Setting to data.table

If a (text, csv) file is read with data.table’s fread command, the result is already a data.table object. If not, you have to convert the object yourself with setDT.

There are 2 cases:

  • The object contains named rows. This is the case for the mtcars object: you can see it from the fact that the column holding the car names has no header.
  • The object contains no named rows.

setDT(mtcars, keep.rownames = TRUE)  # convert to data.table and keep the row names in a column called "rn"
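Both cases can be sketched as follows. To leave mtcars alone, this sketch uses as.data.table, which returns a converted copy instead of converting in place. The names cars and flowers, and the choice of the built-in iris dataset as an example of an object without named rows, are assumptions for illustration.

```r
library(data.table)

# Case 1: named rows -> keep them; they end up in a column called "rn"
cars = as.data.table(datasets::mtcars, keep.rownames = TRUE)
names(cars)[1]        # "rn"
head(cars$rn, 3)      # the first three car names

# Case 2: no named rows -> there is nothing to keep
flowers = as.data.table(datasets::iris)
names(flowers)

is.data.table(cars)   # TRUE
```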

4.2 Do something with rows: filtering and selecting

The operations that you typically perform on rows are filtering and selecting.

mtcars[1,]                      # get the first row
mtcars[1:3,]                    # get the first 3 rows

mtcars[rn == "Duster 360",]     # get a specific row, method 1
mtcars[rn %in% "Duster 360"]    # get a specific row, method 2
mtcars[rn %like% "Merc"]        # get rows with an approximate value, in this case "Merc"
mtcars[!1,]                     # exclude the first row
mtcars[!1:30,]                  # exclude the first 30 rows
mtcars[rn != "Duster 360"]      # exclude a specific row
mtcars[!rn %like% "Merc"]       # exclude rows with approximate values

mtcars[mpg %between% c(19, 20)] # find rows with a value within an interval (bounds included), in this case: mpg between 19 and 20
mtcars[mpg > 30,]               # find rows with a value larger than ...
mtcars[mpg > 30 & hp > 100,]    # combine search criteria
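For comparison, here is a small sketch that runs the same filter in data.table and in base R and checks that both pick the same cars. It works on a separate copy; the names cars, df, dt_names and base_names are assumptions for illustration.

```r
library(data.table)
cars = as.data.table(datasets::mtcars, keep.rownames = TRUE)

# data.table: column names can be used directly inside [ ]
dt_names = cars[mpg > 30 & hp > 100, rn]

# base R equivalent: the data frame has to be named every time
df = datasets::mtcars
base_names = rownames(df)[df$mpg > 30 & df$hp > 100]

identical(dt_names, base_names)      # TRUE: both select the same cars

# Sorting also goes in the row position
cars[order(-mpg)][1:3, .(rn, mpg)]   # the 3 cars with the highest mpg
```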

4.3 Do something with columns: selecting, creating new variables

# mtcars[, .(mpg, cyl)]                   # select columns, method 1
# mtcars[, c("mpg", "cyl"), with = FALSE] # select columns, method 2
# mtcars[, c(2, 3), with = FALSE]         # select columns, method 3 (discouraged: less transparent)

# mtcars[, c("mpg", "cyl") := NULL]       # remove columns, method 1 (changes mtcars itself)
# mtcars[, !c("mpg", "cyl"), with = FALSE] # remove columns, method 2 (returns a new object)
 
mtcars[, hp_wt := hp/wt]                # new variable: calculate horsepower per weight (in this case: 1000 lbs)

mtcars[, `:=` (                         # calculate several new variables at once
  hp_wt = hp/wt,
  mpg_wt = mpg/wt
)]

mtcars[rn %like% "Merc", .N]            # ".N" holds the number of rows (the "count") in data.table
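Note that := changes the data.table in place (by reference) rather than returning a changed copy. Running the commented-out removal with := NULL would therefore permanently drop mpg and cyl from mtcars. A small sketch of what that means, using copy() from data.table to keep an untouched version. The names cars and backup are assumptions for illustration.

```r
library(data.table)
cars = as.data.table(datasets::mtcars, keep.rownames = TRUE)

backup = copy(cars)                 # a real, independent copy

cars[, hp_wt := hp / wt]            # adds a column to cars itself
cars[, c("mpg", "cyl") := NULL]     # removes two columns from cars itself

"hp_wt" %in% names(cars)            # TRUE
"mpg"   %in% names(cars)            # FALSE: the column is gone
"mpg"   %in% names(backup)          # TRUE: the copy is unaffected
```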

4.4 Grouping operations

mtcars[, mean(mpg), by = cyl]           # calculate average mpg per cylinder
mtcars[, .(avg = mean(mpg)), by = cyl]  # ditto, give name to calculated variable

mtcars[rn %like% "Merc", .N, by = cyl]  # group a selection by something, in this case count "Merc" cars by cylinder
# all together now: filter rows, select columns and group the result
mtcars[rn %like% "Merc" & gear >= 3 & cyl == 4, .(rn, mpg, gear), by = cyl]
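Grouping becomes more useful when several summaries are calculated at once. The following sketch works on a separate copy. The names cars and per_cyl are assumptions for illustration, and keyby is used instead of by so that the result comes back sorted by the grouping column.

```r
library(data.table)
cars = as.data.table(datasets::mtcars, keep.rownames = TRUE)

# Several summaries per group; keyby sorts the result by cyl
per_cyl = cars[, .(n = .N, avg_mpg = mean(mpg), max_hp = max(hp)), keyby = cyl]
print(per_cyl)

# Group by more than one column
cars[, .N, by = .(cyl, gear)]
```

The n column shows how the 32 cars are split over the groups: 11 with 4 cylinders, 7 with 6 and 14 with 8.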