1 Why R?

R has a number of advantages over other statistical software tools (e.g. SAS, SPSS, Stata):

Open-source (It is free!!!)
Cross-platform (Windows, Mac, Linux)
Updated regularly
Extremely flexible and can do, or be made to do, just about anything
Amazing graphical capabilities

2 Creating a new project directory in RStudio

Let’s create a new project directory for our “Introduction to R” lesson today.

Open RStudio
Go to the File menu and select New Project.
In the New Project window, choose New Directory. Then, choose Empty Project. Name your new directory Intro-to-R and then “Create the project as subdirectory of:” the Desktop (or location of your choice).
Click on Create Project.
After your project is created, if the project does not automatically open in RStudio, then go to the File menu, select Open Project, and choose Intro-to-R.Rproj.
When RStudio opens, you will see three panels in the window.
Go to the File menu and select New File, and select R Script.

2.1 RStudio Interface

The RStudio interface has four main panels:

Console: where you can type commands and see output. The console is all you would see if you ran R in the command line without RStudio.
Script editor: where you can type out commands and save to file. You can also submit the commands to run in the console.
Environment/History: environment shows all active objects and history keeps track of all commands run in console
Files/Plots/Packages/Help

2.2 Organizing your working directory & setting up

2.2.1 Viewing your working directory

Before we organize our working directory, let’s check to see where our current working directory is located by typing into the console:

getwd()

If you wanted to choose a different directory to be your working directory, you could navigate to a different folder in the Files tab, then click on the More dropdown menu and select Set As Working Directory, or type into the console:

setwd(path)
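
As a minimal sketch of how the two functions work together, the code below switches the working directory and then restores it. It uses tempdir() as a stand-in target folder, which is an assumption for illustration; substitute any folder on your machine.

```r
old_wd <- getwd()    # remember the current working directory
setwd(tempdir())     # switch to R's per-session temporary folder
getwd()              # now reports the temporary folder
setwd(old_wd)        # switch back to where we started
```

Storing the old path first makes it easy to return to the project folder afterwards.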

2.2.2 Console command prompt

Interpreting the command prompt helps you understand when R is ready to accept commands. The different states of the command prompt, and how to exit a command, are listed below:

Console is ready to accept commands: >.

If R is ready to accept commands, the R console shows a > prompt.

When the console receives a command (typed directly into the console or run from the script editor with Ctrl-Enter), R will try to execute it.

After running, the console will show the results and come back with a new > prompt to wait for new commands.

Console is waiting for you to enter more data: +.

If R is still waiting for you to enter more data because it isn’t complete yet, the console will show a + prompt. It means that you haven’t finished entering a complete command. Often this can be due to you having not ‘closed’ a parenthesis or quotation.

Escaping a command and getting a new prompt: esc

If you’re in RStudio and you can’t figure out why your command isn’t running, you can click inside the console window and press esc to escape the command and bring back a new prompt >.

3 Simple computations with R

R can be used as a calculator. You can just type your equation and execute the command:

2+2

1+2*3-4/5

(19465*0.25)^23

5%%2

Example: Addition of two values

3 + 6

## [1] 9

# The output is always preceded by a number between brackets: [1]
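
To make the results of expressions like the ones above concrete, here is a short sketch of operator precedence and the remainder operator. The integer division operator %/% does not appear in the text above and is included only for comparison.

```r
1 + 2 * 3 - 4 / 5   # multiplication and division are evaluated first: 1 + 6 - 0.8
## [1] 6.2
(1 + 2) * 3         # parentheses change the order of evaluation
## [1] 9
5 %% 2              # remainder of 5 divided by 2
## [1] 1
5 %/% 2             # integer division of 5 by 2
## [1] 2
```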

3.1 Math functions

log(x) Natural log.
sum(x) Sum.
exp(x) Exponential.
mean(x) Mean.
max(x) Largest element.
median(x) Median.
min(x) Smallest element.
quantile(x) Percentage quantiles.
round(x, n) Round to n decimal places.
rank(x) Rank of elements.
var(x) The variance.
cor(x,y) Correlation.
sd(x) The standard deviation.
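
The functions listed above can be tried on a small vector. This is only a sketch; the vector v and its values are chosen for illustration so that the results are easy to check by hand.

```r
v <- c(2, 4, 4, 5, 10)   # example values (hypothetical data)

sum(v)             # 25
mean(v)            # 5
median(v)          # 4
min(v)             # 2
max(v)             # 10
var(v)             # 9 (sample variance, divides by n - 1)
sd(v)              # 3 (square root of the variance)
rank(v)            # 1.0 2.5 2.5 4.0 5.0 (ties share the average rank)
round(exp(1), 3)   # 2.718
quantile(v)        # 0%, 25%, 50%, 75% and 100% quantiles
```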

4 Variable Assignment

You can assign a number to a name.

x <- 3

Now “x” is called a variable and it appears in the workspace window, which means that R stores the value of “x” in its memory and it can be used later.

In general, by using the <-, you can assign a value to an object

If you type the name of a variable, its current value is printed:

x

## [1] 3

There are variables that are already defined in R, like variable “pi”

pi

## [1] 3.141593

Calculating the circumference of a circle with radius 3

2 * pi * x

## [1] 18.84956

Changing the value of radius and reusing the code

x <- 5
2 * pi * x

## [1] 31.41593

Remarks

R is case sensitive

A <- 33
a <- 44
A

## [1] 33

a

## [1] 44

The tag # indicates a comment

5 Basic Data types

Numeric data: 1, 2, 3

x <- c(1, 2, 3); 
x

## [1] 1 2 3

is.numeric(x)

## [1] TRUE

Character data: “a”, “b”, “c”

x <- c("1", "2", "3"); x

## [1] "1" "2" "3"

is.character(x)

## [1] TRUE

Logical data

x <- 1:10 < 5
x

##  [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE

!x

##  [1] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE

which(x) # Returns index for the 'TRUE' values in logical vector

## [1] 1 2 3 4
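
Logical vectors are most often used to pick out or count elements. A small sketch follows; the names v and small are chosen for illustration.

```r
v <- 1:10
small <- v < 5      # logical vector: TRUE where the condition holds

v[small]            # keep only the elements where 'small' is TRUE
## [1] 1 2 3 4
sum(small)          # TRUE counts as 1, so this counts the matches
## [1] 4
any(v > 9)          # is at least one element greater than 9?
## [1] TRUE
all(v > 0)          # are all elements greater than 0?
## [1] TRUE
```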

Factor : Character strings with preset levels. (Needed for some statistical models).

factor(c("1", "0", "1", "0", "1"))

## [1] 1 0 1 0 1
## Levels: 0 1
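
A factor stores its values as integer codes that point to the levels, which matters when converting back to numbers. The sketch below reuses the factor from the example above; the variable name f is an assumption.

```r
f <- factor(c("1", "0", "1", "0", "1"))

levels(f)                      # the preset levels, sorted
## [1] "0" "1"
table(f)                       # number of observations per level
as.integer(f)                  # internal level codes, not the original values
## [1] 2 1 2 1 2
as.numeric(as.character(f))    # recover the original numbers
## [1] 1 0 1 0 1
```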

6 Basic Structures

Vectors (1D)

myVec <- 1:10; names(myVec) <- letters[1:10]
class(myVec)

## [1] "integer"

Matrices (2D): two dimensional structures with data of same type

myMA <- matrix(1:30, 3, 10, byrow = TRUE)
class(myMA)

## [1] "matrix" "array"

Data Frames (2D): two dimensional structures with variable data types

myDF <- data.frame(Col1=1:10, Col2=10:1)
class(myDF)

## [1] "data.frame"

Lists: containers for any object type

myL <- list(name="Fred", wife="Mary", no.children=3, child.ages=c(4,7,9))
class(myL)

## [1] "list"
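
Each of these structures has its own way of accessing elements. The following sketch rebuilds the four objects defined above and retrieves one value from each.

```r
myVec <- 1:10; names(myVec) <- letters[1:10]
myMA  <- matrix(1:30, 3, 10, byrow = TRUE)
myDF  <- data.frame(Col1 = 1:10, Col2 = 10:1)
myL   <- list(name = "Fred", wife = "Mary", no.children = 3, child.ages = c(4, 7, 9))

myVec[3]            # vector element by position
myVec["c"]          # the same element by name
myMA[2, 3]          # matrix element: row 2, column 3
## [1] 13
myDF$Col2[1]        # first value of column Col2
## [1] 10
myDF[1:2, "Col1"]   # rows 1 and 2 of column Col1
## [1] 1 2
myL$child.ages[2]   # second element of a list component
## [1] 7
```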

7 The R environment

ls() List all variables in the environment.
rm(x) Remove x from the environment.
rm(list = ls()) Remove all variables from the environment.

Remarks You can use the environment panel in RStudio to browse variables in your environment.
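
A short sketch of these functions in action. The object names tmp_a and tmp_b are hypothetical and used only for this example.

```r
tmp_a <- 1
tmp_b <- "two"

"tmp_a" %in% ls()    # TRUE: tmp_a is listed in the environment
rm(tmp_a)            # remove a single object
"tmp_a" %in% ls()    # FALSE: tmp_a is gone
exists("tmp_b")      # TRUE: tmp_b is untouched
```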

8 Using scripts

Instead of working directly in the R console, it is usually more convenient to use R scripts.

An R script is a text file where you can type the commands that you want to execute in R.

Scripts have file names with the extension .R, for instance, “myscript.R”. The R script is where you keep a record of your work. Using R scripts is very convenient because all R commands used in a session are saved in the script file and can be executed again in a future session.

To create a new R script go to

File -> New File -> R Script

To open an existing R script go to

File -> Open File -> select your script

If you want to run a line from the script window, place the cursor anywhere on the line and click Run, or press CTRL+ENTER if you are using Windows/Linux or Command+Enter if you are using a Mac.

You can execute part of a script (or the whole script) by selecting the corresponding lines and pressing Run or CTRL+ENTER or Command+Enter.

You can also execute the whole script by using the R function source( )

source("scriptname.R")
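
To see source() at work without creating a file by hand, the sketch below writes a two-line script to a temporary folder and then runs it. The file name and script content are assumptions made for illustration.

```r
# Write a two-line script to a temporary file (hypothetical content)
script_path <- file.path(tempdir(), "myscript.R")
writeLines(c("radius <- 3",
             "circumference <- 2 * pi * radius"),
           script_path)

source(script_path)   # run every line of the script
circumference         # objects created by the script are now available
## [1] 18.84956
```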

9 Basic base functions

Most tasks in R involve functions. R comes with a large set of pre-installed functions. These functions are installed as part of the base package, which is located in your ‘library’ directory.

An R function is used by typing its name followed by its arguments (also called parameters) between parentheses.

Example: seq( ) is a function for generating a sequence of numbers. Its first argument, from, specifies the first number of the sequence; its second, to, the last number of the sequence; and its third, by, the increment of the sequence.

seq(10,80, 2) # generates a sequence from 10 to 80 with increment 2

##  [1] 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52 54 56 58
## [26] 60 62 64 66 68 70 72 74 76 78 80

The number between brackets at the beginning of each line of the output indicates the position of the first element of that line: 10 is the 1st element of the output and 60 is the 26th element of the output.
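
Arguments can also be passed by name, which makes calls easier to read. The sketch below repeats the seq() call with named arguments; length.out is a further argument of seq() that is not used in the text above.

```r
s <- seq(from = 10, to = 80, by = 2)   # same call with named arguments
length(s)                              # number of elements generated
## [1] 36
s[26]                                  # the element labelled [26] in the output
## [1] 60
seq(to = 80, by = 2, from = 10)        # named arguments can be given in any order
seq(0, 1, length.out = 5)              # ask for 5 equally spaced values instead of an increment
## [1] 0.00 0.25 0.50 0.75 1.00
```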

R treats all functions like ‘objects’.
All functions have names and take arguments in parentheses: ‘function()’

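For example, a function typically takes an input and gives an output. The function ‘print()’ displays its input in the console: a character string in the first call below and, because functions are objects, the function exp itself in the second.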

print("Hello world")
print(exp)

10 Reading and writing data files

read.table( ): this function reads a text data file and stores the values into a data frame

header=T : first row of the data file contains the names of the columns or variables
sep="" : the values of each row are separated by white space
sep="\t" : the values of each row are separated by a tab
sep="," : the values of each row are separated by a comma
dec="." : the decimal symbol is a point

Example

example <- read.table("treatment.txt", header=F, sep="")
# This instruction reads file "treatment.txt" and creates the dataframe "example"

write.table( ): this function writes a data frame into a text file

Example

write.table(treatment, file="treatment.txt", row.names=FALSE)
# This instruction writes the dataframe "treatment" in file "treatment.txt"
# row.names=FALSE prevents R from printing the names of the rows (or just the row numbers) in the output file
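
The two functions can be checked together with a round trip: write a data frame to a file, read it back, and compare. This is a sketch under stated assumptions; the small data frame stands in for "treatment", and the file is written to a temporary folder.

```r
# A small hypothetical data frame standing in for "treatment"
treatment <- data.frame(id = 1:3, dose = c(0.5, 1, 2), group = c("a", "b", "a"),
                        stringsAsFactors = FALSE)

out_file <- file.path(tempdir(), "treatment.txt")
write.table(treatment, file = out_file, row.names = FALSE)   # column names are written by default

# header = TRUE because the first row of the file holds the column names
example <- read.table(out_file, header = TRUE, sep = "", stringsAsFactors = FALSE)
str(example)
```

Because the file starts with the column names, reading it with header=FALSE would treat that row as data.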

11 Libraries/Packages

R comes with a standard set of packages. Others are available for download and installation.

Standard packages: The standard (or base) packages are considered part of the R source code. They contain the basic functions that allow R to work, and the datasets and standard statistical and graphical functions. They should be automatically available in any R installation.

Contributed packages: There are thousands of contributed packages for R, written by many different authors. Some of these packages implement specialized statistical methods. Some (the recommended packages) are distributed with every binary distribution of R.

Most packages are available for download from CRAN and other repositories such as Bioconductor, a large repository of tools for the analysis and comprehension of high-throughput genomic data.

Once installed, a package has to be loaded into the session to be used.

11.1 Install packages from CRAN

Using an R package or library for the first time requires two steps: installing the library and loading the library with the following functions:

install.packages(): Install the package

library(): load the package

Example: How to install and load the library “survival”?

install.packages("survival") # you only need to do this once
library(survival) # load library
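
Since installation is only needed once, a script can check whether a package is already available before loading it. A minimal sketch using requireNamespace() from base R; the variable names pkg and installed are assumptions.

```r
pkg <- "survival"

# requireNamespace() returns TRUE if the package can be loaded, FALSE otherwise
installed <- requireNamespace(pkg, quietly = TRUE)

if (installed) {
  library(pkg, character.only = TRUE)   # character.only is needed because pkg is a string
} else {
  message("Package '", pkg, "' is not installed. Run install.packages(\"", pkg, "\") first.")
}
```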

11.2 Install packages from GitHub

To install packages from GitHub, make sure you have the newest version of the devtools package. To do so, run:

install.packages("devtools")

Then you have two options:

Use the install_github() function from devtools, giving the repository as "username/repository":

devtools::install_github("jfpalomeque/Momocs")

Alternatively, download the package source as a zip file and install it with the normal install.packages() function:

install.packages(file_name_and_path, repos = NULL, type="source")

12 Help and documentation

help( ) or ? : provides information about a function or an object

Example:

help(mean)
?mean

help.search( ): provides information about a topic

Example:

help.search("logistic regression")

13 Basic data management

13.1 Creating new variables

In a typical research project, you’ll need to create new variables and transform existing ones. This is accomplished with statements of the form:

variable <- expression

Example

Let’s say that you have a data frame named mydata, with variables x1 and x2,

mydata <- data.frame(x1=runif(100), x2=runif(100))

and you want to create a new variable sumx that adds these two variables and a new variable called meanx that averages the two variables. If you use the code

sumx <- x1 + x2
meanx <- (x1 + x2)/2

you’ll get an error, because R doesn’t know that x1 and x2 are from data frame mydata.

If you use this code instead, the statements will succeed but you’ll end up with a data frame (mydata) and two separate vectors (sumx and meanx).

sumx <- mydata$x1 + mydata$x2
meanx <- (mydata$x1 + mydata$x2)/2

Ultimately, you want to incorporate new variables into the original data frame:

mydata<-data.frame(x1 = c(2, 2, 6, 4), x2 = c(3, 4, 2, 8))
mydata$sumx <- mydata$x1 + mydata$x2
mydata$meanx <- (mydata$x1 + mydata$x2)/2
head(mydata)

##   x1 x2 sumx meanx
## 1  2  3    5   2.5
## 2  2  4    6   3.0
## 3  6  2    8   4.0
## 4  4  8   12   6.0
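
Base R's transform() function is another way to get the same result without repeating the mydata$ prefix. A minimal sketch using the same four-row data frame:

```r
mydata <- data.frame(x1 = c(2, 2, 6, 4), x2 = c(3, 4, 2, 8))

# transform() evaluates the expressions inside the data frame,
# so x1 and x2 can be used without the mydata$ prefix
mydata <- transform(mydata,
                    sumx  = x1 + x2,
                    meanx = (x1 + x2) / 2)
mydata
```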

13.2 Recoding variables

Recoding involves creating new values of a variable conditional on the existing values of the same and/or other variables. For example, you may want to:

Change a continuous variable into a set of categories.
Create a pass/fail variable based on a set of cutoff scores.
Replace miscoded values with correct values.

set.seed(123456)
rdatos <- data.frame(ID=seq(1,1000,1), Age=rbinom(n=1000,105, 0.55))
rdatos$Agecat[rdatos$Age > 70] <- "Elder"
rdatos$Agecat[rdatos$Age  >= 50 & rdatos$Age <= 70] <- "Middle Aged"
rdatos$Agecat[rdatos$Age < 50] <- "Young"
head(rdatos)

##   ID Age      Agecat
## 1  1  53 Middle Aged
## 2  2  53 Middle Aged
## 3  3  63 Middle Aged
## 4  4  52 Middle Aged
## 5  5  58 Middle Aged
## 6  6  56 Middle Aged

Using the within function:

set.seed(123456)
rdatos <- data.frame(ID=seq(1,1000,1), Age=rbinom(n=1000,105, 0.55), Gender=rbinom(n=1000, 1, 0.5))
rdatos <- within(rdatos,{
Agecat <- NA
Agecat[Age > 70] <- "Elder"
Agecat[Age >= 50 & Age <= 70] <- "Middle Aged"
Agecat[Age < 50] <- "Young" })


rdatos <- within(rdatos,{
  GenderRN <- NA
  GenderRN[Gender==0] <- "Female"
  GenderRN[Gender==1] <- "Male"})

head(rdatos)

##   ID Age Gender      Agecat GenderRN
## 1  1  53      0 Middle Aged   Female
## 2  2  53      1 Middle Aged     Male
## 3  3  63      1 Middle Aged     Male
## 4  4  52      1 Middle Aged     Male
## 5  5  58      0 Middle Aged   Female
## 6  6  56      1 Middle Aged     Male

Several packages offer useful recoding functions; in particular, the car package’s recode() function recodes numeric and character vectors and factors very simply.
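
Base R also offers cut() for turning a numeric variable into categories. The sketch below assumes whole-number ages and the same three categories as above; note that cut() returns a factor rather than a character vector. The vector age is hypothetical data.

```r
age <- c(45, 50, 70, 71)   # hypothetical ages covering each boundary

# Intervals are (-Inf, 49], (49, 70] and (70, Inf]; for whole-number ages this
# matches the rules above: < 50, 50 to 70, and > 70
agecat <- cut(age,
              breaks = c(-Inf, 49, 70, Inf),
              labels = c("Young", "Middle Aged", "Elder"))
as.character(agecat)
## [1] "Young"       "Middle Aged" "Middle Aged" "Elder"
```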

13.3 Renaming variables

You can change variable names. Let’s say that you want to change the variables Age to Age0 and ID to managerID. You can use the rename() function from the reshape package. The format of the rename() function is

rename(dataframe, c(oldname="newname", oldname="newname", ...))

Example

library(reshape)
rdatos <- rename(rdatos, c(ID="managerID", Age="Age0"))
names(rdatos)

## [1] "managerID" "Age0"      "Gender"    "Agecat"    "GenderRN"
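
If the reshape package is not available, the same renaming can be done with base R by assigning to names(). A minimal sketch on a small stand-in data frame (df is a hypothetical name):

```r
df <- data.frame(ID = 1:3, Age = c(53, 63, 52))   # small stand-in for rdatos

names(df)[names(df) == "ID"]  <- "managerID"   # rename by matching the old name
names(df)[names(df) == "Age"] <- "Age0"
names(df)
## [1] "managerID" "Age0"
```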

13.4 Missing values

Data sets are likely to be incomplete because of missed questions, faulty equipment, or improperly coded data. In R, missing values are represented by the symbol NA (not available). Impossible values (for example, 0 divided by 0) are represented by the symbol NaN (not a number).

R provides a number of functions for identifying observations that contain missing values. The function is.na() allows you to test for the presence of missing values.

Example

Assume that you have a vector:

y <- c(1, 2, 3, NA)
y

## [1]  1  2  3 NA

then the function is.na(y) returns c(FALSE, FALSE, FALSE, TRUE).

You can use assignments to recode values to missing. For example, suppose missing ages had been coded as 99 in the data. Before analyzing the data, you must let R know that the value 99 means missing (otherwise the mean age of the sample will be way off!). You can accomplish this by recoding the variable:

rdatos$Age0[rdatos$Age0 == 99] <- NA

Any value of Age0 that’s equal to 99 is changed to NA.
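
The effect of such a recoding on a summary statistic can be seen on a small vector. This is a sketch with hypothetical ages; na.rm = TRUE tells mean() to skip the missing values.

```r
age <- c(34, 99, 51, 99)   # hypothetical ages; 99 stands for "missing"

mean(age)                  # misleading: the 99s are treated as real ages
## [1] 70.75
age[age == 99] <- NA       # recode 99 to missing
is.na(age)
## [1] FALSE  TRUE FALSE  TRUE
mean(age, na.rm = TRUE)    # mean of the two observed ages
## [1] 42.5
```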

13.4.1 Excluding missing values from analyses

Once you’ve identified the missing values, you need to eliminate them in some way before analyzing your data further. The reason is that arithmetic expressions and functions that contain missing values yield missing values. For example, consider the following code:

x <- c(1, 2, NA, 3)
y <- x[1] + x[2] + x[3] + x[4]
z <- sum(x)
z

## [1] NA

Both y and z will be NA (missing) because the third element of x is missing. Luckily, most numeric functions have a na.rm=TRUE option that removes missing values prior to calculations and applies the function to the remaining values:

x <- c(1, 2, NA, 3)
y <- sum(x, na.rm=TRUE)
y

## [1] 6
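
For data frames, base R provides complete.cases() and na.omit() to find or drop rows that contain missing values. Neither function is used in the text above; the sketch below applies them to a small hypothetical data frame.

```r
df <- data.frame(id = 1:4, score = c(10, NA, 30, 40), age = c(25, 31, NA, 47))

complete.cases(df)      # TRUE for rows without any missing value
## [1]  TRUE FALSE FALSE  TRUE
clean <- na.omit(df)    # keep only the complete rows
nrow(clean)
## [1] 2
colSums(is.na(df))      # number of missing values per column
```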

13.5 Sorting data

Sometimes, viewing a dataset in a sorted order can tell you quite a bit about the data. To sort a data frame in R, use the order() function.

Example The statement:

newdata <- rdatos[order(rdatos$Age0),]
newdata[1:10,]

##     managerID Age0 Gender Agecat GenderRN
## 38         38   40      1  Young     Male
## 725       725   43      1  Young     Male
## 324       324   44      1  Young     Male
## 485       485   44      1  Young     Male
## 761       761   44      1  Young     Male
## 246       246   45      0  Young   Female
## 23         23   46      1  Young     Male
## 220       220   46      1  Young     Male
## 376       376   46      0  Young   Female
## 895       895   46      1  Young     Male

creates a new dataset containing rows sorted from youngest to oldest.

The statement:

newdata <- rdatos[order(rdatos$GenderRN, rdatos$Age0),]
newdata[1:10,]

##     managerID Age0 Gender Agecat GenderRN
## 246       246   45      0  Young   Female
## 376       376   46      0  Young   Female
## 153       153   47      0  Young   Female
## 191       191   47      0  Young   Female
## 304       304   47      0  Young   Female
## 583       583   47      0  Young   Female
## 629       629   47      0  Young   Female
## 793       793   47      0  Young   Female
## 331       331   48      0  Young   Female
## 348       348   48      0  Young   Female

sorts the rows into female followed by male, and youngest to oldest within each gender.

Also, we can sort the rows by gender, and then from oldest to youngest manager within each gender.

newdata <-rdatos[order(rdatos$GenderRN, -rdatos$Age0),]
newdata[1:10,]

##     managerID Age0 Gender      Agecat GenderRN
## 9           9   72      0       Elder   Female
## 615       615   72      0       Elder   Female
## 988       988   72      0       Elder   Female
## 25         25   71      0       Elder   Female
## 70         70   71      0       Elder   Female
## 129       129   71      0       Elder   Female
## 662       662   71      0       Elder   Female
## 631       631   70      0 Middle Aged   Female
## 798       798   69      0 Middle Aged   Female
## 946       946   69      0 Middle Aged   Female
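
The minus sign used above only works for numeric columns. The sketch below, on a small hypothetical data frame, shows that approach next to the decreasing argument of order(), which with method = "radix" accepts one flag per sort key.

```r
df <- data.frame(name   = c("a", "b", "c", "d"),
                 gender = c("Female", "Male", "Female", "Male"),
                 age    = c(50, 40, 60, 45),
                 stringsAsFactors = FALSE)

# Gender ascending, age descending (minus sign on a numeric column)
df[order(df$gender, -df$age), "name"]
## [1] "c" "a" "d" "b"

# Gender descending, age ascending: one decreasing flag per sort key,
# which also works when the descending key is a character column
df[order(df$gender, df$age, decreasing = c(TRUE, FALSE), method = "radix"), "name"]
## [1] "b" "d" "a" "c"
```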

13.6 Merging datasets

If your data exist in multiple locations, you’ll need to combine them before moving forward.

13.6.1 Adding columns

To merge two data frames (datasets) horizontally, you use the merge() function. In most cases, two data frames are joined by one or more common key variables.

Example:

dataframeA <- data.frame(ID= seq(1,100,1), ID2= seq(1,100,1), X1=runif(100), X2=rnorm(100), Country=rep(c("Spain", "EEUU", "Iceland"), times=c(50,25,25)))
dataframeB <- data.frame(ID= seq(1,100,1), Y1=rnorm(100), Y2=rnorm(100),  Country=rep(c("Spain", "EEUU", "Iceland"), times=c(30,45,25)))

To merge dataframeA and dataframeB by ID:

total <- merge(dataframeA, dataframeB, by="ID")

To merge dataframeA and dataframeB by ID2 and ID:

total <- merge(dataframeA, dataframeB, by.x="ID2", by.y="ID")

Similarly, to merge two data frames by ID and Country:

total <- merge(dataframeA, dataframeB, by=c("ID","Country"))
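
By default merge() keeps only the rows whose key appears in both data frames. The arguments all, all.x and all.y change that. A sketch on two small hypothetical data frames with partly overlapping IDs:

```r
a <- data.frame(ID = 1:3, X = c(10, 20, 30))
b <- data.frame(ID = 2:4, Y = c(200, 300, 400))

inner <- merge(a, b, by = "ID")                # only IDs present in both (2 and 3)
left  <- merge(a, b, by = "ID", all.x = TRUE)  # every row of a; Y is NA for ID 1
full  <- merge(a, b, by = "ID", all = TRUE)    # every row of both; gaps are filled with NA

nrow(inner); nrow(left); nrow(full)
## [1] 2
## [1] 3
## [1] 4
```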

If you’re joining two matrices or data frames horizontally and don’t need to specify a common key, you can use the cbind() function. The two objects must have the same number of rows, and the rows must be in the same order:

total <- cbind(dataframeA, dataframeB)

13.6.2 Adding rows

Vertical concatenation is typically used to add observations to a data frame. To join two data frames (datasets) vertically, use the rbind() function.

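total <- rbind(dataframeA[, c("ID", "Country")], dataframeB[, c("ID", "Country")])  # keep only the variables common to both

The two data frames must have the same variables, but they don’t have to be in the same order (which is why only the shared variables ID and Country are kept above). If dataframeA has variables that dataframeB doesn’t, then before joining them do one of the following: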

Delete the extra variables in dataframeA
Create the additional variables in dataframeB and set them to NA (missing)
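
The second option can be sketched as follows. The data frames a and b are hypothetical and kept small so that the result is easy to inspect.

```r
a <- data.frame(ID = 1:2, X = c(1.5, 2.5), Z = c("u", "v"), stringsAsFactors = FALSE)
b <- data.frame(ID = 3:4, X = c(3.5, 4.5))      # b has no Z column

b$Z <- NA                  # create the missing variable and set it to NA
total <- rbind(a, b)       # columns are matched by name
total
```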

13.7 Subsetting datasets

R has powerful indexing features for accessing the elements of an object. These features can be used to select and exclude variables, observations, or both.

13.7.1 Selecting (keeping) variables

It’s a common practice to create a new dataset from a limited number of variables chosen from a larger dataset. The elements of a data frame are accessed using the notation dataframe[row indices, column indices]. You can use this to select variables.

Example:

The statement:

newdata <- rdatos[, c(1:2)]
head(newdata)

##   managerID Age0
## 1         1   53
## 2         2   53
## 3         3   63
## 4         4   52
## 5         5   58
## 6         6   56

selects the first two variables from the rdatos data frame and saves them to the data frame newdata. Leaving the row indices blank (,) selects all the rows by default.

The statement

myvars <- c("managerID", "Age0")
newdata <-rdatos[myvars]
head(newdata)

##   managerID Age0
## 1         1   53
## 2         2   53
## 3         3   63
## 4         4   52
## 5         5   58
## 6         6   56

accomplishes the same variable selection. Here, variable names (in quotes) have been entered as column indices, thereby selecting the same columns.

13.7.2 Excluding (dropping) variables

There are many reasons to exclude variables. For example, if a variable has several missing values, you may want to drop it prior to further analyses. You could exclude variable Gender with the statement

myvars <- names(rdatos) %in% c("Gender")
newdata <- rdatos[!myvars]
head(newdata)

##   managerID Age0      Agecat GenderRN
## 1         1   53 Middle Aged   Female
## 2         2   53 Middle Aged     Male
## 3         3   63 Middle Aged     Male
## 4         4   52 Middle Aged     Male
## 5         5   58 Middle Aged   Female
## 6         6   56 Middle Aged     Male

Knowing that Gender is the 3rd variable, you could exclude it with the statement

newdata <- rdatos[,-3]
head(newdata)

##   managerID Age0      Agecat GenderRN
## 1         1   53 Middle Aged   Female
## 2         2   53 Middle Aged     Male
## 3         3   63 Middle Aged     Male
## 4         4   52 Middle Aged     Male
## 5         5   58 Middle Aged   Female
## 6         6   56 Middle Aged     Male

13.7.3 Selecting observations

Selecting or excluding observations (rows) is typically a key aspect of successful data preparation and analysis.

Examples

Ask for rows 1 through 3 (first three observations)

newdata <- rdatos[1:3,] 
head(newdata)

##   managerID Age0 Gender      Agecat GenderRN
## 1         1   53      0 Middle Aged   Female
## 2         2   53      1 Middle Aged     Male
## 3         3   63      1 Middle Aged     Male

Select all Men over 55:

newdata <- rdatos[which(rdatos$GenderRN=="Male" & rdatos$Age0 > 55),]
head(newdata)

##    managerID Age0 Gender      Agecat GenderRN
## 3          3   63      1 Middle Aged     Male
## 6          6   56      1 Middle Aged     Male
## 13        13   60      1 Middle Aged     Male
## 15        15   58      1 Middle Aged     Male
## 19        19   60      1 Middle Aged     Male
## 21        21   56      1 Middle Aged     Male

13.7.4 The subset() function

The subset function is probably the easiest way to select variables and observations.

Examples

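Select all rows that have a value of Age0 greater than or equal to 55 or less than 84, keeping the variables managerID, GenderRN and Age0. Because the two conditions are joined with | (or) and every age satisfies at least one of them, all rows are kept here; use & (and) to require both conditions.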

newdata <- subset(rdatos, Age0 >= 55 | Age0 < 84, select=c(managerID, GenderRN, Age0))
head(newdata)

##   managerID GenderRN Age0
## 1         1   Female   53
## 2         2     Male   53
## 3         3     Male   63
## 4         4     Male   52
## 5         5   Female   58
## 6         6     Male   56
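
The difference between | and & inside subset() is easiest to see on a few rows. This is a sketch on a small hypothetical stand-in for rdatos:

```r
df <- data.frame(managerID = 1:5,
                 GenderRN  = c("Female", "Male", "Male", "Male", "Female"),
                 Age0      = c(53, 53, 63, 52, 58))   # small stand-in for rdatos

# & requires both conditions to hold; | requires at least one
both   <- subset(df, Age0 >= 55 & Age0 < 84, select = c(managerID, Age0))
either <- subset(df, Age0 >= 55 | Age0 < 84, select = c(managerID, Age0))

both$managerID     # only the rows with ages from 55 up to (not including) 84
## [1] 3 5
nrow(either)       # every row satisfies at least one condition
## [1] 5
```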

14 Basic Plots

Generation of random data for the examples

We generate a vector x of 15 values from a normal random distribution with mean=0 and standard deviation=1 and a vector y of 15 values from a normal random distribution with mean=0.2 and standard deviation=1:

set.seed(123)
x<-rnorm(15,0,1)
i<-c(1:15)
set.seed(4)
y<-rnorm(15,0.2,1)

14.1 Line charts

The simplest graphical representation of a numerical variable is the line chart provided by the command plot()

plot(x)  #plots the values of x (vertical axis) as a function of the index of each value (horizontal axis)

col: specifies the color of the points

col=1 black
col=2 red
col=3 green
col=4 blue
col=5 light blue

plot(x,col=3) #color=green

pch: specifies the symbol for the points

pch=1 symbol=circle

pch=2 symbol=triangle

pch=3 symbol=plus

pch=4 symbol=cross

pch=5 symbol=diamond

plot(x,col=3, pch=4) # color=green, symbol=cross

main adds a title to the plot. main is specified within the plot() function

plot(x,col=3, main="line chart") #color=green

title adds a title to the plot. title is specified outside the plot() function

plot(x,col=3)
title("line chart")

type: specifies how the data are drawn; for example, you can add different kinds of lines between the points

plot(x, type="o")  # add lines between points

plot(x, type="b")  # add lines between points without touching the points

xlab specifies the label of the x axis

plot(x, type="o", xlab="this is the x label")

plot(x, type="o", xlab="")  # removes the x label

points(): add additional points to an existing plot

plot(x, type="o", xlab="")  # removes the x label
points(y, col=2)      # add points y in red (col=2)

lines(): add points and lines to an existing plot

plot(x, type="o", xlab="")  # removes the x label
lines(y, type="o", col=2)

ylim: define the limits of y axis

plot(x, type="o", xlab="", ylim=c(-3, 3))  
lines(y, type="o", col=2)

range: range(x,y) returns the minimum and maximum over the two samples x and y, that is, (min(x,y), max(x,y))

range(x,y)

## [1] -1.265061  2.096540

plot(x, type="o", xlab="", ylim=range(x,y))
lines(y, type="o", col=2)

14.2 Dot plot

dotchart provides a dot plot of a numerical variable. A dot plot is similar to a line chart, but the values of the variable are on the horizontal axis and the indices on the vertical axis.

dotchart(x, labels=i)

The following dot plot provides a graphical representation of the ranking of variable x

dotchart(x[order(x)], labels=order(x))

14.3 Histograms

hist(): The histogram is one of the most important plots of a numerical variable

hist(x)  # title and x label are included by default

hist(x, col=5)
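
hist() also returns an object describing the bins, and its breaks argument suggests how many bins to use. The sketch below assumes the same x as above; pdf(NULL) is added only so the code runs without opening a window or writing a file, and can be omitted in an interactive session.

```r
set.seed(123)
x <- rnorm(15, 0, 1)

pdf(NULL)                        # null device: no file or window is created
h <- hist(x, breaks = 5, col = 5, main = "Histogram of x", xlab = "x")
dev.off()

h$breaks        # the bin boundaries that were actually used
sum(h$counts)   # the bin counts add up to the number of observations
## [1] 15
```

The value given to breaks is treated as a suggestion, so the number of bins drawn may differ from it.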

14.4 Boxplot

boxplot()

boxplot(x)
title("boxplot of x", ylab="x")

boxplot(x, horizontal=T)
title("boxplot of x", xlab="x")

This is an example of a plot containing both a histogram and a boxplot:

hist(x,main='histogram and boxplot of x',xlab='x')

ylim: range of the y axis

we need to extend the y axis in order to make room for the boxplot

hist(x,main='histogram and boxplot of x',xlab='x',ylim=c(0,12))

add=T : this allows the addition of the boxplot in the histogram

axes=F: we remove the axes of the boxplot

hist(x,main='histogram and boxplot of x',xlab='x',ylim=c(0,12)) 
boxplot(x,horizontal=TRUE,at=10,add=TRUE,axes=FALSE)

boxwex: specifies the width of the box

hist(x,main='histogram and boxplot of x',xlab='x',ylim=c(0,12)) 
boxplot(x,horizontal=TRUE,at=10,add=TRUE,axes=FALSE, boxwex = 5)

14.5 Scatter plots

plot() applied to two variables provides a scatter plot

y<-x^2   # Define y as the squared values of x
plot(x,y)

plot(x,y, main="scatter plot x and y")

14.6 Multiple Data Sets on One Plot

points(): add additional points to an existing plot

set.seed(123)
x<-rnorm(15,0,1)
y<-x^2
plot(x,y)

plot(x,y)
x1 <- c(-1,1)  # we add points x=1 and x=-1
y1 <- x1^2
points(x1,y1,col=2)

plot(x,y)
points(x1,y1,col=2, pch=3) # symbol=plus

We add labels to the points

plot(x,y)
points(x1,y1,col=2, pch=3) # symbol=plus
text(x1+0.1, y1+0.2, col=2, c("A", "B"))

plot(x,y)
points(x1,y1,col=2, pch=3) # symbol=plus

legend: x and y coordinates on the plot to place the legend followed by a list of labels to use

plot(x,y)
legend(-1,2,c("Original","new"),col=c(1,2),pch=c(1,4))

col.main: specifies the color of the title

plot(x,y)
points(x1,y1,col=2, pch=4) # symbol=cross
legend(-1,3,c("Original","new"),col=c(1,2),pch=c(1,4))
title("scatter plot", col.main=3)

font.main: specifies the font of the title

plot(x,y)
points(x1,y1,col=2, pch=4) # symbol=cross
legend(-1,3,c("Original","new"),col=c(1,2),pch=c(1,4))
title("scatter plot", font.main=3)

cex: Proportion of reduction or amplification of a font

plot(x,y)
points(x1,y1,col=2, pch=4) # symbol=cross
legend(-1,3,c("Original","new"),col=c(1,2),pch=c(1,4), cex=0.7)
#font.main: specifies the font of the title
title("scatter plot", font.main=3)

axes=F: remove axes

axis(): define new axes

plot(x,y,axes=FALSE)
points(x1,y1,col=2, pch=4) # symbol=cross
axis(1,pos=c(-0.5,0),at=seq(-2,2,by=0.4))
axis(2,pos=c(-1.5,-0.5),at=seq(0,3,by=0.5))

ann=F: removes annotation

lab (short for labels): define new labels

plot(x,y,axes=FALSE, ann=F)
points(x1,y1,col=2, pch=4) # symbol=cross
axis(1,pos=c(-0.5,0),at=seq(min(x),max(x),by=0.8), lab=c("a", "b", "c", "d"))
axis(2,pos=c(-1.5,-0.5),at=seq(0,3,by=0.5))

abline(): abline(a,b) adds line y=a+bx

lty: type of line

plot(x,y,axes=FALSE, ann=F)
points(x1,y1,col=2, pch=4) # symbol=cross
axis(1,pos=c(-0.5,0),at=seq(min(x),max(x),by=0.8), lab=c("a", "b", "c", "d"))
axis(2,pos=c(-1.5,-0.5),at=seq(0,3,by=0.5))
abline(1, -0.7, lty=3)

box(): creates a box around the plot

plot(x,y,axes=FALSE, ann=F)
points(x1,y1,col=2, pch=4) # symbol=cross
axis(1,pos=c(-0.5,0),at=seq(min(x),max(x),by=0.8), lab=c("a", "b", "c", "d"))
axis(2,pos=c(-1.5,-0.5),at=seq(0,3,by=0.5))
abline(1, -0.7, lty=3)
box()

14.7 Multiple scatter plots

pairs(): Scatter plots of all pairs of variables in a data frame

set.seed(123)
data<-data.frame(x1=runif(10), x2=runif(10), x3=runif(10), x4=runif(10))
pairs(data)

14.8 Bar charts

barplot()

group<-c("A","B","C")
freq<-c(20, 50, 30)
barplot(freq)

names.arg

barplot(freq, names.arg=group)

density

barplot(freq, names.arg=group, density=c(5,30,70))

border

barplot(freq, names.arg=group, density=c(5,30,70), border=3)

14.9 Multiple bar plot

group<-c("A","B","C")
freq1<-c(20, 50, 30)
freq2<-c(40, 20, 10)
freq<-rbind(freq1,freq2)
freq

##       [,1] [,2] [,3]
## freq1   20   50   30
## freq2   40   20   10

names.arg

barplot(freq, names.arg=group)

barplot(freq, names.arg=group, col=c(3,4))

beside

barplot(freq, names.arg=group, beside=TRUE)

barplot(freq, names.arg=group, beside=TRUE, col=c(2,3))
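
With two series in one chart, a legend helps to tell the bars apart. The sketch below uses the legend.text argument of barplot(); pdf(NULL) is added only so the code runs without opening a window or writing a file, and the name mids is an assumption.

```r
group <- c("A", "B", "C")
freq  <- rbind(freq1 = c(20, 50, 30), freq2 = c(40, 20, 10))

pdf(NULL)   # null device; omit the two device lines in an interactive session
mids <- barplot(freq, names.arg = group, beside = TRUE, col = c(2, 3),
                legend.text = rownames(freq))   # one legend entry per row of freq
dev.off()

mids        # barplot() returns the bar midpoints, useful for adding text or lines
```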

14.10 Important graphical functions and parameters

par(): set graphical parameters

Before you change the graphical parameters it is convenient to store the default values

defaultpar<-par(no.readonly=TRUE)  # read-only parameters cannot be restored, so leave them out

Example:

par(mfrow=c(2,3)) # puts 6 pictures in a plot distributed in 2 rows and 3 columns
hist(data$x1, main="Histogram x1")
hist(data$x2, main="Histogram x2")
hist(data$x3, main="Histogram x3")
boxplot(data$x1, main="Boxplot x1")
boxplot(data$x2, main="Boxplot x2")
boxplot(data$x3, main="Boxplot x3")

par(defaultpar)  # reset the default graphical parameters
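
A common alternative is to store only the parameters you change, since par() returns their previous values when it sets new ones. A minimal sketch; the name op is an assumption, and pdf(NULL) is used only so the code runs without opening a window or writing a file.

```r
pdf(NULL)                        # null device, used here only so the sketch runs anywhere
op <- par(mfrow = c(1, 2))       # par() returns the previous values of what it changes
hist(rnorm(50), main = "Left panel")
boxplot(rnorm(50), main = "Right panel")
par(op)                          # restore only the parameters that were changed
restored <- par("mfrow")         # back to a single panel
dev.off()

restored
## [1] 1 1
```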

Important arguments:

mfrow: number of pictures per row and column in a plot

mar: specifies the margin sizes around the plotting area in order: c(bottom, left, top, right)

col: color of symbols

pch: type of symbols, samples: example(points)

lwd: line width

cex.*: control font sizes

14.11 Saving Graphics to Files

pdf() redirects the plots to a pdf file. Similar functions: jpeg(), png(), postscript(), tiff()

dev.off() shuts down the current device (or the one specified)

Example:

pdf("color_chart.pdf")  # creates a pdf file containing the following plot, a color chart

plot(1, 1, xlim=c(1,5.5), ylim=c(0,7), type="n", ann=FALSE)
text(1:5, rep(6,5), labels=c(0:4), cex=1:5, col=1:5)
points(1:5, rep(5,5), cex=1:5, col=1:5, pch=0:4)
text((1:5)+0.4, rep(5,5), cex=0.6, (0:4))
points(1:5, rep(4,5), cex=2, pch=(5:9))
text((1:5)+0.4, rep(4,5), cex=0.6, (5:9))
points(1:5, rep(3,5), cex=2, pch=(10:14))
text((1:5)+0.4, rep(3,5), cex=0.6, (10:14))
points(1:5, rep(2,5), cex=2, pch=(15:19))
text((1:5)+0.4, rep(2,5), cex=0.6, (15:19))

points((1:6)*0.8+0.2, rep(1,6), cex=2, pch=(20:25))
text((1:6)*0.8+0.5, rep(1,6), cex=0.6, (20:25))

dev.off()
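
The same pattern works for any plot: open the device, draw, then close the device. A shorter sketch follows, writing to a temporary folder; the file name and the width and height values are assumptions for illustration.

```r
out_file <- file.path(tempdir(), "example_plot.pdf")   # hypothetical file name

pdf(out_file, width = 6, height = 4)   # page size in inches
plot(1:10, (1:10)^2, type = "o", main = "Saved plot")
dev.off()                              # the file is only complete after dev.off()

file.exists(out_file)
## [1] TRUE
```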

15 Interesting references:

R Tutorial. An R Introduction to Statistics

QUICK-R

Cookbook for R

Gaston Sanchez blog

An Introduction to R from R Development Core Team

Programming in R by T.Girke

R & Bioconductor by T.Girke

16 Exercises / Homework =)

16.1 Part I

Create a new script called “intro_R_xxx.R” (replace xxx by your surname) that contains the code of the following exercises.

Code to execute a script called “myscript.R”
Code to assign the value A to a variable x
Code to generate a sequence from 7 to 30 with increment 3
Code to obtain information about function glm
Code to list all the objects in the current environment
Code to remove all objects
Code to specify the following path to the working directory: C:
Create a vector x containing the numbers 1, 2, 1, 1, 1, 2
Create a vector y containing the words yes, no, no, yes, no
Compute the number of elements in vector y
Code to obtain the sequence of integer numbers from 10 to 25
Use the function rep() to generate the sequence 1, 2, 1, 2, 1, 2
Code to generate the sequence 1, 1, 1, 2, 2, 2
Code to generate a sequence containing 7 yes and 5 no
Code to obtain the sequence 40, 35, 30, 25, 20, 15, 10

16.2 Part II

Read file example.txt and store it in a data frame called “example”
Show rows 5,11,18 and 20 in data “example”
Show variable “sex” for rows from 15 to 50 in data “example”
Change the name of the “cc” column to “Case/Control” in data “example”
Export “example” to a semicolon-delimited “csv” file without row names and without quotes
Retrieve the fourth element in vector x=(3, -1, 0, 2, -5, 7, 1)
Retrieve the first, second and fifth elements in vector x=(3, -1, 0, 2, -5, 7, 1)
Retrieve all the elements in vector x=(3, -1, 0, 2, -5, 7, 1) except the second one
Change the value of the first and second elements in x=(3, -1, 0, 2, -5, 7, 1) to 0
Assign the value 0 to the elements in x=(3, -1, 0, 2, -5, 7, 1) that are larger than 2
Create a matrix M with 4 rows and 3 columns and fill it by rows with even numbers from 2 to 24
Obtain the number of rows and columns of matrix M
Retrieve the element in the first row and third column of matrix M
Retrieve all the elements in the third column of matrix M
Retrieve the third and fourth elements in the second column of matrix M
Retrieve a matrix containing all rows in M except the first one
Add a new column at the beginning of matrix M with the integers from 1 to 4
Add a row at the end of matrix M with values 2, 4, 8
Generate a data frame called chol (for cholesterol) containing the following variables (columns): id=(1, 2, 3, 4, 5), gender=(1, 1, 2, 1, 2), LDL=(237, 256, 198, 287, 212)
In the previous data frame chol use the function rownames() to assign to each row the name of the patient: John, Peter, Hellen, Mat and Mary
Show the first 3 rows in data frame chol
Retrieve the LDL cholesterol levels of Peter using his position in the data frame
Retrieve the LDL cholesterol levels of Peter using his name in the code
Save the LDL cholesterol levels of the 5 individuals in a new vector called ldl_chol
Create a new data frame named chol_high including only those individuals with LDL levels above 240
Let’s consider the vector x=(0.6, -1.3, 0.98, -0.4, 0.16) and perform a t.test on x for the null hypothesis that the mean is equal to 0 and save the output in an object called ttestx
Show the attributes of object ttestx
From the output of ttestx retrieve the confidence interval of the mean
Check the data type of gender in data frame chol
Transform variable gender from data frame chol into a factor variable called gender1 with 1=male and 2=female and with males as the reference group
Transform variable gender from data frame chol into a factor variable called gender2 with 1=male and 2=female and with females as the reference group
Write the data frame chol into a text file called cholesterol.txt
Write the data frame chol into a csv file called cholesterol.csv
Write the code to install and load the R package mbmdr from CRAN
Get a numerical summary of x=(1.5, 2.3, 4, 5.6, 2.1)
Obtain the 40th percentile of x=(1.5, 2.3, 4, 5.6, 2.1)
Obtain the percentages of males and females in gender=(1, 1, 2, 1, 2)
Obtain the Pearson and Spearman correlation coefficient between x=1:10 and y=x^2
Test for the equality of variances in LDL cholesterol levels between males and females, assuming that LDL levels are normally distributed
Test for differences in LDL cholesterol mean levels between males and females, assuming that LDL levels are normally distributed
Test for differences in LDL cholesterol mean levels between males and females, without the assumption of normality
Plot a histogram of x<-rnorm(100, 10, 4).
Save the previous histogram in a pdf file called histogram.pdf

Introduction to R programming

Natalia Vilor-Tejedor (nvilor@barcelonabeta.org)
