R has a number of advantages over other statistical software tools (e.g. SAS, SPSS, Stata):
Open source (it is free)
Cross-platform (Windows, Mac, Linux)
Updated regularly
Extremely flexible and can do or be made to do just about anything
Amazing graphical capabilities
Let’s create a new project directory for our “Introduction to R” lesson today.
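Go to the File menu and select New Project.
In the New Project window, choose New Directory. Then, choose Empty Project. Name your new directory Intro-to-R and then “Create the project as subdirectory of:” the Desktop (or location of your choice).
Click on Create Project.
If the project does not open automatically in RStudio, go to the File menu, select Open Project, and choose Intro-to-R.Rproj.
To open a new script, go to the File menu, select New File, and then select R Script.
The RStudio interface has four main panels: the script editor, the console, the Environment/History panel, and the Files/Plots/Packages/Help panel.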
Before we organize our working directory, let’s check to see where our current working directory is located by typing into the console:
getwd()
If you want to use a different directory as your working directory, you can navigate to that folder in the Files tab, click on the More dropdown menu and select Set As Working Directory, or type into the console:
setwd(path)
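The two commands above can be combined to change the working directory and then restore it. This is only a minimal sketch: tempdir() is used as the target folder purely as an assumption for illustration, because it always exists; in practice you would pass the path of your own project folder.

```r
# Remember the current working directory so it can be restored later
old_wd <- getwd()

# tempdir() is used here only as a folder that is guaranteed to exist
setwd(tempdir())
new_wd <- getwd()
print(new_wd)

# Go back to where we started
setwd(old_wd)
restored_wd <- getwd()
```

Storing the old path first is a simple habit that avoids leaving a session pointing at an unexpected folder.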
Interpreting the command prompt helps you understand when R is ready to accept commands. The list below describes the different states of the command prompt and how you can exit a command:
Console is ready to accept commands: >.
If R is ready to accept commands, the R console shows a > prompt.
When the console receives a command (typed directly into the console or run from the script editor with Ctrl-Enter), R will try to execute it.
After running, the console will show the results and come back with a new > prompt to wait for new commands.
Console is waiting for you to enter more data: +.
If the command you entered isn’t complete yet, the console shows a + prompt and waits for you to enter the rest. Often this happens because you have not ‘closed’ a parenthesis or quotation.
Escaping a command and getting a new prompt: esc
If you’re in RStudio and you can’t figure out why your command isn’t running, you can click inside the console window and press esc to escape the command and bring back a new prompt >.
R can be used as a calculator. You can just type your equation and execute the command:
2+2
1+2*3-4/5
(19465*0.25)^23
5%%2
Example: Addition of two values
3 + 6
## [1] 9
# The output is always preceded by a number between brackets: [1]
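As a short illustration of how R evaluates such expressions, the sketch below stores a few results in variables (the names a, b, r, q and p are chosen here only for the example) and notes the value each one takes.

```r
# Multiplication and division are evaluated before addition and subtraction
a <- 1 + 2 * 3 - 4 / 5    # 1 + 6 - 0.8 = 6.2

# Parentheses change the order of evaluation
b <- (1 + 2) * 3          # 9

# %% gives the remainder of a division, %/% the integer quotient
r <- 5 %% 2               # 1
q <- 5 %/% 2              # 2

# ^ raises a number to a power
p <- 2^10                 # 1024

print(c(a, b, r, q, p))
```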
log(x) Natural log.
sum(x) Sum.
exp(x) Exponential.
mean(x) Mean.
max(x) Largest element.
median(x) Median.
min(x) Smallest element.
quantile(x) Percentage quantiles.
round(x, n) Round to n decimal places.
rank(x) Rank of elements.
var(x) The variance.
cor(x,y) Correlation.
sd(x) The standard deviation.
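To see several of these functions at work, here is a small sketch on two made-up vectors (the values of x and y are invented for this illustration).

```r
# Two small example vectors (invented values)
x <- c(2, 4, 4, 5, 10)
y <- c(1, 2, 3, 4, 5)

total   <- sum(x)            # 25
avg     <- mean(x)           # 5
med     <- median(x)         # 4
v       <- var(x)            # sample variance (divides by n - 1): 9
s       <- sd(x)             # square root of the variance: 3
rk      <- rank(x)           # tied values share the average rank: 1 2.5 2.5 4 5
qs      <- quantile(x)       # minimum, quartiles and maximum by default
rounded <- round(exp(1), 2)  # e rounded to two decimals: 2.72
r_xy    <- cor(x, y)         # Pearson correlation by default

print(qs)
```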
You can assign a number to a name.
x <- 3
Now “x” is called a variable and it appears in the workspace window, which means that R stores the value of “x” in its memory and it can be used later.
In general, you use the assignment operator <- to assign a value to an object.
If you type the name of a variable, the current value of the variable will be printed.
x
## [1] 3
Some variables are already defined in R, such as “pi”:
pi
## [1] 3.141593
Calculating the circumference of a circle with radius 3
2 * pi * x
## [1] 18.84956
Changing the value of radius and reusing the code
x <- 5
2 * pi * x
## [1] 31.41593
Remarks: R is case sensitive, so “A” and “a” are different variables.
A <- 33
a <- 44
A
## [1] 33
a
## [1] 44
x <- c(1, 2, 3);
x
## [1] 1 2 3
is.numeric(x)
## [1] TRUE
x <- c("1", "2", "3"); x
## [1] "1" "2" "3"
is.character(x)
## [1] TRUE
x <- 1:10 < 5
x
## [1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
!x
## [1] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
which(x) # Returns index for the 'TRUE' values in logical vector
## [1] 1 2 3 4
factor(c("1", "0", "1", "0", "1"))
## [1] 1 0 1 0 1
## Levels: 0 1
myVec <- 1:10; names(myVec) <- letters[1:10]
class(myVec)
## [1] "integer"
myMA <- matrix(1:30, 3, 10, byrow = TRUE)
class(myMA)
## [1] "matrix" "array"
myDF <- data.frame(Col1=1:10, Col2=10:1)
class(myDF)
## [1] "data.frame"
myL <- list(name="Fred", wife="Mary", no.children=3, child.ages=c(4,7,9))
class(myL)
## [1] "list"
ls() List all variables in the environment.
rm(x) Remove x from the environment.
rm(list = ls()) Remove all variables from the environment.
Remarks: You can use the Environment panel in RStudio to browse the variables in your environment.
An R script is a text file where you can type the commands that you want to execute in R.
Scripts have file names with the extension .R, for instance, “myscript.R”. The R script is where you keep a record of your work. Using R scripts is very convenient because all R commands used in a session are saved in the script file and can be executed again in a future session.
To create a new R script go to
File -> New File -> R Script
To open an existing R script go to
File -> Open File -> select your script
If you want to run a line from the script window, place the cursor anywhere on the line and click Run, or press Ctrl+Enter (Windows/Linux) or Command+Enter (macOS).
You can execute part of a script (or the whole script) by selecting the corresponding lines and clicking Run or pressing Ctrl+Enter or Command+Enter.
You can also execute the whole script with the R function source():
source("scriptname.R")
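A minimal sketch of source() in action follows. The script content and the temporary file are assumptions made only so that the example is self-contained; normally the script would be a file you have saved yourself.

```r
# Write a two-line script to a temporary file (hypothetical content)
script_file <- tempfile(fileext = ".R")
writeLines(c("radius <- 3",
             "circumference <- 2 * pi * radius"),
           script_file)

# Run every command in the file, in order
source(script_file)

# The variables created by the script are now available in the session
print(circumference)
```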
Most tasks in R involve functions. R comes with a large set of pre-installed functions. Many of them belong to the base package, which is located in your library directory.
An R function is used by typing its name followed by its arguments (also called parameters) between parentheses.
Example: seq() is a function for generating a sequence of numbers. Its arguments are from, which specifies the first number of the sequence, to, the last number of the sequence, and by, the increment of the sequence.
seq(10, 80, 2) # generates a sequence from 10 to 80 with increment 2
## [1] 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52 54 56 58
## [26] 60 62 64 66 68 70 72 74 76 78 80
The number between brackets at the beginning of each line of the output indicates the position of the first element of that line: 10 is the first element of the output and 60 is the 26th element of the output.
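Arguments can be given by position, as above, or by name. The sketch below shows both styles with seq(); the variable names s1, s2 and s3 are chosen only for this example.

```r
# Arguments passed by name: the order no longer matters
s1 <- seq(from = 10, to = 20, by = 2)   # 10 12 14 16 18 20
s2 <- seq(by = 2, to = 20, from = 10)   # same result as s1

# Instead of the increment, you can ask for a given number of values
s3 <- seq(from = 0, to = 1, length.out = 5)   # 0.00 0.25 0.50 0.75 1.00

print(s1)
print(length(s1))   # 6 elements
```

Naming the arguments makes the call easier to read and protects against mixing up their order.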
Functions typically take an input and return an output. For example, the function print() displays its argument in the console:
print("Hello world")
print(exp)
read.table(): this function reads a text data file and stores the values in a data frame. Its main arguments are:
header=T : the first row of the data file contains the names of the columns or variables
sep="" : the values of each row are separated by white space (one or more spaces or tabs)
sep="\t" : the values of each row are separated by a tab
sep="," : the values of each row are separated by a comma
dec="." : the decimal symbol is a point
Example
example <- read.table("treatment.txt", header=F, sep="")
# This instruction reads file "treatment.txt" and creates the dataframe "example"
write.table(): this function writes a data frame into a text file
Example
write.table(treatment, file="treatment.txt", row.names=FALSE)
# This instruction writes the dataframe "treatment" in file "treatment.txt"
# row.names=FALSE prevents R from printing the names of the rows (or just the row numbers) in the output file
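The two functions can be checked together with a small round trip: write a data frame to a file and read it back. In this sketch the data frame contents and the temporary file name are invented for illustration.

```r
# A small data frame invented for this illustration
treatment <- data.frame(id = 1:3,
                        group = c("A", "B", "A"),
                        dose = c(0.5, 1.0, 1.5))

# Write it to a temporary tab-separated file, without row names
out_file <- tempfile(fileext = ".txt")
write.table(treatment, file = out_file, sep = "\t", row.names = FALSE)

# Read it back; header = TRUE because the first line holds the column names
example <- read.table(out_file, header = TRUE, sep = "\t")
print(example)
```

Note that header and sep in read.table() have to match the way the file was written.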
R comes with a standard set of packages. Others are available for download and installation.
Standard packages: The standard (or base) packages are considered part of the R source code. They contain the basic functions that allow R to work, and the datasets and standard statistical and graphical functions. They should be automatically available in any R installation.
Contributed packages: There are thousands of contributed packages for R, written by many different authors. Some of these packages implement specialized statistical methods. Some (the recommended packages) are distributed with every binary distribution of R.
Most packages are available for download from CRAN and other repositories such as Bioconductor, a large repository of tools for the analysis and comprehension of high-throughput genomic data.
Once installed, a package has to be loaded into the session to be used.
Using an R package or library for the first time requires two steps: installing the library and loading the library with the following functions:
install.packages(): Install the package
library(): load the package
Example: How to install and load the library “survival”?
install.packages("survival") # you only need to do this once
library(survival) # load library
To install packages from GitHub, make sure you have the newest version of the devtools package. To do so, run:
install.packages("devtools")
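Then you have two options:
Use the install_github() function from devtools, giving the repository as "username/repository":
devtools::install_github("jfpalomeque/Momocs")
Download the package source and use the install.packages() function in R with:
install.packages(file_name_and_path, repos = NULL, type="source")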
help( ) or ? : provides information about a function or an object
Example:
help(mean)
?mean
help.search( ): provides information about a topic
Example:
help.search("logistic regression")
In a typical research project, you’ll need to create new variables and transform existing ones. This is accomplished with statements of the form:
variable <- expression
Let’s say that you have a data frame named mydata, with variables x1 and x2,
mydata <- data.frame(x1=runif(100), x2=runif(100))
and you want to create a new variable sumx that adds these two variables and a new variable called meanx that averages the two variables. If you use the code
sumx <- x1 + x2
meanx <- (x1 + x2)/2
you’ll get an error, because R doesn’t know that x1 and x2 are from data frame mydata.
If you use this code instead, the statements will succeed but you’ll end up with a data frame (mydata) and two separate vectors (sumx and meanx).
sumx <- mydata$x1 + mydata$x2
meanx <- (mydata$x1 + mydata$x2)/2
Ultimately, you want to incorporate new variables into the original data frame:
mydata<-data.frame(x1 = c(2, 2, 6, 4), x2 = c(3, 4, 2, 8))
mydata$sumx <- mydata$x1 + mydata$x2
mydata$meanx <- (mydata$x1 + mydata$x2)/2
head(mydata)
## x1 x2 sumx meanx
## 1 2 3 5 2.5
## 2 2 4 6 3.0
## 3 6 2 8 4.0
## 4 4 8 12 6.0
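Base R also offers transform(), which lets you refer to the columns by name without repeating mydata$. The following sketch, using the same small data frame, is one possible alternative to the assignments above.

```r
mydata <- data.frame(x1 = c(2, 2, 6, 4), x2 = c(3, 4, 2, 8))

# Inside transform(), x1 and x2 are looked up in mydata
mydata <- transform(mydata,
                    sumx  = x1 + x2,
                    meanx = (x1 + x2) / 2)
print(mydata)
```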
Recoding involves creating new values of a variable conditional on the existing values of the same and/or other variables. For example, you may want to:
Change a continuous variable into a set of categories.
Create a pass/fail variable based on a set of cutoff scores.
Replace miscoded values with correct values.
set.seed(123456)
rdatos <- data.frame(ID=seq(1,1000,1), Age=rbinom(n=1000,105, 0.55))
rdatos$Agecat[rdatos$Age > 70] <- "Elder"
rdatos$Agecat[rdatos$Age >= 50 & rdatos$Age <= 70] <- "Middle Aged"
rdatos$Agecat[rdatos$Age < 50] <- "Young"
head(rdatos)
## ID Age Agecat
## 1 1 53 Middle Aged
## 2 2 53 Middle Aged
## 3 3 63 Middle Aged
## 4 4 52 Middle Aged
## 5 5 58 Middle Aged
## 6 6 56 Middle Aged
Using the within function:
set.seed(123456)
rdatos <- data.frame(ID=seq(1,1000,1), Age=rbinom(n=1000,105, 0.55), Gender=rbinom(n=1000, 1, 0.5))
rdatos <- within(rdatos,{
Agecat <- NA
Agecat[Age > 70] <- "Elder"
Agecat[Age >= 50 & Age <= 70] <- "Middle Aged"
Agecat[Age < 50] <- "Young" })
rdatos <- within(rdatos,{
GenderRN <- NA
GenderRN[Gender==0] <- "Female"
GenderRN[Gender==1] <- "Male"})
head(rdatos)
## ID Age Gender Agecat GenderRN
## 1 1 53 0 Middle Aged Female
## 2 2 53 1 Middle Aged Male
## 3 3 63 1 Middle Aged Male
## 4 4 52 1 Middle Aged Male
## 5 5 58 0 Middle Aged Female
## 6 6 56 1 Middle Aged Male
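Another way to turn a numeric variable into categories is the base function cut(). The sketch below reproduces the same three age groups on a short vector of invented ages; the break points assume that ages are whole numbers.

```r
# Invented ages, including values at the group boundaries
ages <- c(45, 50, 70, 71, 49)

# Intervals are closed on the right by default:
# (-Inf, 49] = Young, (49, 70] = Middle Aged, (70, Inf) = Elder
agecat <- cut(ages,
              breaks = c(-Inf, 49, 70, Inf),
              labels = c("Young", "Middle Aged", "Elder"))

print(data.frame(ages, agecat))
```

cut() returns a factor, which is often convenient for tables and models; use as.character() if you prefer plain text.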
Several packages offer useful recoding functions; in particular, the car package’s recode() function recodes numeric and character vectors and factors very simply.
You can change variable names. Let’s say that you want to rename the variable Age to Age0 and ID to managerID. You can use the rename() function from the reshape package. The format of the rename() function is
rename(dataframe, c(oldname="newname", oldname="newname", ...))
library(reshape)
rdatos <- rename(rdatos, c(ID="managerID", Age="Age0"))
names(rdatos)
## [1] "managerID" "Age0" "Gender" "Agecat" "GenderRN"
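If you prefer not to load an extra package, the same renaming can be done in base R with names(). This is a small sketch on a made-up data frame called df.

```r
# A small data frame invented for this illustration
df <- data.frame(ID = 1:3, Age = c(53, 63, 52))

# Replace a name by locating it in the vector of column names
names(df)[names(df) == "ID"]  <- "managerID"
names(df)[names(df) == "Age"] <- "Age0"

print(names(df))
```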
Data sets are likely to be incomplete because of missed questions, faulty equipment, or improperly coded data. In R, missing values are represented by the symbol NA (not available). Undefined results (for example, 0 divided by 0) are represented by the symbol NaN (not a number).
R provides a number of functions for identifying observations that contain missing values. The function is.na() allows you to test for the presence of missing values.
Assume that you have a vector:
y <- c(1, 2, 3, NA)
y
## [1] 1 2 3 NA
then the function is.na(y) returns c(FALSE, FALSE, FALSE, TRUE).
We can use assignments to recode values to missing. For example, suppose that missing ages were entered as 99. We must let R know that the value 99 means missing (otherwise the mean age for this sample will be way off!). You can accomplish this by recoding the variable:
rdatos$Age0[rdatos$Age0 == 99] <- NA
Any value of Age0 that’s equal to 99 is changed to NA.
Once you’ve identified the missing values, you need to eliminate them in some way before analyzing your data further. The reason is that arithmetic expressions and functions that contain missing values yield missing values. For example, consider the following code:
x <- c(1, 2, NA, 3)
y <- x[1] + x[2] + x[3] + x[4]
z <- sum(x)
z
## [1] NA
Both y and z will be NA (missing) because the third element of x is missing. Luckily, most numeric functions have an na.rm=TRUE option that removes missing values prior to calculations and applies the function to the remaining values:
x <- c(1, 2, NA, 3)
y <- sum(x, na.rm=TRUE)
y
## [1] 6
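A few more ways of working with missing values are sketched below: counting them, locating them, and dropping them from a vector or a data frame. The data frame df is invented for this illustration.

```r
x <- c(1, 2, NA, 3)

n_missing     <- sum(is.na(x))        # TRUE counts as 1, so this is 1
where_missing <- which(is.na(x))      # position 3
m             <- mean(x, na.rm = TRUE)  # (1 + 2 + 3) / 3 = 2
x_complete    <- x[!is.na(x)]         # 1 2 3

# complete.cases() flags the rows of a data frame without any NA
df <- data.frame(a = c(1, NA, 3), b = c("u", "v", NA))
df_complete <- df[complete.cases(df), ]   # only the first row remains
print(df_complete)
```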
Sometimes, viewing a dataset in a sorted order can tell you quite a bit about the data. To sort a data frame in R, use the order() function.
newdata <- rdatos[order(rdatos$Age0),]
newdata[1:10,]
## managerID Age0 Gender Agecat GenderRN
## 38 38 40 1 Young Male
## 725 725 43 1 Young Male
## 324 324 44 1 Young Male
## 485 485 44 1 Young Male
## 761 761 44 1 Young Male
## 246 246 45 0 Young Female
## 23 23 46 1 Young Male
## 220 220 46 1 Young Male
## 376 376 46 0 Young Female
## 895 895 46 1 Young Male
creates a new dataset containing rows sorted from youngest to oldest.
The statement:
newdata <- rdatos[order(rdatos$GenderRN, rdatos$Age0),]
newdata[1:10,]
## managerID Age0 Gender Agecat GenderRN
## 246 246 45 0 Young Female
## 376 376 46 0 Young Female
## 153 153 47 0 Young Female
## 191 191 47 0 Young Female
## 304 304 47 0 Young Female
## 583 583 47 0 Young Female
## 629 629 47 0 Young Female
## 793 793 47 0 Young Female
## 331 331 48 0 Young Female
## 348 348 48 0 Young Female
sorts the rows into female followed by male, and youngest to oldest within each gender.
Also, we can sort the rows by gender, and then from oldest to youngest manager within each gender.
newdata <- rdatos[order(rdatos$GenderRN, -rdatos$Age0),]
newdata[1:10,]
## managerID Age0 Gender Agecat GenderRN
## 9 9 72 0 Elder Female
## 615 615 72 0 Elder Female
## 988 988 72 0 Elder Female
## 25 25 71 0 Elder Female
## 70 70 71 0 Elder Female
## 129 129 71 0 Elder Female
## 662 662 71 0 Elder Female
## 631 631 70 0 Middle Aged Female
## 798 798 69 0 Middle Aged Female
## 946 946 69 0 Middle Aged Female
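To make the behaviour of order() easier to follow, here is a sketch on a four-row data frame (names and ages are invented). It also shows one option for mixing sort directions when a minus sign cannot be used, for example with character variables.

```r
# A tiny data frame invented for this illustration
staff <- data.frame(name   = c("Ana", "Ben", "Cleo", "Dan"),
                    gender = c("Female", "Male", "Female", "Male"),
                    age    = c(61, 47, 52, 47))

# order() returns the row positions that would sort the values
idx <- order(staff$age)          # 2 4 3 1 (ties keep their original order)
by_age <- staff[idx, ]

# Gender ascending, then age descending (minus sign works for numeric variables)
by_gender_age <- staff[order(staff$gender, -staff$age), ]

# Gender descending, then age ascending: one direction per key needs method = "radix"
by_gender_desc <- staff[order(staff$gender, staff$age,
                              decreasing = c(TRUE, FALSE),
                              method = "radix"), ]
print(by_gender_desc)
```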
If your data exist in multiple locations, you’ll need to combine them before moving forward.
To merge two data frames (datasets) horizontally, you use the merge() function. In most cases, two data frames are joined by one or more common key variables.
dataframeA <- data.frame(ID= seq(1,100,1), ID2= seq(1,100,1), X1=runif(100), X2=rnorm(100), Country=rep(c("Spain", "EEUU", "Iceland"), times=c(50,25,25)))
dataframeB <- data.frame(ID= seq(1,100,1), Y1=rnorm(100), Y2=rnorm(100), Country=rep(c("Spain", "EEUU", "Iceland"), times=c(30,45,25)))
To merge dataframeA and dataframeB by ID:
total <- merge(dataframeA, dataframeB, by="ID")
To merge dataframeA and dataframeB by ID2 and ID:
total <- merge(dataframeA, dataframeB, by.x="ID2", by.y="ID")
Similarly, to merge two data frames by ID and Country:
total <- merge(dataframeA, dataframeB, by=c("ID","Country"))
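By default merge() keeps only the rows whose key appears in both data frames. The sketch below, on two small invented data frames, shows that default together with the all and all.x arguments, which keep unmatched rows and fill the gaps with NA.

```r
# Two small data frames invented for this illustration
left  <- data.frame(ID = c(1, 2, 3), X = c("a", "b", "c"))
right <- data.frame(ID = c(2, 3, 4), Y = c(20, 30, 40))

# Default: only IDs present in both data frames (2 and 3)
inner <- merge(left, right, by = "ID")

# all = TRUE: every ID from either data frame (1 to 4), NA where there is no match
outer <- merge(left, right, by = "ID", all = TRUE)

# all.x = TRUE: every ID from the first data frame (1 to 3)
left_join <- merge(left, right, by = "ID", all.x = TRUE)

print(outer)
```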
If you’re joining two matrices or data frames horizontally and don’t need to specify a common key, you can use the cbind() function:
total <- cbind(dataframeA, dataframeB)
Vertical concatenation is typically used to add observations to a data frame. To join two data frames (datasets) vertically, use the rbind() function.
total <- rbind(dataframeA, dataframeB)
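The two data frames must have the same variables, but they don’t have to be in the same order. With the example data frames above, rbind() stops with an error because their columns differ. If dataframeA has variables that dataframeB doesn’t, then before joining them either delete the extra variables in dataframeA, or create the additional variables in dataframeB and set them to NA (missing).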
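When the variables of the two data frames do not match, two common remedies are to drop the extra variables or to add the missing ones as NA. Both are sketched below on two small data frames, dfA and dfB, invented for this illustration.

```r
# dfA has a variable (X1) that dfB lacks
dfA <- data.frame(ID = 1:2, X1 = c(0.1, 0.2), Country = c("Spain", "Iceland"))
dfB <- data.frame(ID = 3:4, Country = c("Spain", "Spain"))

# Option 1: keep only the variables that dfB also has
stacked_drop <- rbind(dfA[, names(dfB)], dfB)

# Option 2: create the missing variable in dfB and set it to NA
dfB$X1 <- NA
stacked_na <- rbind(dfA, dfB)   # columns are matched by name, not by position

print(stacked_na)
```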
R has powerful indexing features for accessing the elements of an object. These features can be used to select and exclude variables, observations, or both.
It’s a common practice to create a new dataset from a limited number of variables chosen from a larger dataset. The elements of a data frame are accessed using the notation dataframe[row indices, column indices]. You can use this to select variables.
The statement:
newdata <- rdatos[, c(1:2)]
head(newdata)
## managerID Age0
## 1 1 53
## 2 2 53
## 3 3 63
## 4 4 52
## 5 5 58
## 6 6 56
selects the first two variables from the rdatos data frame and saves them to the data frame newdata. Leaving the row indices blank (,) selects all the rows by default.
The statement
myvars <- c("managerID", "Age0")
newdata <- rdatos[myvars]
head(newdata)
## managerID Age0
## 1 1 53
## 2 2 53
## 3 3 63
## 4 4 52
## 5 5 58
## 6 6 56
accomplishes the same variable selection. Here, variable names (in quotes) have been entered as column indices, thereby selecting the same columns.
There are many reasons to exclude variables. For example, if a variable has several missing values, you may want to drop it prior to further analyses. You could exclude variable Gender with the statement
myvars <- names(rdatos) %in% c("Gender")
newdata <- rdatos[!myvars]
head(newdata)
## managerID Age0 Agecat GenderRN
## 1 1 53 Middle Aged Female
## 2 2 53 Middle Aged Male
## 3 3 63 Middle Aged Male
## 4 4 52 Middle Aged Male
## 5 5 58 Middle Aged Female
## 6 6 56 Middle Aged Male
Knowing that Gender is the 3rd variable, you could exclude it with the statement
newdata <- rdatos[,-3]
head(newdata)
## managerID Age0 Agecat GenderRN
## 1 1 53 Middle Aged Female
## 2 2 53 Middle Aged Male
## 3 3 63 Middle Aged Male
## 4 4 52 Middle Aged Male
## 5 5 58 Middle Aged Female
## 6 6 56 Middle Aged Male
Selecting or excluding observations (rows) is typically a key aspect of successful data preparation and analysis.
Ask for rows 1 through 3 (first three observations)
newdata <- rdatos[1:3,]
head(newdata)
## managerID Age0 Gender Agecat GenderRN
## 1 1 53 0 Middle Aged Female
## 2 2 53 1 Middle Aged Male
## 3 3 63 1 Middle Aged Male
Select all men over 55:
newdata <- rdatos[which(rdatos$GenderRN=="Male" & rdatos$Age0 > 55),]
head(newdata)
## managerID Age0 Gender Agecat GenderRN
## 3 3 63 1 Middle Aged Male
## 6 6 56 1 Middle Aged Male
## 13 13 60 1 Middle Aged Male
## 15 15 58 1 Middle Aged Male
## 19 19 60 1 Middle Aged Male
## 21 21 56 1 Middle Aged Male
The subset function is probably the easiest way to select variables and observations.
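Select all rows that have a value of Age0 greater than or equal to 55 or less than 84, keeping the variables managerID, GenderRN and Age0. Because every age satisfies at least one of these two conditions, all rows are returned; replace | with & to require both conditions at once.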
newdata <- subset(rdatos, Age0 >= 55 | Age0 < 84, select=c(managerID, GenderRN, Age0))
head(newdata)
## managerID GenderRN Age0
## 1 1 Female 53
## 2 2 Male 53
## 3 3 Male 63
## 4 4 Male 52
## 5 5 Female 58
## 6 6 Male 56
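The difference between | (or) and & (and) in a subset() condition is easy to check on a handful of rows. The data frame people below is invented for this illustration.

```r
# A small data frame invented for this illustration
people <- data.frame(managerID = 1:5,
                     GenderRN  = c("Female", "Male", "Male", "Female", "Male"),
                     Age0      = c(53, 58, 63, 85, 40))

# "or": a row is kept if at least one condition holds, here every row
either <- subset(people, Age0 >= 55 | Age0 < 84)

# "and": a row is kept only if both conditions hold (ages 58 and 63)
both <- subset(people, Age0 >= 55 & Age0 < 84,
               select = c(managerID, Age0))

print(both)
```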
Generation of random data for the examples
We generate a vector x of 15 random values from a normal distribution with mean=0 and standard deviation=1 and a vector y of 15 random values from a normal distribution with mean=0.2 and standard deviation=1:
set.seed(123)
x<-rnorm(15,0,1)
i<-c(1:15)
set.seed(4)
y<-rnorm(15,0.2,1)
The simplest graphical representation of a numerical variable is the line chart provided by the command plot()
plot(x) #plots the values of x (vertical axis) as a function of the index of each value (horizontal axis)
col: specifies the color of the points
col=1 black
col=2 red
col=3 green
col=4 blue
col=5 light blue
plot(x,col=3) #color=green
pch: specifies the symbol for the points
pch=1 symbol=circle
pch=2 symbol=triangle
pch=3 symbol=plus
pch=4 symbol=cross
pch=5 symbol=diamond
plot(x,col=3, pch=4) # color=green, symbol=cross
main adds a title to the plot. main is specified within the plot() function
plot(x,col=3, main="line chart") #color=green
title adds a title to the plot. title is specified outside the plot() function
plot(x,col=3)
title("line chart")
type: specifies the type of plot, for example points joined by lines
plot(x, type="o") # add lines between points
plot(x, type="b") # add lines between points without touching the points
xlab specifies the label of the x axis
plot(x, type="o", xlab="this is the x label")
plot(x, type="o", xlab="") # removes the x label
points(): add additional points to an existing plot
plot(x, type="o", xlab="") # removes the x label
points(y, col=2) # add points y in red (col=2)
lines(): add points and lines to an existing plot
plot(x, type="o", xlab="") # removes the x label
lines(y, type="o", col=2)
ylim: define the limits of y axis
plot(x, type="o", xlab="", ylim=c(-3, 3))
lines(y, type="o", col=2)
range: range(x,y) returns the minimum and the maximum over the two samples x and y, that is, c(min(x,y), max(x,y))
range(x,y)
## [1] -1.265061 2.096540
plot(x, type="o", xlab="", ylim=range(x,y))
lines(y, type="o", col=2)
dotchart provides a dot plot of a numerical variable. A dot plot is similar to a line chart, but the values of the variable are on the horizontal axis and the indices on the vertical axis
dotchart(x, labels=i)
The following dot plot provides a graphical representation of the ranking of variable x
dotchart(x[order(x)], labels=order(x))
hist(): The histogram is one of the most important plots of a numerical variable
hist(x) # title and x label are included by default
hist(x, col=5)
boxplot()
boxplot(x)
title("boxplot of x", ylab="x")
boxplot(x, horizontal=T)
title("boxplot of x", xlab="x")
This is an example of a plot containing both a histogram and a boxplot:
hist(x,main='histogram and boxplot of x',xlab='x')
ylim: range of the y axis
we need to extend the y axis in order to make room for the boxplot
hist(x,main='histogram and boxplot of x',xlab='x',ylim=c(0,12))
add=T : this allows the addition of the boxplot in the histogram
axes=F: we remove the axes of the boxplot
hist(x,main='histogram and boxplot of x',xlab='x',ylim=c(0,12))
boxplot(x,horizontal=TRUE,at=10,add=TRUE,axes=FALSE)
boxwex: specifies the width of the box
hist(x,main='histogram and boxplot of x',xlab='x',ylim=c(0,12))
boxplot(x,horizontal=TRUE,at=10,add=TRUE,axes=FALSE, boxwex = 5)
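In the examples above the upper limit of the y axis (12) and the position of the boxplot (10) were chosen by hand. As a sketch of one alternative, hist() can first be called with plot=FALSE to obtain the bin counts, and the limits can then be derived from the tallest bar. The factors 1.5 and 1.25 are arbitrary choices made for this illustration.

```r
set.seed(123)
x <- rnorm(15, 0, 1)

# Compute the histogram without drawing it, to inspect the bin counts
h <- hist(x, plot = FALSE)
top <- max(h$counts)          # height of the tallest bar

# Leave 50% extra room above the tallest bar for the boxplot
ylim_max <- top * 1.5

hist(x, main = "histogram and boxplot of x", xlab = "x", ylim = c(0, ylim_max))
# Place the boxplot in the free space above the bars
boxplot(x, horizontal = TRUE, at = top * 1.25, add = TRUE, axes = FALSE)
```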
plot() applied to two variables provides a scatter plot
y<-x^2 # Define y as the squared values of x
plot(x,y)
plot(x,y, main="scatter plot x and y")
points(): add additional points to an existing plot
set.seed(123)
x<-rnorm(15,0,1)
y<-x^2
plot(x,y)
plot(x,y)
x1 <- c(-1,1) # we add points x=1 and x=-1
y1 <- x1^2
points(x1,y1,col=2)
plot(x,y)
points(x1,y1,col=2, pch=3) # symbol=plus
We add labels to the points
plot(x,y)
points(x1,y1,col=2, pch=3) # symbol=plus
text(x1+0.1, y1+0.2, col=2, c("A", "B"))
plot(x,y)
points(x1,y1,col=2, pch=3) # symbol=plus
legend: x and y coordinates on the plot to place the legend followed by a list of labels to use
plot(x,y)
legend(-1,2,c("Original","new"),col=c(1,2),pch=c(1,4))
col.main: specifies the color of the title
plot(x,y)
points(x1,y1,col=2, pch=4) # symbol=cross
legend(-1,3,c("Original","new"),col=c(1,2),pch=c(1,4))
title("scatter plot", col.main=3)
font.main: specifies the font of the title
plot(x,y)
points(x1,y1,col=2, pch=4) # symbol=cross
legend(-1,3,c("Original","new"),col=c(1,2),pch=c(1,4))
title("scatter plot", font.main=3)
cex: scaling factor that shrinks or enlarges text and symbols
plot(x,y)
points(x1,y1,col=2, pch=4) # symbol=cross
legend(-1,3,c("Original","new"),col=c(1,2),pch=c(1,4), cex=0.7)
#font.main: specifies the font of the title
title("scatter plot", font.main=3)
axes=F: remove axes
axis(): define new axes
plot(x,y,axes=FALSE)
points(x1,y1,col=2, pch=4) # symbol=cross
axis(1,pos=c(-0.5,0),at=seq(-2,2,by=0.4))
axis(2,pos=c(-1.5,-0.5),at=seq(0,3,by=0.5))
ann=F: removes the annotation (title and axis labels)
lab: defines new labels for the tick marks (short for the labels argument of axis())
plot(x,y,axes=FALSE, ann=F)
points(x1,y1,col=2, pch=4) # symbol=cross
axis(1,pos=c(-0.5,0),at=seq(min(x),max(x),by=0.8), lab=c("a", "b", "c", "d"))
axis(2,pos=c(-1.5,-0.5),at=seq(0,3,by=0.5))
abline(): abline(a,b) adds line y=a+bx
lty: type of line
plot(x,y,axes=FALSE, ann=F)
points(x1,y1,col=2, pch=4) # symbol=cross
axis(1,pos=c(-0.5,0),at=seq(min(x),max(x),by=0.8), lab=c("a", "b", "c", "d"))
axis(2,pos=c(-1.5,-0.5),at=seq(0,3,by=0.5))
abline(1, -0.7, lty=3)
box(): creates a box around the plot
plot(x,y,axes=FALSE, ann=F)
points(x1,y1,col=2, pch=4) # symbol=cross
axis(1,pos=c(-0.5,0),at=seq(min(x),max(x),by=0.8), lab=c("a", "b", "c", "d"))
axis(2,pos=c(-1.5,-0.5),at=seq(0,3,by=0.5))
abline(1, -0.7, lty=3)
box()
pairs(): Scatter plots of all pairs of variables in a data frame
set.seed(123)
data<-data.frame(x1=runif(10), x2=runif(10), x3=runif(10), x4=runif(10))
pairs(data)
barplot()
group<-c("A","B","C")
freq<-c(20, 50, 30)
barplot(freq)
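names.arg: labels to be plotted below each bar
barplot(freq, names.arg=group)
density: density of the shading lines of each bar, in lines per inch
barplot(freq, names.arg=group, density=c(5,30,70))
border: color of the border of the bars
barplot(freq, names.arg=group, density=c(5,30,70), border=3)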
group<-c("A","B","C")
freq1<-c(20, 50, 30)
freq2<-c(40, 20, 10)
freq<-rbind(freq1,freq2)
freq
## [,1] [,2] [,3]
## freq1 20 50 30
## freq2 40 20 10
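names.arg: when freq is a matrix, each column is drawn as one stacked bar labelled with the corresponding name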
barplot(freq, names.arg=group)
barplot(freq, names.arg=group, col=c(3,4))
beside: if TRUE, the bars of each group are drawn side by side instead of stacked
barplot(freq, names.arg=group, beside=TRUE)
barplot(freq, names.arg=group, beside=TRUE, col=c(2,3))
par(): set graphical parameters
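Before you change the graphical parameters it is convenient to store the default values. Using no.readonly=TRUE stores only the parameters that can be restored later, which avoids warnings when they are reset:
defaultpar <- par(no.readonly = TRUE)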
Example:
par(mfrow=c(2,3)) # puts 6 pictures in a plot distributed in 2 rows and 3 columns
hist(data$x1, main="Histogram x1")
hist(data$x2, main="Histogram x2")
hist(data$x3, main="Histogram x3")
boxplot(data$x1, main="Boxplot x1")
boxplot(data$x2, main="Boxplot x2")
boxplot(data$x3, main="Boxplot x3")
par(defaultpar) # reset the default graphical parameters
Important arguments:
mfrow: number of rows and columns of the grid of pictures in a plot, given as c(rows, columns)
mar: specifies the margin sizes around the plotting area in order: c(bottom, left, top, right)
col: color of symbols
pch: type of symbols; for samples run example(points)
lwd: line width
cex.*: control font sizes
pdf() redirects the plots to a pdf file. Similar functions: jpeg(), png(), postscript(), tiff()
dev.off() shuts down the current device (or the one specified)
Example:
pdf("color_chart.pdf") # creates a pdf file containing the following plot, a color chart
plot(1, 1, xlim=c(1,5.5), ylim=c(0,7), type="n", ann=FALSE)
text(1:5, rep(6,5), labels=c(0:4), cex=1:5, col=1:5)
points(1:5, rep(5,5), cex=1:5, col=1:5, pch=0:4)
text((1:5)+0.4, rep(5,5), cex=0.6, (0:4))
points(1:5, rep(4,5), cex=2, pch=(5:9))
text((1:5)+0.4, rep(4,5), cex=0.6, (5:9))
points(1:5, rep(3,5), cex=2, pch=(10:14))
text((1:5)+0.4, rep(3,5), cex=0.6, (10:14))
points(1:5, rep(2,5), cex=2, pch=(15:19))
text((1:5)+0.4, rep(2,5), cex=0.6, (15:19))
points((1:6)*0.8+0.2, rep(1,6), cex=2, pch=(20:25))
text((1:6)*0.8+0.5, rep(1,6), cex=0.6, (20:25))
dev.off()
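A shorter sketch of the same pattern is shown below. It writes to a temporary file and checks that the file exists once the device has been closed; the data, the file name and the page size are assumptions made for this illustration.

```r
set.seed(1)
values <- rnorm(50)                      # invented data

plot_file <- tempfile(fileext = ".pdf")  # temporary file, used only for the example
pdf(plot_file, width = 7, height = 5)    # page size in inches
plot(values, type = "o", main = "Example plot")
dev.off()                                # closing the device finishes writing the file

print(file.exists(plot_file))
```

Until dev.off() is called the file may be incomplete, so it is worth closing the device as soon as the plot is finished.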
R Tutorial. An R Introduction to Statistics
An Introduction to R from R Development Core Team
Create a new script called “intro_R_xxx.R” (replace xxx by your surname) that contains the code of the following exercises.
Code to execute a script called “myscript.R”
Code to assign the value A to a variable x
Code to generate a sequence from 7 to 30 with increment 3
Code to obtain information about function glm
Code to list all the objects in the current environment
Code to remove all objects
Code to specify the following path to the working directory: C:
Create a vector x containing the numbers 1, 2, 1, 1, 1, 2
Create a vector y containing the words yes, no, no, yes, no
Compute the number of elements in vector y
Code to obtain the sequence of integer numbers from 10 to 25
Use the function rep() to generate the sequence 1, 2, 1, 2, 1, 2
Code to generate the sequence 1, 1, 1, 2, 2, 2
Code to generate a sequence containing 7 yes and 5 no
Code to obtain the sequence 40, 35, 30, 25, 20, 15, 10
Read file example.txt and store it in a data frame called “example”
Show rows 5,11,18 and 20 in data “example”
Show variable “sex” for rows from 15 to 50 in data “example”
Change the name of the “cc” column to “Case/Control” in data “example”
Export “example” to a “csv” semicolon delimited file without the names of the rows and without quotations
Retrieve the fourth element in vector x=(3, -1, 0, 2, -5, 7, 1)
Retrieve the first, second and fifth elements in vector x=(3, -1, 0, 2, -5, 7, 1)
Retrieve all the elements in vector x=(3, -1, 0, 2, -5, 7, 1) except the second one
Change the value of the first and second elements in x=(3, -1, 0, 2, -5, 7, 1) by 0
Assign the value 0 to the elements in x=(3, -1, 0, 2, -5, 7, 1) that are larger than 2
Create a matrix M with 4 rows and 3 columns and fill it by rows with even numbers from 2 to 24
Obtain the number of rows and columns of matrix M
Retrieve the element in the first row and third column of matrix M
Retrieve all the elements in the third column of matrix M
Retrieve the third and fourth elements in the second column of matrix M
Retrieve a matrix containing all rows in M except the first one
Add a new column at the beginning of matrix M with the integers from 1 to 4
Add a row at the end of matrix M with values 2, 4, 8
Generate a data frame called chol (for cholesterol) containing the following variables (columns): id=(1, 2, 3, 4, 5), gender=(1, 1, 2, 1, 2), LDL=(237, 256, 198, 287, 212)
In the previous data frame chol use the function rownames() to assign to each row the name of the patient: John, Peter, Hellen, Mat and Mary
Show the first 3 rows in data frame chol
Retrieve the LDL cholesterol levels of Peter using his position in the data frame
Retrieve the LDL cholesterol levels of Peter using his name in the code
Save the LDL cholesterol levels of the 5 individuals in a new vector called ldl_chol
Create a new data frame named chol_high including only those individuals with LDL levels above 240
Let’s consider the vector x=(0.6, -1.3, 0.98, -0.4, 0.16) and perform a t.test on x for the null hypothesis that the mean is equal to 0 and save the output in an object called ttestx
Show the attributes of object ttestx
From the output of ttestx retrieve the confidence interval of the mean
Check the data type of gender in data frame chol
Transform variable gender from data frame chol into a factor variable called gender1 with 1=male and 2=female and with males as the reference group:
Transform variable gender from data frame chol into a factor variable called gender2 with 1=male and 2=female and with females as the reference group:
Write the data frame chol into a text file called cholesterol.txt
Write the data frame chol into a csv file called cholesterol.csv
Write the code to install and load the R package mbmdr from CRAN
Get a numerical summary of x=(1.5, 2.3, 4, 5.6, 2.1)
Obtain the 40th percentile of x=(1.5, 2.3, 4, 5.6, 2.1)
Obtain the percentages of males and females in gender=(1, 1, 2, 1, 2)
Obtain the Pearson and Spearman correlation coefficient between x=1:10 and y=x^2
Test for the equality of variances in LDL cholesterol levels between males and females, assuming that LDL levels are normally distributed
Test for differences in LDL cholesterol mean levels between males and females, assuming that LDL levels are normally distributed
Test for differences in LDL cholesterol mean levels between males and females, without the assumption of normality
Plot a histogram of x<-rnorm(100, 10, 4).
Save the previous histogram in a pdf file called histogram.pdf