R Introduction

R is a powerful language and environment for statistical computing and graphics. It is free, open-source software and part of the GNU project, and it is similar to the commercial S language and environment, which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be considered a different implementation of S, and it is widely used as an educational language and research tool. The main advantages of R are that it is free and that a lot of help is available online. It is quite similar to other programming packages such as MATLAB (which is not free), but more user-friendly than programming languages such as C++ or Fortran. You can use R on its own, but for educational purposes we prefer to use R in combination with the RStudio interface (also free), which has an organized layout and several extra options.

Getting started
1. Install R.
2. Install RStudio.
3. RStudio layout: console window | editor window | environment/history window | files/plots/packages/help window.
[Figure: RStudio layout]

1. Set working directory

Before you start working, please set your working directory to where all your data and script files are or should be stored.
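As a minimal sketch, the working directory can be set and checked from the console with setwd() and getwd(). The folder used here is only a stand-in (a temporary directory, chosen so the example runs anywhere); replace it with the path to your own project folder.

```r
# Stand-in folder for illustration: replace it with your own path,
# for example "C:/Users/me/project" (a hypothetical path)
project_dir <- tempdir()

old_dir <- setwd(project_dir) # setwd() returns the previous directory invisibly
getwd()                       # show the current working directory
list.files()                  # list the files stored there

setwd(old_dir)                # switch back to where we started
```

Storing the value returned by setwd() makes it easy to return to the previous folder afterwards.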


You can also choose the directory manually (in RStudio: Session > Set Working Directory > Choose Directory); mind the pop-up window that opens.


2. Install/load library

R can do many statistical and data analyses. They are organized in so-called packages or libraries. With the standard installation, the most common packages are installed; other packages need to be installed once and then loaded in every session that uses them.

install.packages("wordcloud") #install the package (only needed once)
library(wordcloud) #load the package before using it
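To avoid reinstalling a package every time a script runs, you can first check whether it is already available. The sketch below uses requireNamespace() from base R; the helper name is_installed is made up for this example and is not part of R.

```r
# is_installed() is a hypothetical helper name, not part of base R
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE) # TRUE if the package can be loaded
}

is_installed("stats") # 'stats' ships with R, so this should be TRUE

if (!is_installed("wordcloud")) {
  message("wordcloud is missing; run install.packages(\"wordcloud\") first")
}
```

The check only reports a missing package; it leaves the decision to install to you.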

3. R basics

Some first examples of R commands:

print("hello world") #run the current line by clicking Run or pressing Ctrl+Enter
10^2 + 36 #calculations
sqrt(9) #square root
3 < 4 #logical expression
2 + 2 == 4 #a double equals sign tests for equality

4. R environment

You can save values in a variable.

x = 85 # OR x <- 85
#both <- and = assign a value in R

You can see that “x” appears in the workspace window, which means that R has now saved the value in “x”.

x #request x to show its value
x*5 #do calculations with x
x = x + 5 #assign a new value to x

Exercise: calculate BMI

height = 1.75
weight = 60
bmi = weight/height^2 #the caret ^ raises to a power
bmi #show the result
## [1] 19.59184

Check whether your BMI falls into the normal range:

bmi > 18.5 & bmi < 25 #TRUE if bmi is within the normal range
## [1] TRUE
bmi < 18.5 | bmi > 25 #TRUE if bmi is outside the normal range
## [1] FALSE

5. Functions

You call a function by typing its name, followed by one or more arguments in parentheses. Let’s try using the sum function:

sum(1,2,4,7) #sum() is a function
rep("penny",times=3) #times=3 is an argument that tells rep how many times to repeat
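Arguments can be matched by position or by name, and you can define functions of your own in the same style. The following is a small sketch; add_step and its arguments are invented for illustration.

```r
rep("penny", times = 3)     # first argument by position, second by name
rep(times = 3, x = "penny") # named arguments may come in any order

# add_step() is a made-up function for illustration
add_step <- function(value, step = 5) {
  value + step # the last evaluated expression is returned
}

add_step(85)            # uses the default step of 5
add_step(85, step = 10) # overrides the default
```

Giving an argument a default value, as with step here, lets callers leave it out when the default is good enough.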

the most useful function in R


6. Data structures

Vector, matrix, data frame, and list
* Vector: a one-dimensional sequence of values (sometimes called an array). The values can be numbers, strings, or logical values, as long as they are all of the same type.
* Matrix: a two-dimensional data structure with rows and columns.
* Data frame: laid out like a matrix, except that its columns can hold different data types.
* List: a container that can hold a mixture of data structures.
Source: Kabacoff (2011) R in Action
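The "same type" rule for vectors can be seen directly: when values of different types are combined with c(), R converts them to one common type, whereas a list keeps each element as it is. A short sketch with made-up values:

```r
mixed <- c(1, "a", TRUE) # numbers and logicals are converted to strings
mixed
class(mixed)             # "character"

nums <- c(1, TRUE, FALSE) # logicals become 1 and 0 among numbers
class(nums)               # "numeric"

stuff <- list(1, "a", TRUE) # a list keeps each element's own type
sapply(stuff, class)
```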

6.1. Vector

x = c() #the c function combines values into a vector or list; with no arguments it returns NULL
x = c(1,2,4,7)
x #request x
y = c("a","b","c","d") #vector of characters
y[2] #access the second value in y
y[c(2,3)] #access multiple values
y[2] = "cat" #assign a new value to the second element of y
y[4:5] = c("dog","bear") #assign new values; y grows to five elements
#Now try to access the 2nd, 4th, and 5th words
y[c(2,4,5)] #not y[2,4,5]

6.2. Matrix
Matrices are nothing more than 2-dimensional vectors. To define a matrix, use the function matrix:

m = matrix() #a 1 by 1 matrix holding NA
m = matrix(0,3,4) #a 3 by 4 matrix filled with zeros

#use a vector to initialize a matrix's values, and transform it into a 3 by 4 matrix
x = 1:12
m = matrix(x,3,4) #transform the vector into a matrix, filled column by column

#Try getting a value from the matrix:
m[2,3] #value in row 2, column 3

#assign a new value
m[2,3] = 0

#get an entire row of the matrix:
m[2,]

#OR an entire column:
m[,3]

#read multiple rows or columns:
m[1:2,] #rows 1 and 2
m[,c(1,3)] #columns 1 and 3
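By default matrix() fills the matrix column by column, which can be surprising. As a small sketch, the byrow argument switches to row-wise filling, and dim() reports the shape (the variable names by_col and by_row are chosen for this example):

```r
x <- 1:12
by_col <- matrix(x, nrow = 3, ncol = 4)               # default: fill column by column
by_row <- matrix(x, nrow = 3, ncol = 4, byrow = TRUE) # fill row by row instead

by_col[1, ] # first row: 1 4 7 10
by_row[1, ] # first row: 1 2 3 4
dim(by_col) # number of rows and columns: 3 4
```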

6.3. Data frame
A data frame is a data set that can hold multiple types of data, such as numeric and string. It resembles a matrix with variable names above the columns.

df = data.frame()
#manually input three vectors
age = c(20,25,30)
gender = c("male","female","male")
score = c(65,75,85)

# create new data frame using function 'data.frame'
df = data.frame(age,gender,score)

#data frame subsetting:
#use $ to request a certain column by name
df$age #request the column named 'age'
names(df) #request all the variable names in df

# add a new variable named 'midterm' to the data frame
df$midterm = c(7,8,9)
#create a new variable based on existing variables in the data frame
df$sum = df$score + df$midterm

A list is a container that can hold all types of data, including other lists:

L = list(v1=x,v2=y,matrix=m,dataframe=df)
L #show the values in L
L[[1]] #request the first element of the list
L$v1 #request an element by name
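Single and double brackets behave differently on lists, which is a common source of confusion. The sketch below uses a small made-up list to show the difference:

```r
L <- list(v1 = 1:3, v2 = c("a", "b"))

part <- L[1]   # single brackets: a shorter list that still holds v1
item <- L[[1]] # double brackets: the vector stored in v1

class(part)    # "list"
class(item)    # "integer"
names(L)       # the element names: "v1" "v2"
sum(L[["v1"]]) # elements can also be extracted by name
```

Use double brackets (or $) when you want to work with the stored value itself, for example to do arithmetic on it.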

7. Class

Check the type of the values in your data or variables. A value in R can belong to one of several classes. Three of the most important are ‘numeric’, ‘character’, and date classes such as ‘Date’. You can ask R which class a certain variable has with class().

class(20) #numeric
## [1] "numeric"
class("male") #character
## [1] "character"
d = Sys.Date()
class(d) #Date
## [1] "Date"
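Knowing the class matters because it determines which operations work. As a sketch with made-up values, the as.numeric() and as.Date() functions convert text into numbers and dates:

```r
score <- "85"     # a number stored as text
class(score)      # "character"
is.numeric(score) # FALSE

score_num <- as.numeric(score) # convert text to a number
score_num + 5                  # arithmetic now works: 90

exam_day <- as.Date("2024-01-15") # made-up date in year-month-day format
class(exam_day)                   # "Date"
exam_day + 7                      # dates support arithmetic in days
```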

8. Programming tools

8.1. if statement

#use the bmi example again
bmi = 19
if (bmi > 18.5 && bmi < 25) {
  print("your bmi is normal")
} else {
  print("not normal")
}
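When there are more than two outcomes, conditions can be chained with else if. The sketch below wraps the chain in a function; the name bmi_label and the label texts are invented for this example, and it treats a value of exactly 18.5 as normal, which is a simplifying assumption.

```r
# bmi_label() and its labels are hypothetical, for illustration only
bmi_label <- function(bmi) {
  if (bmi < 18.5) {
    "below normal range"
  } else if (bmi < 25) {
    "normal"
  } else {
    "above normal range"
  }
}

bmi_label(19)
bmi_label(27)
```

The conditions are checked from top to bottom, so the second branch is only reached when bmi is at least 18.5.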

8.2. for loops

h = seq(from=1, to=8) #the numbers 1 to 8
s = NULL #start with an empty result
for(i in seq_along(h)){
  s[i] = h[i] * 10 #store 10 times the i-th value of h
}
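Many loops like this one can be replaced by a single vectorized expression, because arithmetic in R works on whole vectors. A minimal sketch comparing the two approaches (the names s_loop and s_vec are chosen for this example):

```r
h <- seq(from = 1, to = 8)

# Loop version: fill a pre-allocated vector one element at a time
s_loop <- numeric(length(h))
for (i in seq_along(h)) {
  s_loop[i] <- h[i] * 10
}

# Vectorized version: multiply every element at once
s_vec <- h * 10

identical(s_loop, s_vec) # both approaches give the same values
```

The vectorized form is shorter and usually faster; a loop remains useful when each step depends on the result of the previous one.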

Data Management

1. export data

data(mtcars) #load the built-in dataset 'mtcars'
write.csv(mtcars,"mtcars.csv",row.names = FALSE)
#row.names=FALSE means do not write the row names
#prefer csv: it can easily be edited and read by other software

write.table(mtcars,"mtcars.txt",row.names = FALSE) #export data as .txt (space-separated by default)

2. import csv file

mydata = read.csv("mtcars.csv",header=TRUE)
#header=TRUE means the first row holds the column names

mydata = read.table("mtcars.txt",header=TRUE) #the default separator is whitespace, matching write.table's output
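A quick way to confirm that an export and import went well is a round trip: write the data, read it back, and compare. The sketch below writes to a temporary file (an assumption made so the example leaves no files behind) instead of the working directory.

```r
data(mtcars)
csv_file <- tempfile(fileext = ".csv") # temporary file used only for this check

write.csv(mtcars, csv_file, row.names = FALSE)
back <- read.csv(csv_file, header = TRUE)

dim(back)                        # number of rows and columns read back
identical(names(back), names(mtcars)) # the column names survive the round trip
all.equal(back$mpg, mtcars$mpg)  # TRUE if the mpg values match

unlink(csv_file)                 # remove the temporary file
```

Note that the car names are lost here because row.names = FALSE leaves them out of the file.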

3. rename columns in data frame

names(mydata)[1] = "fuel_economy"

4. recode data

# recode the engine displacement into three categories: low, medium, high
mydata$rank[mydata$disp <= 160] = "L"
mydata$rank[mydata$disp > 160 & mydata$disp <= 300] = "M"
mydata$rank[mydata$disp > 300] = "H"
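The same recoding can be sketched in one step with cut(), which splits a numeric variable at the given break points. This is an alternative to the three assignments, not a replacement the tutorial prescribes; the variable name disp_rank is chosen for this example.

```r
data(mtcars)
disp <- mtcars$disp

# Breaks at 160 and 300; each interval includes its upper bound (right = TRUE)
disp_rank <- cut(disp, breaks = c(-Inf, 160, 300, Inf), labels = c("L", "M", "H"))

table(disp_rank) # number of cars in each category
class(disp_rank) # cut() returns a factor, not a character vector
```

Because the result is a factor, use as.character(disp_rank) if plain text labels are needed.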

5. subsetting dataset

#Selecting/keeping variables
newdata1 = mydata[,c(1,3)] #keep the first and third columns
newdata1 = mydata[,c("fuel_economy","rank")] #keep columns by name
#Dropping variables
newdata2 = mydata[,-(2:5)] #drop the second to the fifth columns of the data frame
#Selecting observations
mydata[mydata$rank == "H",] #select rows where rank equals 'H'
mydata[mydata$wt > 4,] #select rows where wt is larger than 4
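Base R also offers subset(), which lets you write the condition and the columns without repeating the data frame name. A short sketch on the original mtcars data (the result names are chosen for this example):

```r
data(mtcars)

heavy <- subset(mtcars, wt > 4)                            # rows where wt exceeds 4
heavy_small <- subset(mtcars, wt > 4, select = c(mpg, wt)) # same rows, two columns
no_cyl <- subset(mtcars, select = -cyl)                    # drop one column by name

nrow(heavy)
names(heavy_small)
```

subset() is convenient for interactive work; the bracket notation shown above is the more general tool.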

6. deal with missing values

mydata$cyl[5] = NA #assign NA to one entry; NA means missing value
sum(mydata$cyl) #returns NA, because there is a missing value in the vector
sum(mydata$cyl,na.rm=TRUE) #na.rm=TRUE removes missing values before summing

which(is.na(mydata$cyl)) #identify the positions of the NA values
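The same ideas apply to other summaries. As a sketch on a small made-up vector, you can count the missing values, compute a mean that ignores them, or drop them altogether:

```r
cyl <- c(6, 6, 4, NA, 8) # small made-up vector with one missing value

is.na(cyl)                       # TRUE where a value is missing
sum(is.na(cyl))                  # count the missing values
mean(cyl, na.rm = TRUE)          # average of the non-missing values
cyl_complete <- cyl[!is.na(cyl)] # keep only the non-missing values
length(cyl_complete)
```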

7. inspect data

head(mydata) #show the first 6 rows
str(mydata) #display the structure of the data
summary(mydata) #descriptive statistics of the data
class(mydata$cyl) #data class
table(mydata$cyl) #frequency table

You can refer to this cheat sheet for the R basics: https://www.rstudio.com/wp-content/uploads/2016/10/r-cheat-sheet-3.pdf


  1. Paul Torfs & Claudia Brauer: A (very) short introduction to R.
  2. Quick-R: https://www.statmethods.net/