GENERAL INFORMATION REGARDING TUTORIALS:
Directive 1: DON’T hesitate to call the attention of your Teaching Assistant. Be nice to them; they are great! Also, if they feel that your question is relevant to others in the room, they may bring your question to the whole group.
Directive 2: You don’t need to memorize commands in R. With enough practice and looking at the tutorials again, you will be able to re-use commands towards new applications.
1ST TUTORIAL: The R environment and its really basic commands.
This 1st tutorial is meant to help you get acquainted with the R environment for statistical computing.
This tutorial was produced using code adapted from “R and Data Mining: Examples and Case Studies” by Yanchang Zhao (2013) and from Raphael Gottardo’s lectures on R and basic statistics (Exploratory Data Analysis and Essential Statistics using R).
THE BASICS: R is a free system for statistical and graphical analysis that is available for several platforms (Windows, Mac OS, Unix). R can be seen both as a programming language and as a package of statistical functions. Although R comes with a number of basic statistical and graphical functions such as mean, variance and histograms, many researchers have developed advanced routines that can be incorporated into R as functions. These functions are usually grouped into libraries that are freely available on the Internet at the R project site (http://www.r-project.org/), on developers’ homepages and in publications (e.g., electronic supplements). Given its flexibility and the fact that it is free, R has become one of the main statistical packages used in many scientific applications.
A LITTLE MORE ABOUT R: R is an integrated suite of software facilities for data manipulation, calculation and graphical display. It includes an effective data handling and storage facility; a suite of operators for calculations on arrays, in particular matrices; a large, coherent, integrated collection of intermediate tools for data analysis; graphical facilities for data analysis and display either on-screen or on hardcopy; and a well-developed, simple and effective programming language which includes conditionals, loops, user-defined recursive functions and input and output facilities.
In Moodle, you will find an easy Installation-guide-for-R-and-RStudio (in pdf). R is the statistical environment and RStudio is an interface that makes R friendlier to use. Or, if you prefer a YouTube video, go to https://www.youtube.com/watch?v=d-u_7vdag-0
Another R tutorial for beginners http://www.cyclismo.org/tutorial/R/. We can basically find tons of R tutorials on the internet, including YouTube videos.
Let’s start - R as a calculator: here we will learn some really simple operations. In the GREY zones below you will find the operation to enter in the RStudio console, and in the WHITE zones (the lines starting with ##) you will see the answer you get once you hit enter. Don’t get used to seeing the answers, though; I’m including them here so that you get familiar with R as quickly as possible.
1+1
## [1] 2
exp(-2)
## [1] 0.1353353
pi
## [1] 3.141593
exp(10)
## [1] 22026.47
sin(2*pi) # this should be in effect zero
## [1] -2.449294e-16
round(sin(2*pi)) # now it is
## [1] 0
-1/0 # don't be afraid
## [1] -Inf
sqrt(10)
## [1] 3.162278
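The sin(2*pi) result above is a reminder that computers store numbers with tiny rounding errors. As a small sketch (the tolerance 1e-10 is just an assumption chosen for illustration), here are a few ways to check whether a value is "zero in effect" without rounding it:

```r
# sin(2*pi) is not stored as exactly zero because of rounding error
value <- sin(2 * pi)

# Exact comparison: FALSE for the value printed above
value == 0

# Compare against a small tolerance instead (1e-10 is an arbitrary choice)
abs(value) < 1e-10

# all.equal() compares two numbers using a built-in tolerance
isTRUE(all.equal(value, 0))
```

The last two checks return TRUE, which is usually what we want when comparing the results of calculations.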
Assigning values to a variable
x=2
y=2
x+y
## [1] 4
c=2
Creating a sequence of numbers
x=0:5
x
## [1] 0 1 2 3 4 5
Note that a large number of R users do not use the symbol = but rather <-, which is the “assign” operator. The distinction will not be immediately useful to you in this course, but be aware of it when reading R material online and even in this course; I tend to use <- instead of =, but either is fine here.
Re-creating a sequence of numbers
x<-0:5
x
## [1] 0 1 2 3 4 5
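The colon operator always moves in steps of 1. As a short sketch of what else is possible, the function seq lets you choose the step size or the number of values (the variable names evens, quarters and countdown are made up for this example):

```r
# Steps of 2 from 0 to 10
evens <- seq(from = 0, to = 10, by = 2)
evens
## [1]  0  2  4  6  8 10

# Five equally spaced values between 0 and 1
quarters <- seq(from = 0, to = 1, length.out = 5)
quarters
## [1] 0.00 0.25 0.50 0.75 1.00

# The colon operator also counts downwards
countdown <- 5:0
countdown
## [1] 5 4 3 2 1 0
```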
Your first plot in R:
x<-1:7
x
y<-11:17
y
plot(x,y)
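Plots are easier to read with axis labels and a title. Here is a minimal sketch using the arguments xlab, ylab and main of the function plot; the label text is made up for illustration:

```r
x <- 1:7
y <- 11:17

# xlab and ylab set the axis labels; main sets the title of the plot
plot(x, y, xlab = "x values", ylab = "y values", main = "My first plot")
```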
Creating a vector by combining (function c) values:
x <- c(2,3,5,2,7,1)
x
## [1] 2 3 5 2 7 1
More on calculations and dealing with vectors. The symbol # is used to annotate the code (make comments): whatever text appears after # is not interpreted by R when you hit enter.
weight<-c(60,72,75,90,95,72)
# calling a particular cell
weight[1]
## [1] 60
weight[2]
## [1] 72
weight
## [1] 60 72 75 90 95 72
height<-c(1.75,1.80,1.65,1.90,1.74,1.91)
# the body mass index (BMI) is weight divided by height squared
bmi<-weight/height^2
bmi
## [1] 19.59184 22.22222 27.54821 24.93075 31.37799 19.73630
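Square brackets can pick out more than one cell at a time, and calculations on a vector are applied to every value in it. A small sketch using the same weight values as above:

```r
weight <- c(60, 72, 75, 90, 95, 72)

# How many values are in the vector?
length(weight)
## [1] 6

# The first and third values
weight[c(1, 3)]
## [1] 60 75

# The second to the fourth values
weight[2:4]
## [1] 72 75 90

# Only the values larger than 80
weight[weight > 80]
## [1] 90 95

# A calculation is applied to each value in turn
weight / 2
## [1] 30.0 36.0 37.5 45.0 47.5 36.0
```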
Some basic operations
mean(height)
## [1] 1.791667
sum(height)
## [1] 10.75
Or allocating the mean to a variable; note that you can give any name to a variable. Here I used mean.x to make it more intuitive; but you could have used doNotGetDistracted (or whatever really) instead:
mean.x<-mean(height)
mean.x
## [1] 1.791667
doNotGetDistracted<-mean(height)
doNotGetDistracted
## [1] 1.791667
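A few other basic functions work the same way as mean and sum. The sketch below uses the same height values; it also shows that the mean is simply the sum divided by the number of values:

```r
height <- c(1.75, 1.80, 1.65, 1.90, 1.74, 1.91)

min(height)      # smallest value
## [1] 1.65
max(height)      # largest value
## [1] 1.91
median(height)   # middle value
## [1] 1.775
sd(height)       # standard deviation

# The mean calculated "by hand": sum divided by the number of values
sum(height) / length(height)
## [1] 1.791667
```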
We can sort values in ascending order
sort(x)
## [1] 1 2 2 3 5 7
or descending order
sort(x,decreasing = TRUE)
## [1] 7 5 3 2 2 1
Above, we learned something about R functions: they have “default” modes. The default mode of the function sort is to arrange numbers in ascending order, BUT you can change the default by switching the argument “decreasing” to TRUE. As such, the default of the argument “decreasing” in the function sort is FALSE. This can be found by asking for information about the function as follows:
? sort
We will learn a lot about functions in the tutorials, so don’t worry for now about their defaults and how they work.
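To see defaults at work in another function, here is a small sketch with round, which we used earlier. Its argument digits has a default of 0, so it rounds to whole numbers unless told otherwise (the number 3.14159 is just an example value):

```r
# Default: digits = 0, so the result is a whole number
round(3.14159)
## [1] 3

# Changing the default by naming the argument
round(3.14159, digits = 2)
## [1] 3.14

# The same, giving the argument by position instead of by name
round(3.14159, 2)
## [1] 3.14

# sort with its default (ascending) and with the default changed
sort(c(3, 1, 2))
## [1] 1 2 3
sort(c(3, 1, 2), decreasing = TRUE)
## [1] 3 2 1
```

Naming the argument (digits = 2) is longer to type but makes the code easier to read later.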
We are almost done for today. Let’s now learn how to read a file. Download the file example_file.csv found in Moodle (second week). If you can’t download it to the hard disk of the lab computer, download it directly to your USB key (as requested in the mandatory material). The file is in the widely used CSV format, which can be produced by Excel among other programs; CSV stands for comma-separated values. This format is so widely used that even Wikipedia has a page dedicated to it: https://en.wikipedia.org/wiki/Comma-separated_values. R can read several types of file formats, but CSV and text (.txt) are likely the most commonly used.
The function read.csv reads a file in CSV format and saves the data into a variable (below we called this variable my.first.data). The option file.choose() tells R to open a window in which you can simply click and choose the file.
my.first.data<-read.csv(file.choose())
my.first.data
Let’s open the data in a window in which you can see the values in a much better format
View(my.first.data)
Often data are quite large (lots of rows) and you just want to look at the first few rows and see the names of the variables (columns):
head(my.first.data)
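If you want to try read.csv without the course file, here is a self-contained sketch. It writes a tiny made-up data set to a temporary CSV file and reads it back; the names toy.data, toy.file and toy.read, and the values themselves, are invented for this illustration and are not the contents of example_file.csv:

```r
# A tiny, made-up data set with two columns
toy.data <- data.frame(id = 1:3, weight = c(60, 72, 75))

# Save it as a CSV file in a temporary location
toy.file <- tempfile(fileext = ".csv")
write.csv(toy.data, file = toy.file, row.names = FALSE)

# Read the file back, this time giving the file name instead of file.choose()
toy.read <- read.csv(toy.file)

head(toy.read)    # first rows
names(toy.read)   # names of the variables (columns)
nrow(toy.read)    # number of rows
str(toy.read)     # type of each column
```

Giving the file name directly, rather than using file.choose(), is handy when you want to run the same commands again without clicking through a window each time.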
Perhaps the most important distribution in statistics is the normal distribution. During the course we will need to generate normally distributed observations (a sample) from a normal population with a given mean and standard deviation. The function rnorm has three main arguments, namely sample size (n), mean (average) and standard deviation (sd). Below, we will generate a sample with 65 observations from a normally distributed population with a mean of 45 and a standard deviation of 5.
sampled.values <- rnorm(n=65,mean=45,sd=5)
sampled.values
mean(sampled.values)
sd(sampled.values)
Let’s produce a histogram (frequency distribution) of the sampled values:
hist(sampled.values)
Obviously the sample does not have the same mean and standard deviation as the population due to sampling variation. Think about the frog example seen in the last lecture. Although you sampled from a population with 50% left- and 50% right-handed individuals, some samples had a different ratio of Left/Right than 1 in terms of individuals.
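Because rnorm draws random values, you will get different numbers every time you run it, and different numbers from your neighbour. As a small sketch, the function set.seed fixes the starting point of the random number generator so that a draw can be repeated exactly (the seed 42 is an arbitrary choice, and the names of the draws are made up):

```r
# Same seed, same sample
set.seed(42)
first.draw <- rnorm(n = 65, mean = 45, sd = 5)

set.seed(42)
second.draw <- rnorm(n = 65, mean = 45, sd = 5)

identical(first.draw, second.draw)
## [1] TRUE

# Without resetting the seed, the next sample is a new one
third.draw <- rnorm(n = 65, mean = 45, sd = 5)
mean(first.draw)
mean(third.draw)
```

The two means at the end will differ from each other and from 45, which is sampling variation again.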
Let’s take now a much larger sample (n = 100000):
sampled.values <- rnorm(n=100000,mean=45,sd=5)
sampled.values
mean(sampled.values)
sd(sampled.values)
Let’s produce a histogram (frequency distribution) of the sampled values:
hist(sampled.values)
How close were the sample values (mean and standard deviation) to the population values? Remember that as the sample size increases, the sample values will likely be closer to the true population values (parameters).
The last thing we will learn today is a computational operation called the “for loop”. Let’s learn how the for loop works in R via a simple operation, i.e., summing numbers. A for loop repeats a set of commands a number of times. We need to set a counter and the values at which we want the counter to start and end!
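One way to answer this is to calculate how far each sample value is from the population value. Below is a minimal sketch; the names small.sample and large.sample are made up, and set.seed(1) is only there so that the example gives the same numbers each time it is run:

```r
set.seed(1)  # arbitrary seed, used only to make the example repeatable

small.sample <- rnorm(n = 65, mean = 45, sd = 5)
large.sample <- rnorm(n = 100000, mean = 45, sd = 5)

# Distance between each sample mean and the population mean (45)
abs(mean(small.sample) - 45)
abs(mean(large.sample) - 45)

# Distance between each sample standard deviation and the population one (5)
abs(sd(small.sample) - 5)
abs(sd(large.sample) - 5)
```

The distances for the large sample should typically be much smaller, though with random samples this is a tendency rather than a guarantee.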
First create a vector containing the numbers 1,2,3,4 and 5
my.small.series <- c(1,2,3,4,5)
Let’s calculate the sum of the observations in the vector above via a simple for loop. The counter below was called myCounter but any name (e.g., MyFirstCounter) would work. The counter was set to start at 1 and end at 5. Note that to calculate the sum of the values in the vector my.small.series, we need to access each value in that series (vector) one by one. The counter does that, i.e., myCounter increases automatically by one unit (1) at a time, so that it starts at 1, then changes to 2, then to 3… and so on until it reaches 5; the loop then stops because it has reached the end value. Loops are extremely useful!
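# start the running total at zero; it is called total so that it does not share a name with the R function sum
total <- 0
for (myCounter in 1:5){
  # add the value at position myCounter to the running total
  total <- total + my.small.series[myCounter]
}
total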
Obviously we can use the R command sum to do the same operation.
sum(my.small.series)
Hopefully you have now a general idea of how a for loop works!
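To watch the counter at work, here is a small sketch of the same kind of loop that prints the running total at every step. The names position and running.total are made up for this example, and seq_along is used as one way to let the counter run from 1 to the length of the vector, whatever that length is:

```r
my.small.series <- c(1, 2, 3, 4, 5)

running.total <- 0
for (position in seq_along(my.small.series)) {
  # add the value at the current position
  running.total <- running.total + my.small.series[position]
  # show where we are and the total so far
  cat("position:", position, "running total:", running.total, "\n")
}

running.total
## [1] 15
```

The running totals printed along the way are 1, 3, 6, 10 and 15.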
We will learn lots more about R and graphs in the tutorials. For now, you have finished your first tutorial! Don’t hesitate to ask your TA any additional questions.