R is a language and environment for statistical computing and graphics. R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, ...) and graphical techniques, and is highly extensible.
This notebook is a tutorial on how to use R.
Before starting to work with R, we need to set the working directory to the location of the source file.
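A minimal sketch of checking and changing the working directory with base R's getwd() and setwd(). Here tempdir() is only a stand-in (an assumption for illustration) for the folder that holds your source file; substitute your own path.

```r
# Remember the current working directory so it can be restored afterwards.
old_dir <- getwd()

# tempdir() is a placeholder for the folder that contains your source file.
setwd(tempdir())
getwd()          # displays the new working directory

# Go back to where we started.
setwd(old_dir)
```

Relative paths such as "data/top_il_income.csv" are resolved against whatever getwd() reports, which is why this step comes first.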
First we will begin with a few basic operations.
We assign values to variables using the assignment operator ‘=’. Another, more general form of assignment is the ‘<-’ operator. A variable stores a value or an object (e.g. a function). Note: R is case sensitive, so X and x are treated as two different variable names.
x = 1000 # Here x is the variable name and 1000 is the value assigned to x.
number = 3 # Here number is the variable name and 3 is the value assigned to number.
zz <- 20 # Here zz is the variable name and 20 is the value assigned to zz.
vec = c(500,100,10,1) # This creates a vector named vec using the generic combine function c().
myvar = 99999
x # This calls variable x and displays its value.
## [1] 1000
zz # This calls variable zz and displays its value.
## [1] 20
myvar
## [1] 99999
vec # This calls vector vec and displays all of its content.
## [1] 500 100 10 1
vec[2] # This calls vector vec and displays its second element.
## [1] 100
vec[1:3] # This calls vector vec and displays its first through third elements.
## [1] 500 100 10
vec[2:4]
## [1] 100 10 1
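The note on case sensitivity and the two assignment operators can be checked with a short sketch. The variable names total, Total and count are hypothetical and chosen only for this illustration.

```r
total <- 10    # lowercase name
Total <- 25    # a separate variable: R treats total and Total as distinct

total          # 10
Total          # 25

# Both '=' and '<-' assign a value at the top level of a script.
count = 3
count <- count + 1
count          # 4
```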
The following are some simple arithmetic operations.
60+47 # This operation adds 60 and 47.
## [1] 107
10-7 # This operation subtracts 7 from 10.
## [1] 3
2*3 # This operation multiplies 2 and 3. Multiplication uses the symbol *
## [1] 6
2000/2 # This is a division operation.
## [1] 1000
4^2 # This is how to compute 4 to the power of 2.
## [1] 16
2^3
## [1] 8
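As a small extension of the operations above, the sketch below shows how R orders operations and how to get an integer quotient and a remainder. The result names are hypothetical and exist only so the values can be inspected.

```r
no_parens   <- 2 + 3 * 4     # 14: multiplication is evaluated before addition
with_parens <- (2 + 3) * 4   # 20: parentheses change the order of evaluation
quotient    <- 7 %/% 2       # 3: integer division
remainder   <- 7 %% 2        # 1: remainder (modulo)
neg_square  <- -2^2          # -4: ^ is applied before the unary minus

no_parens
with_parens
quotient
remainder
neg_square
```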
R works with numerous data types. Some of the most basic types are: character (string, e.g. "TEXT"), numeric, logical (Boolean, TRUE/FALSE), and factor. These types are also called classes. In R there are therefore different classes for different data types; e.g. class character holds data of type character, class numeric holds data of type numeric, etc. A variable’s class depends on the type of the data that the variable holds.
# The following assigns variable v a value of type character; examples of character values: "TRUE", '23.4'
# Note: character data can be written with double or single quotes.
v = "TRUE" # This assigns value "TRUE" of type character to variable v.
class(v) # This calls and displays the class which variable v belongs to.
## [1] "character"
# The following assigns variable w a value of type numeric; for example: 12.3, 5
w = 23.5 # This assigns value 23.5 of type numeric to variable w.
class(w) # This calls and displays the class which variable w belongs to.
## [1] "numeric"
# The following assigns variable t a value of type logical; for example: TRUE, FALSE.
t = TRUE # This assigns value TRUE of type logical to variable t.
# Note the difference between "TRUE" (character) and TRUE (logical, without quotes).
class(t) # This calls and displays the class which variable t belongs to.
## [1] "logical"
# The following is of type factor (nominal, categorical); for example: m f m f m
u = as.factor(c("m", "f", "m"))
class(u)
## [1] "factor"
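To illustrate how the class follows the data, the sketch below converts between the basic types using base R's as.numeric(), as.logical() and as.factor(). The variable names are hypothetical.

```r
num_from_text <- as.numeric("23.4")   # character -> numeric
class(num_from_text)                  # "numeric"

flag <- as.logical("TRUE")            # character -> logical
class(flag)                           # "logical"

sex <- as.factor(c("m", "f", "m"))
levels(sex)                           # "f" "m": the distinct categories, sorted

is.character("TRUE")                  # TRUE
is.logical("TRUE")                    # FALSE: quoted text is not a logical value
```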
R functions are invoked (called) by name, followed by parentheses containing zero or more arguments.
# The following calls the function 'c' (seen earlier) to combine three numeric values into a vector.
c(1,2,3)
## [1] 1 2 3
# This calls the function mean() to calculate the mean of three values.
mean(c(10, 50, 3))
## [1] 21
# This calls function sqrt() to calculate the square root of a number.
sqrt(9)
## [1] 3
mean(c(100,40,70))
## [1] 70
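Arguments can also be passed by name. The sketch below uses the na.rm argument of mean() and the digits argument of round(), plus Sys.Date() as an example of a call with zero arguments. The variable names are hypothetical.

```r
values <- c(10, 50, NA)                # NA marks a missing value

with_na <- mean(values)                # NA: a missing value makes the mean unknown
avg     <- mean(values, na.rm = TRUE)  # 30: na.rm = TRUE drops the NA first
rounded <- round(3.14159, digits = 2)  # 3.14
n       <- length(values)              # 3: the NA still counts as an element
today   <- Sys.Date()                  # a function called with zero arguments

avg
rounded
```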
The following code shows how to read a file called top_il_income.csv, which is of type CSV (comma-separated values), and assign its contents to a variable called topIncome. Note: You can choose other variable names, but it’s important to give a variable a meaningful name.
topIncome = read.csv(file = "data/top_il_income.csv")
illinoisIncome = read.csv(file = "data/il_income.csv")
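The income files themselves are not reproduced here, so the sketch below writes a tiny CSV to a temporary file and reads it back with read.csv(). The rows are made up for illustration only (they are not taken from the real files), and toy_income and csv_path are hypothetical names. One point worth knowing: since R 4.0.0, read.csv() keeps text columns as character by default, whereas older versions converted them to factors, as in the str() output shown later in this notebook. Passing stringsAsFactors = TRUE reproduces the older behaviour.

```r
# Made-up rows for illustration only; not taken from the real income files.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("county,per_capita_income",
             "A,100",
             "B,200",
             "C,300"),
           csv_path)

# Read the file into a data frame, converting text columns to factors.
toy_income <- read.csv(file = csv_path, stringsAsFactors = TRUE)

head(toy_income)           # first rows of the data frame
nrow(toy_income)           # 3
names(toy_income)          # "county" "per_capita_income"
class(toy_income$county)   # "factor"
```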
We can extract values from the dataset to perform calculations. Below we perform subtraction, addition, and division operations on the data kept in variable topIncome. The $per_capita_income part names the field (column) per_capita_income in the data, and the index [1] selects the first value in that column.
firstIncome = topIncome$per_capita_income[1]
# The above code assigns variable firstIncome the first value of the per_capita_income field in topIncome.
secondIncome = topIncome$per_capita_income[2]
# The above code assigns variable secondIncome the second value of the per_capita_income field in topIncome.
firstIncome-secondIncome
## [1] 472
# The above code subtracts the value of secondIncome from the value of firstIncome.
firstIncome+secondIncome
## [1] 77390
(firstIncome+secondIncome)/2
## [1] 38695
firstIncome = illinoisIncome$per_capita_income[1]
secondIncome = illinoisIncome$per_capita_income[2]
firstIncome-secondIncome
## [1] -8463
firstIncome+secondIncome
## [1] 69399
(firstIncome+secondIncome)/2
## [1] 34699.5
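The extraction steps above can be tried without the CSV files. The sketch below builds a small data frame from the first three rows of the rank, per_capita_income and population columns as they appear in the str(topIncome) output later in this notebook. The names top3, first_income, second_income, income_gap and income_mean are hypothetical.

```r
# First three rows of topIncome, copied from the str() output in this notebook.
top3 <- data.frame(
  rank              = c(2, 3, 32),
  per_capita_income = c(38931, 38459, 33118),
  population        = c(933736, 703910, 46045)
)

first_income  <- top3$per_capita_income[1]   # 38931
second_income <- top3$per_capita_income[2]   # 38459

income_gap  <- first_income - second_income        # 472
income_mean <- (first_income + second_income) / 2  # 38695

# [[ ]] with a column name is an equivalent way to pull out a column.
same_column <- identical(top3[["per_capita_income"]], top3$per_capita_income)
same_column                                        # TRUE
```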
The following shows how to compute some basic statistics on the data in topIncome.
mean(topIncome$per_capita_income)
## [1] 32918.5
# The above code computes the mean of all the values in the per_capita_income field of topIncome.
median(topIncome$per_capita_income) # Computes the median
## [1] 31430
quantile(topIncome$per_capita_income) # Computes the quantiles (minimum, quartiles, maximum)
##       0%      25%      50%      75%     100%
## 30594.00 30743.75 31430.00 33103.25 38931.00
summary(topIncome) # Generates a summary of every column
##       rank           county  per_capita_income   population
##  Min.   : 2.00   DuPage :1   Min.   :30594     Min.   :  7032
##  1st Qu.: 4.25   Kane   :1   1st Qu.:30744     1st Qu.: 36920
##  Median :12.00   Kendall:1   Median :31430     Median :194782
##  Mean   :27.10   Lake   :1   Mean   :32918     Mean   :334866
##  3rd Qu.:41.00   McHenry:1   3rd Qu.:33103     3rd Qu.:648159
##  Max.   :90.00   McLean :1   Max.   :38931     Max.   :933736
##                  (Other):4
##      region
##  Min.   :2.0
##  1st Qu.:2.0
##  Median :3.0
##  Mean   :3.2
##  3rd Qu.:4.0
##  Max.   :5.0
##
mean(illinoisIncome$per_capita_income)
## [1] 25164.14
median(illinoisIncome$per_capita_income)
## [1] 24808.5
quantile(illinoisIncome$per_capita_income)
##       0%      25%      50%      75%     100%
## 14052.00 22666.00 24808.50 26899.75 38931.00
summary(illinoisIncome)
##       rank              county   per_capita_income   population
##  Min.   :  1.00   Adams    : 1   Min.   :14052     Min.   :   4135
##  1st Qu.: 26.25   Alexander: 1   1st Qu.:22666     1st Qu.:  14284
##  Median : 51.50   Bond     : 1   Median :24808     Median :  26610
##  Mean   : 51.50   Boone    : 1   Mean   :25164     Mean   : 126078
##  3rd Qu.: 76.75   Brown    : 1   3rd Qu.:26900     3rd Qu.:  53319
##  Max.   :102.00   Bureau   : 1   Max.   :38931     Max.   :5238216
##                   (Other)  :96
##      region
##  Min.   :1.000
##  1st Qu.:3.000
##  Median :4.000
##  Mean   :3.735
##  3rd Qu.:5.000
##  Max.   :5.000
##
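The topIncome statistics above can be reproduced without the CSV file. The sketch below types in the ten per_capita_income values listed in the str(topIncome) output later in this notebook; the vector name income and the result names are hypothetical.

```r
# The ten per_capita_income values of topIncome, copied from its str() output.
income <- c(38931, 38459, 33118, 33059, 31750,
            31110, 30791, 30728, 30645, 30594)

income_mean   <- mean(income)      # 32918.5
income_median <- median(income)    # 31430
income_q      <- quantile(income)  # 30594.00 30743.75 31430.00 33103.25 38931.00
income_range  <- range(income)     # 30594 38931 (minimum and maximum)

income_mean
income_median
income_q
```

Note that the 50% quantile equals the median, and the 0% and 100% quantiles equal the minimum and maximum.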
A sequence of data elements of the same basic type is defined as a vector.
c(2, 3, 5, 8) # This creates a vector of numeric values.
## [1] 2 3 5 8
c(TRUE, FALSE, TRUE) # This creates a vector of logical values.
## [1] TRUE FALSE TRUE
c("A", "B", "B-", "C", "D") # This creates a vector of character string values.
## [1] "A" "B" "B-" "C" "D"
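Because a vector holds a single basic type, c() converts mixed inputs to one common type. The sketch below shows this conversion and also that arithmetic is applied to every element of a numeric vector. The variable names are hypothetical.

```r
mixed <- c(1, "B", TRUE)
mixed                  # "1" "B" "TRUE": everything became character
class(mixed)           # "character"

num_and_lgl <- c(2, TRUE, FALSE)
num_and_lgl            # 2 1 0: logical values became the numbers 1 and 0
class(num_and_lgl)     # "numeric"

doubled <- c(2, 3, 5, 8) * 2
doubled                # 4 6 10 16: the operation applies to each element
```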
Lists, as opposed to vectors, can hold components of different types.
scores = c(80, 75, 55)
# The above code creates a vector variable called scores and assigns numeric values to it.
grades = c("B", "C", "D-")
# The above code creates a vector variable called grades and assigns character string values to it.
office_hours = c(TRUE, FALSE, FALSE)
# The above code creates a vector variable called office_hours and assigns logical values to it.
student = list(scores, grades, office_hours)
# The above code creates a list variable called student which contains three vectors: scores, grades and office_hours.
student # This calls and displays the content of the student list.
## [[1]]
## [1] 80 75 55
##
## [[2]]
## [1] "B" "C" "D-"
##
## [[3]]
## [1] TRUE FALSE FALSE
We can retrieve components (the vectors) of the list with the single square bracket [] operator. The result is itself a list that holds the selected components.
student[1] # This calls and displays first vector of the student list which is scores.
## [[1]]
## [1] 80 75 55
student[2] # This calls and displays second vector of the student list which is grades.
## [[1]]
## [1] "B" "C" "D-"
# How to call and display the third vector of the student list?
student[2:3] # This calls and displays the second to the third vectors of the student list which are .....
## [[1]]
## [1] "B" "C" "D-"
##
## [[2]]
## [1] TRUE FALSE FALSE
Using the double square bracket [[]] operator we can reference a member of the list directly. A single bracket [] still returns a list, so it does not give direct access to the elements of a particular member.
student[[1]] # This calls and displays the members of the scores vector in student list.
## [1] 80 75 55
student[[1]][3] # This calls and displays the third member of the first vector in student list which is the third member of scores.
## [1] 55
student[[2]][1] # This calls and displays the first member of the second vector in student list which is the first member of grades.
## [1] "B"
student[[3]][2]
## [1] FALSE
student[[1]][1:3] # This calls and displays first three members of scores vector in student list.
## [1] 80 75 55
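The difference between the two bracket operators shows up in the class of the result. A minimal sketch, using a hypothetical list called demo with the same scores and grades as above:

```r
demo <- list(c(80, 75, 55), c("B", "C", "D-"))

class(demo[1])      # "list": a single bracket returns a list of length 1
class(demo[[1]])    # "numeric": a double bracket returns the vector itself

length(demo[1])     # 1: one component
length(demo[[1]])   # 3: three scores

score_mean <- mean(demo[[1]])   # 70
# mean(demo[1]) would return NA with a warning, because a list is not numeric.
score_mean
```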
It’s possible to assign names to list members and reference them by names instead of by numeric indexes.
student = list(myscores = scores, mygrades = grades , myoffice_hours = office_hours)
student
## $myscores
## [1] 80 75 55
##
## $mygrades
## [1] "B" "C" "D-"
##
## $myoffice_hours
## [1] TRUE FALSE FALSE
student$myscores
## [1] 80 75 55
student$mygrades
## [1] "B" "C" "D-"
student$myoffice_hours
## [1] TRUE FALSE FALSE
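A few more things that names make possible, sketched on a hypothetical list called demo_student: listing the names, looking a member up by a name given as a string, and adding a new named member.

```r
demo_student <- list(myscores = c(80, 75, 55),
                     mygrades = c("B", "C", "D-"))

names(demo_student)            # "myscores" "mygrades"
demo_student[["myscores"]]     # same result as demo_student$myscores
demo_student$myscores[3]       # 55

# Assigning to a new name adds a member to the list.
# The pass mark of 60 is an arbitrary value chosen for this illustration.
demo_student$passed <- demo_student$myscores >= 60
demo_student$passed            # TRUE TRUE FALSE
```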
All columns in a matrix must have the same data type and the same length. The following creates a numeric matrix called my_mat with 4 rows and 5 columns, filled column by column with the sequential numbers 1 to 20.
my_mat = matrix(1:20, nrow=4, ncol=5)
my_mat # This retrieves and displays all content of the matrix.
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    5    9   13   17
## [2,]    2    6   10   14   18
## [3,]    3    7   11   15   19
## [4,]    4    8   12   16   20
my_mat[,4] # This retrieves and displays the 4th column of the matrix.
## [1] 13 14 15 16
my_mat[3,] # This retrieves and displays the 3rd row of the matrix.
## [1] 3 7 11 15 19
my_mat[2:4,1:3] # This retrieves and displays rows 2,3,4 of columns 1,2,3 of the matrix.
##      [,1] [,2] [,3]
## [1,]    2    6   10
## [2,]    3    7   11
## [3,]    4    8   12
my_mat[3:4,1:3]
##      [,1] [,2] [,3]
## [1,]    3    7   11
## [2,]    4    8   12
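By default matrix() fills column by column, as the output above shows. The sketch below contrasts this with the byrow = TRUE argument and reads out single cells. The names by_col and by_row are hypothetical.

```r
by_col <- matrix(1:20, nrow = 4, ncol = 5)                # filled down the columns
by_row <- matrix(1:20, nrow = 4, ncol = 5, byrow = TRUE)  # filled along the rows

dim(by_col)     # 4 5: number of rows, then number of columns
by_col[2, 3]    # 10: row 2, column 3 when filled by column
by_row[2, 3]    # 8: row 2, column 3 when filled by row
by_row[1, ]     # 1 2 3 4 5: the first row now holds the first five numbers
```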
A data frame is more general than a matrix, in that different columns can have different data types (numeric, character, logical, factor). It is a powerful way to work with mixed data structures.
When we need to store data in table form, we use data frames, which are created by combining vectors of equal length. The variables of a data set are the columns and the observations are the rows.
The str() function displays the internal structure of any R data structure or object, which helps us check that it is correct.
str(topIncome)
## 'data.frame': 10 obs. of 5 variables:
## $ rank : int 2 3 32 44 67 16 4 8 5 90
## $ county : Factor w/ 10 levels "DuPage","Kane",..: 1 4 5 7 8 3 10 6 2 9
## $ per_capita_income: int 38931 38459 33118 33059 31750 31110 30791 30728 30645 30594
## $ population : int 933736 703910 46045 33879 16387 123355 687263 266209 530847 7032
## $ region : int 2 2 4 5 4 2 2 5 2 4
Snapshot of the solar system.
name = c("Earth", "Mars", "Jupiter")
type = c("Terrestrial","Terrestrial", "Gas giant")
diameter = c(1, 0.532, 11.209)
rotation = c(1, 1.03, 0.41)
rings = c(FALSE, FALSE, TRUE)
Now, by combining the vectors of equal size, we can create a data frame object.
planets_df = data.frame(name,type,diameter,rotation,rings)
planets_df
##      name        type diameter rotation rings
## 1   Earth Terrestrial    1.000     1.00 FALSE
## 2    Mars Terrestrial    0.532     1.03 FALSE
## 3 Jupiter   Gas giant   11.209     0.41  TRUE
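Once the data frame exists, rows and columns can be selected much like in a matrix, and a logical column can act as a row filter. The sketch below rebuilds the same table under the hypothetical name planets so that it runs on its own; the result names are also hypothetical.

```r
planets <- data.frame(
  name     = c("Earth", "Mars", "Jupiter"),
  type     = c("Terrestrial", "Terrestrial", "Gas giant"),
  diameter = c(1, 0.532, 11.209),
  rotation = c(1, 1.03, 0.41),
  rings    = c(FALSE, FALSE, TRUE)
)

dim(planets)                       # 3 5: three observations, five variables

# Rows where rings is TRUE; the empty slot after the comma keeps all columns.
ringed <- planets[planets$rings, ]
nrow(ringed)                       # 1

# Name of the planet with the largest diameter.
# as.character() guards against name being stored as a factor in older R versions.
largest <- as.character(planets$name[which.max(planets$diameter)])
largest                            # "Jupiter"

# Mean diameter of the terrestrial planets only.
terrestrial_mean <- mean(planets$diameter[planets$type == "Terrestrial"])
terrestrial_mean                   # 0.766
```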
Datacamp - Learn Data Science from your browser: https://www.datacamp.com/courses/free-introduction-to-r
R-tutor - An R intro to stats that explains basic R concepts: http://www.r-tutor.com/r-introduction
Data samples used in this worksheet were downloaded from the U.S. Census Bureau American FactFinder site.
* “SELECTED ECONOMIC CHARACTERISTICS 2006-2010 American Community Survey 5-Year Estimates”, U.S. Census Bureau. Retrieved 2016-09-09: https://factfinder.census.gov/faces/nav/jsf/pages/index.xhtml