R is a language and environment for statistical computing and graphics. R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, ...) and graphical techniques, and is highly extensible.
This notebook is a tutorial on how to use R.
Before starting to work with R, we need to set the working directory to the source file location.
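As a minimal sketch of the step above, base R's `getwd()` and `setwd()` report and change the working directory. Here `tempdir()` is only a stand-in for the notebook's folder (an assumption for illustration); in RStudio the menu entry Session > Set Working Directory > To Source File Location has the same effect.

```r
# Remember the current working directory so it can be restored afterwards
old_wd <- getwd()

# tempdir() is only a stand-in for the notebook's folder (assumption)
demo_dir <- tempdir()
setwd(demo_dir)

# Relative paths such as "data/il_income.csv" are now resolved from here
current_wd <- getwd()
print(current_wd)

# Return to the original location
setwd(old_wd)
```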
First we will begin with a few basic operations.
We assign values to variables using the assignment operator ‘=’. Another, more general form of assignment is the ‘<-’ operator. A variable allows you to store a value or an object (e.g. a function).
x = 128
y = 16
z <- 5
vars = c(2,4,8,16,32) # This is a vector created using the generic combine function 'c'
x # display value of variable x
## [1] 128
z # displays value of variable z
## [1] 5
vars[1] #This calls the first value in the vector vars
## [1] 2
vars[2] #This calls the second value in the vector vars
## [1] 4
vars[1:3] #This calls the first through third values in the vector vars
## [1] 2 4 8
vars #This calls the vector
## [1] 2 4 8 16 32
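The text above notes that a variable can also store an object such as a function. The sketch below illustrates this with a made-up helper, `scale_by` (a hypothetical name, not part of R), and also shows one reason `<-` is described as more general: inside a function call, `=` names an argument rather than creating a variable.

```r
# A variable can hold a function as well as a value
# (scale_by is a hypothetical helper defined only for this example)
scale_by <- function(value, factor_of = 2) value * factor_of

doubled <- scale_by(8)                             # uses the default factor_of = 2
quadrupled <- scale_by(value = 8, factor_of = 4)   # `=` here names the arguments

print(doubled)      # 16
print(quadrupled)   # 32

# Naming an argument with `=` inside a call does not create a variable
print(exists("factor_of"))   # FALSE
```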
Below are some simple arithmetic operations.
12*6
## [1] 72
128/16
## [1] 8
9^2
## [1] 81
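A few more arithmetic operators are worth knowing alongside the ones above. This short sketch shows integer division, the remainder operator, operator precedence, and arithmetic applied to a whole vector at once; the numbers are arbitrary.

```r
int_div <- 17 %/% 5          # integer division: 3
remainder <- 17 %% 5         # remainder: 2

grouped <- (12 + 6) * 2      # parentheses first: 36
ungrouped <- 12 + 6 * 2      # multiplication before addition: 24

vals <- c(2, 4, 8)
doubled_vals <- vals * 2     # applied element by element: 4 8 16

print(c(int_div, remainder, grouped, ungrouped))
print(doubled_vals)
```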
R works with numerous data types. Some of the most basic types are numeric, integer, logical (Boolean: TRUE/FALSE), and character (string: "TEXT").
#Type: Character
#Example:"TRUE",'23.4'
v = "TRUE"
class(v)
## [1] "character"
#Type: Numeric
#Example: 12.3,5
v = 23.5
class(v)
## [1] "numeric"
#Type: Logical
#Example: TRUE,FALSE
v = TRUE
class(v)
## [1] "logical"
#Type: Factor (nominal, categorical)
#Example: m f m f m
v = as.factor(c("m", "f", "m"))
class(v)
## [1] "factor"
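The list of basic types mentions integers, which the examples above do not show. As a brief sketch: a plain number such as `5` is stored as numeric (double), while the `L` suffix creates an integer. The block also shows converting a character string to a number and inspecting the levels of a factor.

```r
n_num <- 5       # numeric (double) by default
n_int <- 5L      # the L suffix requests an integer

print(class(n_num))        # "numeric"
print(class(n_int))        # "integer"
print(is.numeric(n_int))   # TRUE: integers also count as numeric values

# Convert a character string that looks like a number
parsed <- as.numeric("23.4")
print(class(parsed))       # "numeric"

# A factor stores its distinct categories as levels, sorted alphabetically
sex <- as.factor(c("m", "f", "m"))
print(levels(sex))         # "f" "m"
```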
R functions are invoked by name, followed by parentheses containing zero or more arguments.
# The following applies the function 'c' (seen earlier) to combine three numeric values into a vector
c(1,2,3)
## [1] 1 2 3
# Example of function mean() to calculate the mean of three values
mean(c(5,6,7))
## [1] 6
# Square root of a number
sqrt(99)
## [1] 9.949874
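Arguments can also be passed by name, and you can define functions of your own. The following sketch uses the built-in `round()` and `mean()` with named arguments, then defines `average_of_two`, a hypothetical helper written only for this illustration.

```r
# Named arguments make a call easier to read
rounded_root <- round(sqrt(99), digits = 2)
print(rounded_root)   # 9.95

# na.rm = TRUE tells mean() to skip missing values
mean_without_na <- mean(c(5, 6, NA), na.rm = TRUE)
print(mean_without_na)   # 5.5

# A user-defined function (hypothetical helper for this example)
average_of_two <- function(a, b) {
  (a + b) / 2
}
print(average_of_two(38931, 38459))   # 38695
```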
# Here we are reading two files of type csv (comma-separated values), a format commonly exported from Excel
il_income = read.csv(file = "data/il_income.csv")
top_il_income = read.csv(file = "data/top_il_income.csv")
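If the data files are not at hand, the round trip below sketches how `read.csv()` behaves, using a two-row stand-in table written to a temporary file. The table and the names `sample_rows` and `csv_path` are assumptions for illustration only. Note that from R 4.0.0 onward `read.csv()` keeps text columns as character by default; passing `stringsAsFactors = TRUE` reproduces the Factor columns seen in this notebook's output.

```r
# Two stand-in rows (illustrative only, not the full data set)
sample_rows <- data.frame(
  county = c("DuPage", "Lake"),
  per_capita_income = c(38931, 38459)
)

# Write the rows to a temporary CSV file, then read them back
csv_path <- tempfile(fileext = ".csv")
write.csv(sample_rows, csv_path, row.names = FALSE)

income <- read.csv(file = csv_path, stringsAsFactors = TRUE)

str(income)           # structure: column names and types
head(income, n = 1)   # first row only

unlink(csv_path)      # remove the temporary file
```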
We can extract values from the dataset to perform calculations.
DuPage = top_il_income$per_capita_income[1]
Lake = top_il_income$per_capita_income[2]
DuPage-Lake
## [1] 472
DuPage+Lake
## [1] 77390
(DuPage+Lake)/2
## [1] 38695
# Repeat the above arithmetic operations using instead McHenry and Sangamon counties
McHenry = top_il_income$per_capita_income[3]
Sangamon = top_il_income$per_capita_income[10]
McHenry-Sangamon
## [1] 2524
McHenry+Sangamon
## [1] 63712
(McHenry+Sangamon)/2
## [1] 31856
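Extracting values by position, as above, depends on the row order of the file. A possible alternative, sketched here on a two-row stand-in data frame (an assumption, not the real file), is to select rows by county name with a logical condition.

```r
# Stand-in data frame with two counties (illustrative only)
income <- data.frame(
  county = c("DuPage", "Lake"),
  per_capita_income = c(38931, 38459),
  stringsAsFactors = FALSE
)

# Select the income where the county column matches a name
dupage <- income$per_capita_income[income$county == "DuPage"]
lake <- income$per_capita_income[income$county == "Lake"]

print(dupage - lake)           # 472
print(mean(c(dupage, lake)))   # 38695
```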
## Basic Statistics
mean(il_income$per_capita_income)
## [1] 25164.14
median(il_income$per_capita_income)
## [1] 24808.5
quantile(il_income$per_capita_income)
##       0%      25%      50%      75%     100%
## 14052.00 22666.00 24808.50 26899.75 38931.00
summary(il_income)
##       rank              county    per_capita_income   population
##  Min.   :  1.00   Adams    : 1   Min.   :14052     Min.   :   4135
##  1st Qu.: 26.25   Alexander: 1   1st Qu.:22666     1st Qu.:  14284
##  Median : 51.50   Bond     : 1   Median :24809     Median :  26610
##  Mean   : 51.50   Boone    : 1   Mean   :25164     Mean   : 126078
##  3rd Qu.: 76.75   Brown    : 1   3rd Qu.:26900     3rd Qu.:  53319
##  Max.   :102.00   Bureau   : 1   Max.   :38931     Max.   :5238216
##                   (Other)  :96
##      region
##  Min.   :1.000
##  1st Qu.:3.000
##  Median :4.000
##  Mean   :3.735
##  3rd Qu.:5.000
##  Max.   :5.000
##
mean(top_il_income$per_capita_income)
## [1] 32918.5
median(top_il_income$per_capita_income)
## [1] 31430
quantile(top_il_income$per_capita_income)
##       0%      25%      50%      75%     100%
## 30594.00 30743.75 31430.00 33103.25 38931.00
# Repeat the basic statistics here using instead the data from the file top_il_income
summary(top_il_income)
##       rank           county   per_capita_income   population
##  Min.   : 2.00   DuPage :1   Min.   :30594     Min.   :  7032
##  1st Qu.: 4.25   Kane   :1   1st Qu.:30744     1st Qu.: 36921
##  Median :12.00   Kendall:1   Median :31430     Median :194782
##  Mean   :27.10   Lake   :1   Mean   :32919     Mean   :334866
##  3rd Qu.:41.00   McHenry:1   3rd Qu.:33103     3rd Qu.:648159
##  Max.   :90.00   McLean :1   Max.   :38931     Max.   :933736
##                  (Other):4
##      region
##  Min.   :2.0
##  1st Qu.:2.0
##  Median :3.0
##  Mean   :3.2
##  3rd Qu.:4.0
##  Max.   :5.0
##
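Beyond `mean()`, `median()`, `quantile()`, and `summary()`, base R offers several other summary functions. The sketch below applies a few of them to a small made-up vector (not the census data) so the results are easy to check by hand.

```r
values <- c(10, 20, 30, 40, 50)   # made-up numbers, not the census data

value_range <- range(values)      # smallest and largest value: 10 50
value_sd <- sd(values)            # sample standard deviation
value_iqr <- IQR(values)          # 75th minus 25th percentile: 20

# quantile() accepts any probabilities through the probs argument
tails <- quantile(values, probs = c(0.1, 0.9))

print(value_range)
print(value_sd)
print(value_iqr)
print(tails)
```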
# Vectors
## Defining a Vector
A sequence of data elements of the same basic type is defined as a vector.
# vector of numeric values
c(2, 3, 5, 8)
## [1] 2 3 5 8
# vector of logical values.
c(TRUE, FALSE, TRUE)
## [1] TRUE FALSE TRUE
# vector of character strings.
c("A", "B", "B-", "C", "D")
## [1] "A" "B" "B-" "C" "D"
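Because a vector holds a single basic type, combining different types with `c()` makes R convert everything to the most general one. A small sketch of this coercion:

```r
# Mixing numbers, text, and logicals yields a character vector
mixed <- c(1, "a", TRUE)
print(class(mixed))   # "character"
print(mixed)          # "1" "a" "TRUE"

# Mixing numbers and logicals yields a numeric vector (TRUE becomes 1)
num_logical <- c(1.5, TRUE)
print(class(num_logical))   # "numeric"
print(num_logical)          # 1.5 1.0
```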
Lists, as opposed to vectors, can hold components of different types.
scores = c(80, 75, 55) # vector of numeric values
grades = c("B", "C", "D-") # vector of character strings.
office_hours = c(TRUE, FALSE, FALSE) # vector of logical values.
student = list(scores,grades,office_hours) # list of vectors
student
## [[1]]
## [1] 80 75 55
##
## [[2]]
## [1] "B" "C" "D-"
##
## [[3]]
## [1] TRUE FALSE FALSE
We can retrieve components of the list with the single square bracket [] operator.
student[1]
## [[1]]
## [1] 80 75 55
student[2]
## [[1]]
## [1] "B" "C" "D-"
student[3]
## [[1]]
## [1] TRUE FALSE FALSE
# first two components of the list
student[1:2]
## [[1]]
## [1] 80 75 55
##
## [[2]]
## [1] "B" "C" "D-"
Using the double square bracket [[]] operator, we can reference a member of the list directly. Using [] returns a sub-list containing that member, so the member's individual elements cannot be extracted from the result directly.
student[[1]] # Components of the Scores Vector
## [1] 80 75 55
First element of the scores vector (i.e. the first member of the list)
student[[1]][1]
## [1] 80
First three elements of the scores vector (i.e. the first member of the list)
student[[1]][1:3]
## [1] 80 75 55
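The difference between the two bracket operators can be checked by inspecting what each one returns. This sketch rebuilds the same unnamed list so that it runs on its own.

```r
student <- list(c(80, 75, 55), c("B", "C", "D-"), c(TRUE, FALSE, FALSE))

# Single brackets return a list that contains the selected member
print(class(student[1]))      # "list"
print(length(student[1]))     # 1

# Double brackets return the member itself
print(class(student[[1]]))    # "numeric"
print(length(student[[1]]))   # 3

# Functions such as mean() need the member itself, so use [[ ]]
print(mean(student[[1]]))     # 70
```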
It’s possible to assign names to list members and reference them by names instead of by numeric indexes.
student = list(myscores = scores, mygrades = grades, myoffice_hours = office_hours)
student
## $myscores
## [1] 80 75 55
##
## $mygrades
## [1] "B" "C" "D-"
##
## $myoffice_hours
## [1] TRUE FALSE FALSE
student$myscores
## [1] 80 75 55
student$mygrades
## [1] "B" "C" "D-"
student$myoffice_hours
## [1] TRUE FALSE FALSE
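Named members can also be reached with `[[ ]]` and a quoted name, and new members can be added by assignment. In this sketch the member `mymajor` and its value are hypothetical, added only to illustrate the syntax.

```r
student <- list(myscores = c(80, 75, 55), mygrades = c("B", "C", "D-"))

print(names(student))           # "myscores" "mygrades"

# [[ ]] with a quoted name is equivalent to the $ form
print(student[["myscores"]])    # 80 75 55
print(student$myscores[2])      # 75

# Add a new member by assigning to a new name (hypothetical value)
student$mymajor <- "Statistics"
print(names(student))           # "myscores" "mygrades" "mymajor"
```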
All columns in a matrix must have the same data type and the same length.
Create a numeric matrix of 5 rows and 4 columns made of sequential numbers 1:20
x_mat = matrix(1:20, nrow=5, ncol=4)
x_mat
##      [,1] [,2] [,3] [,4]
## [1,]    1    6   11   16
## [2,]    2    7   12   17
## [3,]    3    8   13   18
## [4,]    4    9   14   19
## [5,]    5   10   15   20
Retrieve the 4th column of matrix
x_mat[,4]
## [1] 16 17 18 19 20
Retrieve the 3rd row of matrix
x_mat[3,]
## [1] 3 8 13 18
Retrieve rows 2,3,4 of columns 1,2,3
x_mat[2:4,1:3]
##      [,1] [,2] [,3]
## [1,]    2    7   12
## [2,]    3    8   13
## [3,]    4    9   14
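By default `matrix()` fills values column by column, as the output above shows. The sketch below illustrates a few related options: filling by row with `byrow = TRUE`, checking dimensions with `dim()`, keeping a single column as a matrix with `drop = FALSE`, and transposing with `t()`.

```r
x_mat <- matrix(1:20, nrow = 5, ncol = 4)
print(dim(x_mat))   # 5 4 (rows, columns)

# Fill the same values row by row instead of column by column
by_row <- matrix(1:20, nrow = 5, ncol = 4, byrow = TRUE)
print(by_row[1, ])  # 1 2 3 4

# drop = FALSE keeps the result as a one-column matrix instead of a vector
col4 <- x_mat[, 4, drop = FALSE]
print(dim(col4))    # 5 1

# t() swaps rows and columns
print(dim(t(x_mat)))   # 4 5
```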
A data frame is more general than a matrix, in that different columns can have different data types (numeric, character, logical, factor). It is a powerful way to work with mixed data structures.
When we need to store data in table form, we use data frames, which are created by combining lists of vectors of equal length. The variables of a data set are the columns and the observations are the rows.
The str() function displays the internal structure of any R data structure or object, which helps us check that it is correct.
str(il_income)
## 'data.frame':    102 obs. of  5 variables:
##  $ rank             : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ county           : Factor w/ 102 levels "Adams","Alexander",..: 16 22 49 99 45 60 101 64 86 10 ...
##  $ per_capita_income: int  30468 38931 38459 30791 30645 23937 24802 30728 23279 26087 ...
##  $ population       : int  5238216 933736 703910 687263 530847 307343 287078 266209 264052 208861 ...
##  $ region           : int  1 2 2 2 2 2 2 5 5 3 ...
Snapshot of the solar system.
name = c("Earth", "Mars", "Jupiter")
type = c("Terrestrial","Terrestrial", "Gas giant")
diameter = c(1, 0.532, 11.209)
rotation = c(1, 1.03, 0.41)
rings = c(FALSE, FALSE, TRUE)
Now, by combining the vectors of equal size, we can create a data frame object.
planets_df = data.frame(name,type,diameter,rotation,rings)
planets_df
##      name        type diameter rotation rings
## 1   Earth Terrestrial    1.000     1.00 FALSE
## 2    Mars Terrestrial    0.532     1.03 FALSE
## 3 Jupiter   Gas giant   11.209     0.41  TRUE
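Once a data frame exists, rows and columns can be selected much like in a matrix, with the added convenience of column names and logical conditions. This sketch rebuilds the planets data frame so that it runs on its own; `stringsAsFactors = FALSE` is set explicitly (an assumption, made so that text columns stay character regardless of R version).

```r
planets_df <- data.frame(
  name = c("Earth", "Mars", "Jupiter"),
  type = c("Terrestrial", "Terrestrial", "Gas giant"),
  diameter = c(1, 0.532, 11.209),
  rotation = c(1, 1.03, 0.41),
  rings = c(FALSE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

print(dim(planets_df))   # 3 rows, 5 columns

# Select rows with a logical condition and columns by name
ringed <- planets_df[planets_df$rings, "name"]
print(ringed)            # "Jupiter"

small <- planets_df[planets_df$diameter <= 1, c("name", "rotation")]
print(small)             # rows for Earth and Mars

# Each column keeps its own type
column_types <- sapply(planets_df, class)
print(column_types)
```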
Datacamp - Learn Data Science from your browser: https://www.datacamp.com/courses/free-introduction-to-r
R-tutor - An R intro to stats that explains basic R concepts: http://www.r-tutor.com/r-introduction
Data samples used in this worksheet were downloaded from the U.S. Census Bureau American FactFinder site.