R is a language and environment for statistical computing and graphics. R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, ...) and graphical techniques, and is highly extensible.
This notebook is a tutorial on how to use R.
Before starting to work with R, we need to set the working directory to the location of the source file.
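As a rough sketch of that step, the working directory can be inspected with getwd() and changed with setwd(). The folder used here is an assumption: tempdir() simply stands in for the folder that holds this notebook, so substitute your own path.

```r
# Remember the current working directory so it can be restored later.
old_dir <- getwd()

# Change the working directory. tempdir() is only a stand-in (an assumption)
# for the folder that contains this notebook and its data/ subfolder.
setwd(tempdir())
new_dir <- getwd()
print(new_dir)

# Return to the original working directory.
setwd(old_dir)
```

In RStudio the same result can usually be reached from the menu (Session > Set Working Directory > To Source File Location).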
First we will begin with a few basic operations.
We assign values to variables using the assignment operator ‘=’. Another form of assignment, more general, is the ‘<-’ operator. A variable allows you to store values or an object (e.g. a function). Note: R is case sensitive. X and x are treated as two different variable names.
x = 128 # Here x is the variable name and 128 is the value assigned for x
y = 16
z <- 5
vars = c(2,4,8,16,32) # This is a vector named vars which is created using the generic combine function 'c'
# TASK 1: Assign a value 19 to a variable named w
w = 19
x # This calls variable x and displays its value.
## [1] 128
z # This calls variable z and displays its value.
## [1] 5
# TASK 2: Call variable w and display its value
w # This calls variable w and displays its value.
## [1] 19
vars[1] # This calls vector vars and displays its first element.
## [1] 2
vars[2] # This calls vector vars and displays its second element.
## [1] 4
vars[1:3] # This calls vector vars and displays its first through third elements.
## [1] 2 4 8
vars # This calls vector vars and displays all of its elements.
## [1] 2 4 8 16 32
# TASK 3: Call vector vars and display its third through fifth elements
vars[3:5]
## [1] 8 16 32
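The notes on assignment and case sensitivity above can be checked with a short sketch. The names count, Count, vals and avg are made up for this illustration and are not used elsewhere in the notebook.

```r
count = 3     # assignment with '='
Count <- 30   # assignment with '<-'; Count is a different variable from count
names_differ <- count != Count
names_differ  # TRUE, because 3 is not equal to 30

# '<-' also works inside a function call: this stores the vector in vals
# and passes it to mean() in one step. Inside a call, '=' would instead be
# read as naming an argument.
avg <- mean(vals <- c(2, 4, 6))
vals  # 2 4 6
avg   # 4
```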
Below are some simple arithmetic operations.
12*6 # This operation multiplies 12 and 6. Multiplication uses the symbol *
## [1] 72
128/16 # This is a division operation.
## [1] 8
9^2 # This is how to compute 9 to the power of 2.
## [1] 81
# TASK 4: Compute 3 to the power of 3
3^3
## [1] 27
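Arithmetic operators can also be combined in one expression. The sketch below, using made-up numbers, suggests how R applies the usual order of operations (powers first, then multiplication and division, then addition and subtraction) and how parentheses override it.

```r
a <- 2 + 3 * 4     # multiplication happens first: 2 + 12
b <- (2 + 3) * 4   # parentheses happen first: 5 * 4
d <- 2 * 3^2       # the power happens first: 2 * 9

a  # 14
b  # 20
d  # 18
```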
R works with numerous data types. Some of the most basic types are: character (string, e.g. "TEXT"), numeric, logical (Boolean, TRUE/FALSE), and factor. These types are also called classes, so in R there are different classes for different data types; e.g. class character holds data of type character, class numeric holds data of type numeric, etc. A variable’s class depends on the type of the data that the variable holds.
#The following is a variable v which is assigned a value of type: Character.
#Example of data values with type character:"TRUE",'23.4'
#Note: character data can be written with double or single quotes
v = "TRUE" # This assigns value "TRUE" of type character to variable v
class(v) # This calls and displays the class which variable v belongs to
## [1] "character"
#The following is assigning variable v a data of type: numeric
#Example: 12.3,5
v = 23.5
class(v)
## [1] "numeric"
#The following is assigning variable v a data of type: Logical
#Example: TRUE,FALSE
v = TRUE # Note: the difference between "TRUE" (character) and TRUE (logical, without quotes)
class(v)
## [1] "logical"
#The following is type: Factor (nominal, categorical)
#Example: m f m f m
v = as.factor(c("m", "f", "m"))
class(v)
## [1] "factor"
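As a small sketch of why the class matters, the same characters can behave differently depending on their type. The variable names txt, num and sex are chosen only for this illustration.

```r
txt <- "23.4"            # quotes make this a character value
class(txt)               # "character"

num <- as.numeric(txt)   # convert the text to a number
class(num)               # "numeric"
num + 1                  # 24.4; arithmetic works on num but not on txt

# A factor stores categorical data as a fixed set of levels.
sex <- as.factor(c("m", "f", "m", "f", "m"))
levels(sex)              # "f" "m" (levels are sorted alphabetically)
```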
R functions are invoked (called) by name, followed by parentheses that contain zero or more arguments.
# The following calls the function 'c' (seen earlier) to combine three numeric values into a vector
c(1,2,3)
## [1] 1 2 3
# This calls the function mean() to calculate the mean of three values
mean(c(5,6,7))
## [1] 6
# This calls the function sqrt() to calculate the square root of a number
sqrt(16)
## [1] 4
# TASK 5: Combine three numeric values (2,19,11) into a vector using function c() AND call the function mean() to calculate the mean of the three values
mean(c(2,19,11))
## [1] 10.66667
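To illustrate "zero or more arguments", here is a brief sketch using three base R functions. The variable names today, avg and rounded are only for this example.

```r
today <- Sys.Date()                 # zero arguments: returns the current date
avg <- mean(c(2, 19, 11))           # one argument: a numeric vector
rounded <- round(avg, digits = 2)   # two arguments; the second is passed by name

class(today)  # "Date"
rounded       # 10.67
```

Passing an argument by name (digits = 2) tends to make a call easier to read than relying on position alone.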
The following shows how to read a file called top_il_income.csv, which is of type csv (comma-separated values), and assign its contents to a variable called top_il_income. Note: You can give the variable another name, but it’s important that the name is meaningful.
top_il_income = read.csv(file = "data/top_il_income.csv")
# TASK 6: Read a file called il_income.csv and assign its contents to a variable called il_income
il_income = read.csv(file = "data/il_income.csv")
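If the data files are not at hand, the reading step can still be tried out on a throwaway file. The sketch below writes a tiny csv to a temporary location and reads it back; the county names and income values are invented for illustration and are not taken from the real data.

```r
# Create a small csv file in a temporary location (made-up values).
csv_path <- tempfile(fileext = ".csv")
writeLines(c("county,per_capita_income",
             "Alpha,30000",
             "Beta,25000",
             "Gamma,20000"),
           csv_path)

# Read the file; the first line is used as the column names by default.
demo_income <- read.csv(file = csv_path)

names(demo_income)    # "county" "per_capita_income"
nrow(demo_income)     # 3
head(demo_income, 2)  # shows the first two rows
```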
We can extract values from the dataset to perform calculations. Below we perform subtraction, addition, and division operations on the data kept in variable top_il_income. The $per_capita_income part selects the field (column) named per_capita_income, and the index [1] selects the first value in that column.
FirstIncome = top_il_income$per_capita_income[1] # This assigns variable FirstIncome the first value under per_capita_income field in file top_il_income
SecondIncome = top_il_income$per_capita_income[2]
FirstIncome-SecondIncome
## [1] 472
FirstIncome+SecondIncome
## [1] 77390
(FirstIncome+SecondIncome)/2
## [1] 38695
# TASK 7: Repeat the above arithmetic operations using instead the $per_capita_income field from file il_income
FirstIncome = il_income$per_capita_income[1]
SecondIncome = il_income$per_capita_income[2]
FirstIncome-SecondIncome
## [1] -8463
FirstIncome+SecondIncome
## [1] 69399
(FirstIncome+SecondIncome)/2
## [1] 34699.5
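The same $ and [ ] pattern can be sketched on a small hand-made data frame, which makes the results easy to verify by eye. The data frame demo and its values are hypothetical.

```r
# A tiny data frame with made-up values.
demo <- data.frame(county = c("Alpha", "Beta", "Gamma"),
                   per_capita_income = c(30000, 25000, 20000))

first  <- demo$per_capita_income[1]   # first value of the column
second <- demo$per_capita_income[2]   # second value of the column

first - second         # 5000
(first + second) / 2   # 27500

# Row and column can also be given together inside one pair of brackets.
demo[1, "per_capita_income"]   # 30000
```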
# Basic Statistics
mean(top_il_income$per_capita_income) # This computes the mean of all the data under field per_capita_income in file top_il_income
## [1] 32918.5
median(top_il_income$per_capita_income)
## [1] 31430
quantile(top_il_income$per_capita_income)
## 0% 25% 50% 75% 100%
## 30594.00 30743.75 31430.00 33103.25 38931.00
summary(top_il_income)
## rank county per_capita_income population
## Min. : 2.00 DuPage :1 Min. :30594 Min. : 7032
## 1st Qu.: 4.25 Kane :1 1st Qu.:30744 1st Qu.: 36920
## Median :12.00 Kendall:1 Median :31430 Median :194782
## Mean :27.10 Lake :1 Mean :32918 Mean :334866
## 3rd Qu.:41.00 McHenry:1 3rd Qu.:33103 3rd Qu.:648159
## Max. :90.00 McLean :1 Max. :38931 Max. :933736
## (Other):4
## region
## Min. :2.0
## 1st Qu.:2.0
## Median :3.0
## Mean :3.2
## 3rd Qu.:4.0
## Max. :5.0
##
# TASK 8: Repeat the basic statistics (i.e. to compute the mean, median, and quantile) using instead the data under field per_capita_income in file il_income
mean(il_income$per_capita_income)
## [1] 25164.14
median(il_income$per_capita_income)
## [1] 24808.5
quantile(il_income$per_capita_income)
## 0% 25% 50% 75% 100%
## 14052.00 22666.00 24808.50 26899.75 38931.00
summary(il_income)
## rank county per_capita_income population
## Min. : 1.00 Adams : 1 Min. :14052 Min. : 4135
## 1st Qu.: 26.25 Alexander: 1 1st Qu.:22666 1st Qu.: 14284
## Median : 51.50 Bond : 1 Median :24808 Median : 26610
## Mean : 51.50 Boone : 1 Mean :25164 Mean : 126078
## 3rd Qu.: 76.75 Brown : 1 3rd Qu.:26900 3rd Qu.: 53319
## Max. :102.00 Bureau : 1 Max. :38931 Max. :5238216
## (Other) :96
## region
## Min. :1.000
## 1st Qu.:3.000
## Median :4.000
## Mean :3.735
## 3rd Qu.:5.000
## Max. :5.000
##
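To see how these statistics relate to each other, here is a sketch on a short made-up vector. With evenly spaced values the results can be checked by hand; note that the median is the same as the 50% quantile.

```r
incomes <- c(20000, 25000, 30000, 35000, 40000)   # made-up values

mean(incomes)     # 30000
median(incomes)   # 30000

q <- quantile(incomes)   # minimum, quartiles and maximum
q[["25%"]]   # 25000
q[["50%"]]   # 30000, equal to the median
q[["75%"]]   # 35000
```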
A sequence of data elements of the same basic type is defined as a vector.
c(2, 3, 5, 8) # This creates a vector of numeric values.
## [1] 2 3 5 8
c(TRUE, FALSE, TRUE) # This creates a vector of logical values.
## [1] TRUE FALSE TRUE
c("A", "B", "B-", "C", "D") # This creates a vector of character string values.
## [1] "A" "B" "B-" "C" "D"
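Because a vector holds a single basic type, mixing types inside c() makes R convert the values to a common type. A short sketch, with names chosen only for this example:

```r
mixed <- c(2, "B", TRUE)   # a number, a string and a logical value
class(mixed)               # "character"
mixed                      # "2" "B" "TRUE" (everything became text)

num_lgl <- c(2, TRUE)      # a number and a logical value
class(num_lgl)             # "numeric"
num_lgl                    # 2 1 (TRUE was converted to 1)
```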
Lists, as opposed to vectors, can hold components of different types.
scores = c(80, 75, 55) # This creates a vector variable called scores and assigns numeric values to it.
grades = c("B", "C", "D-") # This creates a vector variable called grades and assigns character string values to it.
office_hours = c(TRUE, FALSE, FALSE) # This creates a vector variable called office_hours and assigns logical values to it.
student = list(scores,grades,office_hours) # This creates a list variable called student with the scores, grades and office_hours vectors as its components.
student # This calls and displays the content of the student list.
## [[1]]
## [1] 80 75 55
##
## [[2]]
## [1] "B" "C" "D-"
##
## [[3]]
## [1] TRUE FALSE FALSE
We can retrieve components of the list with the single square bracket [] operator.
student[1] # This calls and displays first component of the student list.
## [[1]]
## [1] 80 75 55
student[2] # This calls and displays second component of the student list.
## [[1]]
## [1] "B" "C" "D-"
student[3]
## [[1]]
## [1] TRUE FALSE FALSE
student[1:2] # This calls and displays first two components of the student list.
## [[1]]
## [1] 80 75 55
##
## [[2]]
## [1] "B" "C" "D-"
Using the double square bracket [[]] operator we can reference a member of the list directly. A single bracket [] returns a smaller list that contains the member, so it does not give direct access to the member itself.
student[[1]] # This calls and displays the scores vector in student list.
## [1] 80 75 55
student[[1]][1] # This calls and displays first element of scores vector in student list.
## [1] 80
student[[2]][3] #This calls and displays third element of grades vector in student list.
## [1] "D-"
# TASK 9: Call and display the second element of office_hours vector in student list
student[[3]][2]
## [1] FALSE
student[[1]][1:3] # This calls and displays first three elements of scores vector in student list.
## [1] 80 75 55
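The difference between [] and [[]] can be checked with class(). This sketch rebuilds a two-member list under the hypothetical name demo_student so that it runs on its own.

```r
demo_student <- list(c(80, 75, 55), c("B", "C", "D-"))

class(demo_student[1])     # "list": single brackets return a list
class(demo_student[[1]])   # "numeric": double brackets return the member itself

# The member can be used directly in calculations.
mean(demo_student[[1]])    # 70
```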
It’s possible to assign names to list members and reference them by names instead of by numeric indexes.
student = list(myscores = scores, mygrades = grades , myoffice_hours = office_hours)
student
## $myscores
## [1] 80 75 55
##
## $mygrades
## [1] "B" "C" "D-"
##
## $myoffice_hours
## [1] TRUE FALSE FALSE
student$myscores
## [1] 80 75 55
student$mygrades
## [1] "B" "C" "D-"
student$myoffice_hours
## [1] TRUE FALSE FALSE
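Names and indexes can be combined. The following sketch, again using a hypothetical list called demo_student, lists the member names and then picks single elements out of named members.

```r
demo_student <- list(myscores = c(80, 75, 55),
                     mygrades = c("B", "C", "D-"))

names(demo_student)             # "myscores" "mygrades"
demo_student$myscores[2]        # 75
demo_student[["mygrades"]][3]   # "D-"; [["name"]] works like $name
```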
All columns in a matrix must have the same data type and the same length. The following creates a numeric matrix called x_mat, which consists of 5 rows and 4 columns filled with the sequential numbers 1 to 20.
x_mat = matrix(1:20, nrow=5, ncol=4)
x_mat[,4] # This retrieves and displays the 4th column of the matrix.
## [1] 16 17 18 19 20
x_mat[3,] # This retrieves and displays the 3rd row of the matrix.
## [1] 3 8 13 18
x_mat[2:4,1:3] # This retrieves and displays rows 2,3,4 of columns 1,2,3 of the matrix.
## [,1] [,2] [,3]
## [1,] 2 7 12
## [2,] 3 8 13
## [3,] 4 9 14
# TASK 10: Retrieve and display rows 2,3,4 of columns 1,2,3,4 of the matrix
x_mat[2:4,1:4]
## [,1] [,2] [,3] [,4]
## [1,] 2 7 12 17
## [2,] 3 8 13 18
## [3,] 4 9 14 19
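By default matrix() fills the matrix column by column, which is why the 4th column of x_mat holds 16 to 20. A smaller sketch makes the fill order easier to see; the names demo_mat and by_row are only for this illustration.

```r
demo_mat <- matrix(1:6, nrow = 2, ncol = 3)   # filled column by column
dim(demo_mat)    # 2 3 (rows, columns)
demo_mat[1, ]    # 1 3 5
demo_mat[2, 3]   # 6

# byrow = TRUE fills the matrix row by row instead.
by_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
by_row[1, ]      # 1 2 3
by_row[2, 1]     # 4
```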
A data frame is more general than a matrix, in that different columns can have different data types (numeric, character, logical, factor). It is a powerful way to work with mixed data structures.
When we need to store data in table form, we use data frames, which are created by combining lists of vectors of equal length. The variables of a data set are the columns and the observations are the rows.
The str() function helps us to display the internal structure of any R data structure or object to make sure that it’s correct.
str(top_il_income)
## 'data.frame': 10 obs. of 5 variables:
## $ rank : int 2 3 32 44 67 16 4 8 5 90
## $ county : Factor w/ 10 levels "DuPage","Kane",..: 1 4 5 7 8 3 10 6 2 9
## $ per_capita_income: int 38931 38459 33118 33059 31750 31110 30791 30728 30645 30594
## $ population : int 933736 703910 46045 33879 16387 123355 687263 266209 530847 7032
## $ region : int 2 2 4 5 4 2 2 5 2 4
As an example, here is a snapshot of the solar system stored as vectors of equal length.
name = c("Earth", "Mars", "Jupiter")
type = c("Terrestrial","Terrestrial", "Gas giant")
diameter = c(1, 0.532, 11.209)
rotation = c(1, 1.03, 0.41)
rings = c(FALSE, FALSE, TRUE)
Now, by combining the vectors of equal size, we can create a data frame object.
planets_df = data.frame(name,type,diameter,rotation,rings)
planets_df
## name type diameter rotation rings
## 1 Earth Terrestrial 1.000 1.00 FALSE
## 2 Mars Terrestrial 0.532 1.03 FALSE
## 3 Jupiter Gas giant 11.209 0.41 TRUE
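Once a data frame exists, its columns can be inspected and its rows filtered. The sketch below rebuilds a reduced version of the planets table under the hypothetical name demo_planets; stringsAsFactors = FALSE is set explicitly so that the name column stays character in any R version.

```r
demo_planets <- data.frame(name = c("Earth", "Mars", "Jupiter"),
                           diameter = c(1, 0.532, 11.209),
                           rings = c(FALSE, FALSE, TRUE),
                           stringsAsFactors = FALSE)

# Each column keeps its own class.
col_classes <- sapply(demo_planets, class)
col_classes   # "character" "numeric" "logical"

# A logical column can be used to select rows.
ringed <- demo_planets[demo_planets$rings, "name"]
ringed        # "Jupiter"

mean(demo_planets$diameter)   # 4.247
```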
Datacamp - Learn Data Science from your browser: https://www.datacamp.com/courses/free-introduction-to-r
R-tutor - An R intro to stats that explains basic R concepts: http://www.r-tutor.com/r-introduction
Data samples used in this worksheet were downloaded from the U.S. Census Bureau American FactFinder site: “SELECTED ECONOMIC CHARACTERISTICS 2006-2010 American Community Survey 5-Year Estimates”, U.S. Census Bureau. Retrieved 2016-09-09: https://factfinder.census.gov/faces/nav/jsf/pages/index.xhtml