After you install base R, you can extend the functionality by installing packages from within R itself. For example, the package that was customized for this course, known as dslabs, is installed using:
install.packages("dslabs")
Alternatively, in RStudio you can use Tools > Install Packages, which offers autocomplete for package names.
Once a package is installed, you load it into the current R session with the library() function. Loading a package makes its functions and data sets available for that session:
library(dslabs)
install.packages("dslabs") # to install a single package
install.packages("dslabs", dependencies = TRUE) # installs the package, its required dependencies, and its suggested packages
install.packages(c("tidyverse", "dslabs")) # to install two packages at the same time
installed.packages() # to see the list of all installed packages
Once a package has been installed, it sits in a library directory alongside R and is tied to the R version rather than being part of R itself, which is why packages usually must be re-installed when R is upgraded to a new version.
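A common pattern is to install a package only when it is missing. The sketch below is one way to do that; the helper name ensure_installed is hypothetical (it is not part of R or dslabs), and it relies on requireNamespace(), which reports whether a package can be loaded.

```r
# Hypothetical helper: install a package only if it is not already available.
ensure_installed <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # Report (invisibly) whether the package is available afterwards.
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# "stats" ships with base R, so this call never triggers a download.
ensure_installed("stats")
```

In a course script you would call ensure_installed("dslabs") once at the top, followed by library(dslabs).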
<- is the assignment symbol in R (looks like a little arrow). A VERY handy shortcut is Alt+minus on PC, or Option+- on a MAC.
You can use the = sign as well, but doing so may cause confusion (recall learning the G chord twice on guitar).
a <- 1
b <- 1
c <- -1
At any point, you can run ls() to list the names of all the objects in your current workspace (it does not show their values), or you can view the variables and their values in the Environment tab in RStudio (shortcut Ctrl+8).
ls()
## [1] "a" "b" "c" "Titanic"
While variables are examples of objects, objects can store much more complex information.
Variable names in R MUST start with a letter (or a dot that is not followed by a digit) and CANNOT contain spaces. You should avoid using the names of prebuilt or preexisting R constructs. Use meaningful words that describe what is stored, stick to lowercase, and use underscores instead of spaces.
“Once you define variables, a data analysis process can usually be described as a series of functions applied to the data” - Rafael A. Irizarry
If you type a function name without the parentheses, for example ls instead of ls(), R shows you the code for the function and does not evaluate it. Most functions require one or more arguments, and function calls can be nested.
At any time you can use the help() function, passing in the function name, to get details on a function, including its description, usage, and arguments. Alternatively, you can use the shorthand ?log. You can also check which arguments a function expects using args():
help("log")
?log
args(log)
## function (x, base = exp(1))
## NULL
To specify an argument by name, we use the equals sign. If no argument names are used, R matches the arguments by position, in the order shown in the help file.
R has built-in functions for many commonly required operations. While most functions need parentheses to be evaluated, there are exceptions, such as the arithmetic and relational operators:
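As a small illustration of named versus positional arguments, the calls below all use log(), whose signature (shown by args(log) above) is function (x, base = exp(1)). The numbers are arbitrary example values.

```r
log(8, base = 2)      # 'x' by position, 'base' by name: 3
log(x = 8, base = 2)  # both arguments named: 3
log(base = 2, x = 8)  # named arguments may appear in any order: 3
log(8, 2)             # purely positional, x = 8 and base = 2: 3
log(8)                # 'base' omitted, so the default natural log is used
```

Naming arguments is slightly more typing, but it protects against silently passing values in the wrong order.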
help("+") #shows arithmetic operators
help(">") #shows relational operators
There are also a host of ‘helper’ functions to access other useful features:
sqrt(144) #calculates the square root of the provided argument
data() #shows the currently available data sets
Titanic #displays the dataset
Titanic %>% as_tibble() #displays tibble of dataset (requires dplyr library)
pi
Inf+1
The function class() helps us determine the type of an object.
myVariable <- 1
myCharString <- "character string"
myNumericVector <- c(1,2,3)
myCharVector <- c("a","b","c")
myLogicalVector <- c(TRUE, TRUE, FALSE)
myFactor <- factor(c("Single", "Married", "Married", "Single"))
class(myVariable)
## [1] "numeric"
class(myCharString)
## [1] "character"
class(ls) #the function
## [1] "function"
class(myNumericVector) #numeric vector
## [1] "numeric"
class(myCharVector) #character vector
## [1] "character"
class(myLogicalVector) #logical vector
## [1] "logical"
class(murders) #the data frame
## [1] "data.frame"
class(myFactor) #factor
## [1] "factor"
A vector is an object consisting of several entries and can be a numeric vector, a character vector, or a logical vector.
Data frames can be thought of as tables with rows representing observations and columns representing different variables. To access data from columns of a data frame, we use the dollar sign symbol, $, which is called the accessor. A data frame is a list of vectors of equal length.
Data frames are the most common form of data repository used in R.
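To make the "list of vectors of equal length" idea concrete, here is a minimal sketch that builds a tiny data frame by hand. The data and the name scores are made up for illustration and are not part of dslabs.

```r
# Hypothetical toy data: two columns (vectors) of equal length.
scores <- data.frame(
  student = c("a", "b", "c"),
  mark = c(90, 85, 77)
)

scores$student     # the $ accessor returns the column as a vector
scores[["mark"]]   # double square brackets do the same, by name
is.list(scores)    # TRUE: a data frame is a list of columns
length(scores)     # 2, the number of columns
nrow(scores)       # 3, the number of observations (rows)
```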
The function factor() is used to encode a vector as a factor (the terms ‘category’ and ‘enumerated type’ are also used for factors). A factor stores each category as an integer code together with a set of labels, called levels. In earlier versions of R, storing character data as a factor was more space efficient if there was even a small proportion of repeats. However, identical character strings now share storage, so the difference is small in most cases. Factors are still sometimes required to fit statistical models that depend on categorical data.
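The integer codes behind a factor can be inspected directly. The sketch below reuses the same four values as myFactor above; the variable names status and status2 are only illustrative.

```r
status <- factor(c("Single", "Married", "Married", "Single"))
levels(status)       # "Married" "Single": sorted alphabetically by default
as.integer(status)   # 2 1 1 2: the integer codes behind the labels
table(status)        # counts per category

# Levels can be given an explicit order when the default is not wanted.
status2 <- factor(status, levels = c("Single", "Married"))
as.integer(status2)  # 1 2 2 1
```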
You can obtain structural information about a data frame by using the str() (structure) function:
str(murders) #returns the structure of a data frame
## 'data.frame': 51 obs. of 5 variables:
## $ state : chr "Alabama" "Alaska" "Arizona" "Arkansas" ...
## $ abb : chr "AL" "AK" "AZ" "AR" ...
## $ region : Factor w/ 4 levels "Northeast","South",..: 2 4 4 2 4 4 1 2 2 2 ...
## $ population: num 4779736 710231 6392017 2915918 37253956 ...
## $ total : num 135 19 232 93 1257 ...
str(myFactor) #returns the structure of a factor
## Factor w/ 2 levels "Married","Single": 2 1 1 2
Comparing Vector, Array, List and Data Frame in R
The most basic unit for storing data in R is the vector. In tabular form, a vector is analogous to a column. A vector is a series of values that are all of the same type. Vectors can be populated with numbers (2, 4, 6000), character strings (“one”, “two”, “three”), or the values of variables (a, b, c).
When creating vectors, it is often useful to combine, or concatenate, multiple items into the vector object. For this, a common tool is the c() function:
codes <- c(380, 124, 818)
country <- c("italy", "canada", "egypt")
codesLabeledWithoutQuotes <- c(italy = 380, canada = 124, egypt = 818) #same result with or without quotes
codesLabeledWithQuotes <- c("italy" = 380, "canada" = 124, "egypt" = 818) #same result with or without quotes
Provided the vector was created with labels, the names() function will return them. For example:
names(codesLabeledWithQuotes)
## [1] "italy" "canada" "egypt"
If however, the vector was not created with labels, you will receive a NULL response when executing the names function:
names(codes)
## NULL
Additionally, if the entries of a vector are named, they may be accessed by referring to their name:
codesLabeledWithQuotes["canada"]
## canada
## 124
codesLabeledWithQuotes[c("egypt","italy")]
## egypt italy
## 818 380
One vector can also be attached to another as its names using names(); this works cleanly only if the two vectors are of equal length:
names(codes)<-country
codes
## italy canada egypt
## 380 124 818
Another common operation on vectors is creating sequences or slicing the data using sequence arithmetic. The basic seq() function works like this:
seq(1,10,1) #creates a sequence from 1 to 10 by increments of 1
## [1] 1 2 3 4 5 6 7 8 9 10
seq(1,10,2) #creates a sequence from 1 to 10 by increments of 2
## [1] 1 3 5 7 9
1:10 #shortcut creates a sequence from 1 through 10, but does not permit increment variability
## [1] 1 2 3 4 5 6 7 8 9 10
The seq() function also lets us slice sections of a vector, such as the first 10 items (head), the last 10 items (tail), or all even- or odd-positioned elements.
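A minimal sketch of slicing with sequences, using a made-up vector x and slices of three items instead of ten to keep the output short:

```r
x <- seq(10, 100, by = 10)        # 10 20 30 ... 100

x[seq(1, 3)]                      # first three items: 10 20 30
x[seq(length(x) - 2, length(x))]  # last three items: 80 90 100
x[seq(2, length(x), by = 2)]      # even positions: 20 40 60 80 100
x[seq(1, length(x), by = 2)]      # odd positions: 10 30 50 70 90

head(x, 3)                        # same result as the first slice
tail(x, 3)                        # same result as the second slice
```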
When subsetting a dataset, use the square brackets [ ] to access a specified element, either individually or sequentially:
codes[1] #returns the first element in the vector
## italy
## 380
codes[c(1,3)] #returns the first AND third elements in the vector
## italy egypt
## 380 818
codes[1:3] #returns a sequence of the first THROUGH third elements of a vector
## italy canada egypt
## 380 124 818
There are a number of functions that can be leveraged when handling vectors.
max(murders$total) #returns the maximum of a numeric vector OR
## [1] 1257
max(myCharVector) #returns the maximum character - the character closest to z
## [1] "c"
min(murders$total) #returns the minimum of a numeric vector OR
## [1] 2
min(myCharVector) #returns the minimum character - the character closest to a
## [1] "a"
which.max(murders$total) #returns the index of the (first) maximum of a numeric vector (a utility)
## [1] 5
There will be times when you want to answer a question such as “What is the sex of the tallest person in the data?” or “Which value of one variable is associated with the maximum of another?”. Here we can leverage the which.max() and which.min() functions:
murders$state[which.max(murders$total)] #returns the state, located at the index of the maximum numeric in the total column
## [1] "California"
murders$state[which.min(murders$total)] #returns the state, located at the index of the minimum numeric in the total column
## [1] "Vermont"
#returns the sex, located at the index of the maximum numeric in the height column
data("heights")
heights$sex[which.max(heights$height)]
## [1] Male
## Levels: Female Male
In general, coercion (or typecasting) is an attempt by R to be flexible with data types by guessing what was meant when an entry does not match the expected type.
The as.*() family of functions, such as as.character() and as.numeric(), is used to assist with operations requiring coercion. Each one forces an object to belong to the named class:
class(as.character(14))
## [1] "character"
class(as.numeric("1"))
## [1] "numeric"
class(as.character(1:5)) #as.character typecasts the arg to a character
## [1] "character"
class(as.numeric(c("1","2","3"))) #as.numeric typecasts the arg to a numeric; the strings must be combined with c() into one vector
## [1] "numeric"
This functionality is quite useful in practice as many datasets that include numbers include them in a form that makes them appear to be character strings.
In practice, NA or ‘Not Available’ will appear if data is absent or cannot be properly coerced into a data type by R. This is VERY common in data science.
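A short sketch of how a failed coercion produces NA, using a made-up character vector raw. The warning R emits is silenced here with suppressWarnings() only to keep the example quiet.

```r
raw <- c("1", "b", "3")

# "b" cannot be read as a number, so it becomes NA (R also emits a warning).
nums <- suppressWarnings(as.numeric(raw))
nums                      # 1 NA 3
is.na(nums)               # FALSE TRUE FALSE

mean(nums)                # NA, because NA propagates through arithmetic
mean(nums, na.rm = TRUE)  # 2, after dropping the NA
```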
Sorting can be confusing at first, because three related functions do different jobs. We need to focus on sort(), order(), and rank():
The function sort() sorts a vector in increasing order.
x <- c(31, 4, 15, 92, 65)
sort(x)
## [1] 4 15 31 65 92
sort(murders$total) #sorts a vector in increasing order
## [1] 2 4 5 5 7 8 11 12 12 16 19 21 22 27 32
## [16] 36 38 53 63 65 67 84 93 93 97 97 99 111 116 118
## [31] 120 135 142 207 219 232 246 250 286 293 310 321 351 364 376
## [46] 413 457 517 669 805 1257
sort(murders$total,decreasing = TRUE) #sorts a vector in decreasing order
## [1] 1257 805 669 517 457 413 376 364 351 321 310 293 286 250 246
## [16] 232 219 207 142 135 120 118 116 111 99 97 97 93 93 84
## [31] 67 65 63 53 38 36 32 27 22 21 19 16 12 12 11
## [46] 8 7 5 5 4 2
The order() function returns the indices that would sort the input vector. This index can then be applied to another vector of the same length, so that both end up arranged in the same order. It is like saying ‘Tell me which positions to pick, in sequence, so that this vector comes out sorted’, and then picking those same positions from the other vector.
x <- c(31, 4, 15, 92, 65)
x # original vector whose order we want to apply to another
## [1] 31 4 15 92 65
order(x) #returns the indices that put x into increasing order: x[order(x)] is the same as sort(x)
## [1] 2 3 1 5 4
order(murders$total) #order returns the index which rearranges its first argument into ascending order (a utility)
## [1] 46 35 30 51 12 42 20 13 27 40 2 16 45 49 28 38 8 24 17 6 32 29 4 48 7
## [26] 50 9 37 18 22 25 1 15 41 43 3 31 47 34 21 36 26 19 14 11 23 39 33 10 44
## [51] 5
order(murders$total,decreasing = TRUE) #order returns the index which rearranges its first argument into descending order (a utility)
## [1] 5 44 10 33 39 23 11 14 19 26 36 21 34 47 31 3 43 41 15 1 25 22 18 37 9
## [26] 7 50 4 48 29 32 6 17 24 8 38 28 49 45 16 2 40 13 27 20 42 12 30 51 35
## [51] 46
ind <- order(murders$total,decreasing = TRUE) # 1.create a temporary vector that is indexed to arrange totals in decreasing order
murders$abb[ind] # 2.applies the index order to abbreviations using the vector created in the preceding step
## [1] "CA" "TX" "FL" "NY" "PA" "MI" "GA" "IL" "LA" "MO" "OH" "MD" "NC" "VA" "NJ"
## [16] "AZ" "TN" "SC" "IN" "AL" "MS" "MA" "KY" "OK" "DC" "CT" "WI" "AR" "WA" "NV"
## [31] "NM" "CO" "KS" "MN" "DE" "OR" "NE" "WV" "UT" "IA" "AK" "RI" "ID" "MT" "ME"
## [46] "SD" "HI" "NH" "WY" "ND" "VT"
murders$abb[order(murders$total,decreasing = TRUE)] # 3.does 1 and 2 in a single step
## [1] "CA" "TX" "FL" "NY" "PA" "MI" "GA" "IL" "LA" "MO" "OH" "MD" "NC" "VA" "NJ"
## [16] "AZ" "TN" "SC" "IN" "AL" "MS" "MA" "KY" "OK" "DC" "CT" "WI" "AR" "WA" "NV"
## [31] "NM" "CO" "KS" "MN" "DE" "OR" "NE" "WV" "UT" "IA" "AK" "RI" "ID" "MT" "ME"
## [46] "SD" "HI" "NH" "WY" "ND" "VT"
The rank() function returns, for each item in the original vector, its position in the sorted vector (1 for the smallest).
x <- c(31, 4, 15, 92, 65)
x # original vector we want to rank
## [1] 31 4 15 92 65
rank(x)
## [1] 3 1 2 5 4
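The three functions are related, and a few identities can help keep them apart. This is a small sketch using the same x as above; the last identity holds only when the vector has no ties.

```r
x <- c(31, 4, 15, 92, 65)

identical(x[order(x)], sort(x))  # TRUE: indexing x by order(x) sorts it
sort(x)[rank(x)]                 # 31 4 15 92 65: rank() maps sorted values back to x
order(order(x))                  # 3 1 2 5 4: equals rank(x) when there are no ties
```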
In R, arithmetic operations on vectors occur element-wise. The arithmetic operation is applied to each element of the vector.
heights_in <- c(69,62,66,70,70,73,67,73,67,70) #heights in inches (named heights_in so the heights dataset loaded earlier is not overwritten)
heights_in * 2.54 #heights converted to centimeters in one vector arithmetic operation
## [1] 175.26 157.48 167.64 177.80 177.80 185.42 170.18 185.42 170.18 177.80
If you have two vectors of the same length and datatype, these operations are even more powerful:
mya <- c(2,4,6,8,10) #first numeric vector
myb <- c(4,6,8,10,12) #second numeric vector
myc <- c("a","b","c","d","e") #character vector: multiplying it by a numeric vector stops with the run-time error 'non-numeric argument to binary operator'
myd <- c(1,2,3) #numeric vector of unequal length: R recycles it and issues the warning 'longer object length is not a multiple of shorter object length'
mya * myb
## [1] 8 24 48 80 120
#mya * myc
#mya * myd
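The two commented-out lines behave differently when run: the character case is an error, while the unequal-length case only produces a warning because R recycles the shorter vector. The sketch below captures both outcomes; res_chr and res_len are illustrative names.

```r
mya <- c(2, 4, 6, 8, 10)
myc <- c("a", "b", "c", "d", "e")
myd <- c(1, 2, 3)

# Numeric times character stops with an error; tryCatch() captures its message.
res_chr <- tryCatch(mya * myc, error = function(e) conditionMessage(e))
res_chr

# Unequal lengths only warn: myd is recycled to 1 2 3 1 2 before multiplying.
res_len <- suppressWarnings(mya * myd)
res_len   # 2 8 18 8 20
```

Because recycling does not stop the script, it is worth checking lengths explicitly when two vectors are meant to line up.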
log2(2) #log of 2 to base 2
log10(2) #log of 2 to base 10
exp(2) #e raised to the power 2 (exp is the inverse of the natural log)
class(a) #returns the class of the argument
data(murders) #adds the murders data frame from the dslabs library to the workspace, try Ctrl+8 to see the environment
str(murders) #provides structural details on the data frame
names(murders) #returns the feature names of the data frame
head(murders) #shows the top 6 lines of the data frame
tail(murders) #shows the last 6 lines of the data frame
murders$population #using the $ accessor we can list the values of a feature/column, aka a vector
length(murders) #returns the length of the object, in this case 5 features/columns
length(murders$population) #returns the length of the vector, in this case 51 entries
class(murders$region) #returns factor, a categorical vector
levels(murders$region) #like names() returns the labels of a vector, levels() returns the categories of a factor
record <- list(name="Reilly Morris", #a list can be created manually and can hold items of different types and lengths (a data frame is a special kind of list); assigning names is optional
SIN=123456789,
grades=c(95,82,91,97,93),
final_grade="A")
record$name #you can access list items in the same way as data frame columns, using the $ accessor, PROVIDED names were specified
record[["name"]] #alternatively, you can access an item in the list using double square brackets
murders[25,1] #returns the 25th element in the 1st column of a data frame
murders[2:3,] #returns all columns for rows 2 through 3
murders[,2:3] #returns all rows for columns 2 through 3
table(murders$region) #returns the frequency of each value in a vector (how many times does each region appear)
table(heights$sex)
prop.table(table(heights$sex)) #converts the counts into proportions
barplot(heights$height)
hist(heights$height) #draws a histogram of the heights (see ?hist for help)
temp <- c(35,88,42,84,81,30)
city <- c("Beijing","Lagos","Paris","Rio de Janeiro","San Juan","Toronto")
city_temps <- data.frame(name=city,temperature=temp) #creates a data frame and labels the columns, populating each with the vectors provided as args
data("na_example") #loads the sample vector that contains NA values
is.na(na_example) #returns a logical vector that tells us which entries are NA
data("na_example")
dataWithNAs <- na_example
dataWithoutNAs <- dataWithNAs[!is.na(dataWithNAs)]
dataWithoutNAs
## [1] 2 1 3 2 1 3 1 4 3 2 2 2 2 1 4 1 1 2 1 2 2 1 2 5 2 2 3 1 2 4 1 1 1 4 5 2 3
## [38] 4 1 2 4 1 1 2 1 5 1 1 5 1 3 1 4 4 7 3 2 1 4 1 2 2 3 2 1 2 2 4 3 4 2 3 1 3
## [75] 2 1 1 1 3 1 3 1 2 2 1 2 2 1 1 4 1 1 2 3 3 2 2 3 3 3 4 1 1 1 2 4 3 4 3 1 2
## [112] 1 1 5 1 2 1 3 5 3 2 2 3 5 3 1 1 4 2 4 3 3 2 3 2 6 1 1 2 2 1 3 1 1 5 2 4 2
## [149] 5 1 4 3 3 4 3 1 4 1 1 3 1 1 3 5 2 2 2 3 1 2 2 3 2 1 2 1 2 1 1 3 1 2 2 1 3
## [186] 2 2 1 1 2 3 1 1 1 4 3 4 2 2 1 4 1 5 1 4 3 1 1 5 2 3 3 2 4 3 2 5 2 3 4 6 2
## [223] 2 2 2 2 3 3 2 2 4 3 1 4 2 2 4 6 2 3 1 2 2 1 1 3 2 3 3 1 1 4 2 1 1 3 2 1 2
## [260] 3 1 2 3 3 2 1 2 3 5 5 1 2 3 3 1 1 2 4 2 1 1 1 3 2 1 1 3 4 1 2 1 1 3 3 1 1
## [297] 3 5 3 2 3 4 1 4 3 1 2 1 2 2 1 2 2 6 1 2 4 5 3 4 2 1 1 4 2 1 1 1 1 2 1 4 4
## [334] 1 3 3 3 2 1 2 1 1 4 2 1 4 4 1 2 3 2 2 2 1 4 3 6 1 2 3 1 3 2 2 2 1 1 3 2 1
## [371] 1 1 3 2 2 4 4 4 1 1 4 3 1 3 1 3 2 4 2 2 2 3 2 1 4 3 1 4 3 1 3 2 3 1 3 1 4
## [408] 1 1 1 2 4 3 1 2 2 2 3 2 3 1 1 3 2 1 1 2 2 2 2 3 3 1 1 2 1 2 1 1 3 3 1 3 1
## [445] 1 1 1 1 2 5 1 1 2 2 1 1 1 4 1 2 4 1 3 2 1 1 2 1 1 4 2 3 3 1 5 3 1 1 2 1 1
## [482] 3 1 3 2 4 2 3 2 1 2 1 1 1 2 2 3 1 5 2 2 3 2 2 2 1 5 3 2 3 1 3 1 2 2 2 1 2
## [519] 2 4 6 1 2 1 1 2 2 3 3 2 3 3 4 2 2 4 1 1 2 2 3 1 1 1 3 2 5 7 1 4 3 3 1 1 1
## [556] 1 1 3 2 4 2 2 3 1 4 3 2 2 2 3 2 4 2 2 4 6 3 3 1 4 4 2 1 1 6 3 3 2 1 1 6 1
## [593] 5 1 2 6 2 4 1 3 1 2 1 1 3 1 2 4 2 1 3 2 4 3 2 2 1 1 5 6 4 2 2 2 2 4 1 2 2
## [630] 2 2 4 5 4 3 3 3 2 4 2 4 2 1 2 4 3 2 2 3 1 3 4 1 2 1 2 3 1 2 1 2 1 2 1 2 2
## [667] 2 2 1 1 3 3 1 3 4 3 4 2 3 2 1 3 2 4 2 2 3 1 2 4 3 3 4 1 4 2 1 1 1 3 1 5 2
## [704] 2 4 2 1 3 1 2 1 2 1 2 1 1 3 2 3 2 2 1 4 2 2 4 2 3 1 5 5 2 2 2 2 1 3 1 3 2
## [741] 4 2 4 4 1 2 3 2 3 3 2 3 2 2 2 1 3 2 4 2 3 3 2 2 3 2 1 2 4 1 1 1 1 4 3 2 3
## [778] 2 1 3 2 1 1 1 2 2 2 3 3 2 4 5 2 2 2 1 2 3 1 3 3 4 3 1 1 1 4 3 5 1 1 2 2 2
## [815] 2 2 5 2 2 3 1 2 3 1 2 2 3 1 1 2 5 3 5 1 1 4 2 1 3 1 1 2 4 3 3 3 1 1 2 2 1
## [852] 1 2 2 2
sum(is.na(na_example)) #returns the number of NAs in the na_example vector (each TRUE counts as 1)
sum(!is.na(na_example)) #returns the number of NON-NA entries in the na_example vector, leveraging the ! operator
inches <- c(1,66,55,77,96)
inches * 2.54 #using vector arithmetic, an operation can be applied to all contents of a vector, here converting inches to cm
murders$total + murders$population #when two equal-length vectors are combined with +, -, * or /, the operation acts on each pair of corresponding elements
murders$total * murders$population
murder_rate <- murders$total/murders$population*100000 #the same element-wise rule gives the number of murders per 100,000 people
murder_rate
murders$state[order(murder_rate)] #based on the murder rate, lowest is Vermont, highest is District of Columbia
temp <- c(35,88,42,84,81,30)
city <- c("Beijing","Lagos","Paris","Rio de Janeiro","San Juan","Toronto")
city_temps <- data.frame(name=city,temperature=(5/9*(temp-32))) #apply the conversion from F to C to the entire vector
city_temps
ind <- murder_rate<=0.71 # 1. creates a logical vector that is TRUE wherever the rate is equal to or lower than 0.71 (a utility)
murders$state[ind] # 2. selects from the state column all entries where ind is TRUE
sum(ind) #counts how many states meet the condition
ind <- which(murders$state=="California") #which() returns the indices of the TRUE values (a utility)
murder_rate[ind]
ind <- which(murder_rate>5) #which() returns the indices of the TRUE values (a utility)
murders$state[ind]
ind <- match(c("Delaware","Indiana","Nevada"),murders$state) #match() returns a vector of the positions of (first) matches of its first argument in its second
ind
city %in% ind #%in% is a more intuitive binary operator: it returns a logical vector indicating, for each element of its left operand, whether there is a match in its right operand
match(c("New York","Florida","Texas"),murders$state) # 1. returns the indices of the first in the second, in the order of the first argument
which(murders$state%in%c("New York","Florida","Texas")) # 2. returns the same indices, in increasing order
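The difference between match(), %in%, and which() is easier to see on a small example. The vectors states and wanted below are hypothetical stand-ins for murders$state and a set of names to look up.

```r
# Hypothetical toy vectors, standing in for murders$state and a lookup list.
states <- c("Alabama", "Alaska", "Arizona", "Arkansas")
wanted <- c("Arizona", "Texas", "Alabama")

match(wanted, states)      # 3 NA 1: position of each 'wanted' entry in 'states'
wanted %in% states         # TRUE FALSE TRUE: is each 'wanted' entry present?
which(states %in% wanted)  # 1 3: positions in 'states', in the order of 'states'
```

Note that match() reports a missing entry as NA, while %in% simply reports FALSE, which makes %in% the safer choice inside a filter.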
x <- murders$population/10^6
y <- murders$total
plot(x,y) #draws a scatterplot of population (in millions) on x, totals on y
with(murders,plot(population,total)) #or plot the columns directly in a single step using with()
hist(with(murders,murder_rate)) #using with() works for other things too, like histograms
hist(with(murders,log2(murder_rate))) #and getting fancy using log2 from Section II!
murders$rate <- with(murders,total/population*100000) #using with() added a column WITHOUT mutate! with() evaluates an R expression in an environment constructed from data, possibly modifying (a copy of) the original data.
names(murders)
boxplot(rate~region,data=murders) #produces box-and-whisker plot(s) of the given (grouped) values; the formula is ordered y then x
library(dplyr) #provides mutate(), filter(), select() and the %>% pipe
murders <- mutate(murders,rate=total/population*100000) #mutate() adds new variables and preserves existing ones
#One of dplyr's main features is that column names can be used directly, without the need for accessor calls
#Note that mutate() only modifies the working copy of the murders dataset; the original is unmodified
filter(murders,rate<0.71) #keeps the rows of the first argument that satisfy the condition in the second argument
filter(murders,state!="Florida")
filter(murders,state%in%c("New York","Florida"))
filter(murders,region%in%c("Northeast","West"))
filter(murders,region%in%c("Northeast","West")&rate<1) #can be done like this or...
murders%>%select(state,region,rate)%>%filter(region%in%c("Northeast","West")&rate<1) #MUCH prettier version of the previous step
temp_table <- select(murders,state,region,rate) #select() returns a new data frame including only the specified columns from the first argument
class(temp_table)
16%>%sqrt()%>%log2()
murders%>%select(state,region,rate)%>%filter(region%in%c("Northeast","West")&rate<1) #when using pipes, we no longer need to specify the first argument, since the pipe passes the result on its left as the first argument of the next function
#data%>%mutate something%>%select something%>%filter something
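all() #returns TRUE only if every value of the logical vector passed to it is TRUE
any() #returns TRUE if at least one value of the logical vector passed to it is TRUE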
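A brief sketch of all() and any() applied to logical vectors; the vectors z and x are made-up example values.

```r
z <- c(TRUE, TRUE, FALSE)
any(z)       # TRUE: at least one element is TRUE
all(z)       # FALSE: not every element is TRUE

x <- c(3, 7, 12)
all(x > 0)   # TRUE: every element is positive
any(x > 10)  # TRUE: 12 is greater than 10
```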
#if-else: a basic conditional evaluates a single logical value and runs one of two branches.
#The related ifelse() function is particularly useful because it works on vectors. It examines
#each element of a logical vector and returns an answer for each. A very common use of this
#function is replacing NAs in a vector with some other value, such as 0.
if(9 > 8){print("first")}else{print("second")}
#Replace NAs:
data(na_example)
no_nas <- ifelse(is.na(na_example), 0, na_example)
sum(is.na(no_nas))
#alternatively
ifelse(9 > 8, Inf, NA)
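The contrast between the vectorized ifelse() and a plain if statement can be sketched as follows. The vector a and the variable msg are illustrative only.

```r
a <- c(0, 1, 2, -4, 5)

# ifelse() answers once per element of the logical vector a > 0.
ifelse(a > 0, 1 / a, NA)   # NA 1.0 0.5 NA 0.2

# if() needs a single TRUE or FALSE, so the vector is reduced with all() first.
if (all(a > 0)) {
  msg <- "all positive"
} else {
  msg <- "not all positive"
}
msg   # "not all positive"
```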
#custom functions
avg <- function(x){
s <- sum(x)
n <- length(x)
s/n
}
x <- 1:100
identical(mean(x), avg(x))
#for loops
for(i in 1:5){
print(i)
}
#sapply
x <- 1:10
sapply(x, sqrt)
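The custom function, for loop, and sapply() ideas can be combined in one sketch. The helper name compute_s_n is hypothetical; it returns the sum of the integers 1 through n, computed once with a loop and once with sapply().

```r
# Hypothetical helper: the sum of the integers 1 through n.
compute_s_n <- function(n) {
  sum(seq_len(n))
}

n <- 1:10

# Version 1: a for loop that fills a pre-allocated result vector.
s_loop <- vector("numeric", length(n))
for (i in n) {
  s_loop[i] <- compute_s_n(i)
}

# Version 2: sapply() applies the function to each element without a loop.
s_apply <- sapply(n, compute_s_n)

s_apply                 # 1 3 6 10 15 21 28 36 45 55
all(s_loop == s_apply)  # TRUE: both versions agree
```

In R the sapply() form is usually preferred, since it is shorter and avoids managing the index by hand.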