1.1 R Basics, Functions and Data Types

Objects

Create objects and store values

a <- 3
b <- 4
c <- a+b

Return stored values

a
## [1] 3

With the print function

print(c)
## [1] 7

Functions

We can use pre-built functions like install.packages, sqrt, ls or log

log(8)
## [1] 2.079442
log(a)
## [1] 1.098612

Call help files with the help function help(log) or ?log

To know the arguments needed in a function without using the help docs:

args(log)
## function (x, base = exp(1)) 
## NULL

We can override a default value by passing a different value for that argument

log(8,base=2)
## [1] 3

If the argument names are omitted, R matches the arguments by position, in the order shown by args()

log(8,2)
## [1] 3
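
When arguments are named, their order no longer matters. A minimal sketch using the same log call (the object names named and positional are only for illustration):

```r
# Named arguments can be supplied in any order
named <- log(base = 2, x = 8)
# Unnamed arguments are matched by position: x first, then base
positional <- log(8, 2)
named == positional
## [1] TRUE
```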

To solve a quadratic equation like 3x^2 + 2x - 1 = 0 with the quadratic formula

a <- 3
b <- 2
c <- -1
(-b + sqrt(b^2 - 4*a*c)) / (2*a)
(-b - sqrt(b^2 - 4*a*c)) / (2*a)
## [1] 0.3333333
## [1] -1
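
One way to check the two solutions is to plug them back into the polynomial. This is a small sketch, with root1 and root2 as illustrative names; all.equal is used because floating point arithmetic may leave a tiny rounding error instead of an exact zero.

```r
a <- 3
b <- 2
c <- -1
root1 <- (-b + sqrt(b^2 - 4*a*c)) / (2*a)
root2 <- (-b - sqrt(b^2 - 4*a*c)) / (2*a)
# Each root should make 3x^2 + 2x - 1 equal to zero, up to rounding error
isTRUE(all.equal(a*root1^2 + b*root1 + c, 0))
isTRUE(all.equal(a*root2^2 + b*root2 + c, 0))
```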

Data types

The function class helps us determine what type of object we have

a <- 3
class(a)
## [1] "numeric"

Data frames

Conceptually, we can think of a data frame as a table with rows representing observations and the different variables reported for each observation defining the columns. Data frames are particularly useful for datasets because we can combine different data types into one object.

library(dslabs)
data("murders")
class(murders)
## [1] "data.frame"

Examining an object

We can use the function str() to learn more about the structure of an object. In the example below we can see that it has 51 rows and 5 variables, along with the data type of each, e.g. state is a character vector and region is a factor

str(murders)
## 'data.frame':    51 obs. of  5 variables:
##  $ state     : chr  "Alabama" "Alaska" "Arizona" "Arkansas" ...
##  $ abb       : chr  "AL" "AK" "AZ" "AR" ...
##  $ region    : Factor w/ 4 levels "Northeast","South",..: 2 4 4 2 4 4 1 2 2 2 ...
##  $ population: num  4779736 710231 6392017 2915918 37253956 ...
##  $ total     : num  135 19 232 93 1257 ...

To inspect the first 6 rows we can use the function head()

head(murders)
##        state abb region population total
## 1    Alabama  AL  South    4779736   135
## 2     Alaska  AK   West     710231    19
## 3    Arizona  AZ   West    6392017   232
## 4   Arkansas  AR  South    2915918    93
## 5 California  CA   West   37253956  1257
## 6   Colorado  CO   West    5029196    65

The accessor $

We can use the accessor $ to access variables (columns) within a data frame

murders$population
##  [1]  4779736   710231  6392017  2915918 37253956  5029196  3574097   897934
##  [9]   601723 19687653  9920000  1360301  1567582 12830632  6483802  3046355
## [17]  2853118  4339367  4533372  1328361  5773552  6547629  9883640  5303925
## [25]  2967297  5988927   989415  1826341  2700551  1316470  8791894  2059179
## [33] 19378102  9535483   672591 11536504  3751351  3831074 12702379  1052567
## [41]  4625364   814180  6346105 25145561  2763885   625741  8001024  6724540
## [49]  1852994  5686986   563626

We can quickly access the variable names with the function names()

names(murders)
## [1] "state"      "abb"        "region"     "population" "total"

Vectors: numerics, characters, and logical

The object murders$population is not one number but several. We call these types of objects vectors. A single number is technically a vector of length 1, but in general we use the term vectors to refer to objects with several entries. The function length tells you how many entries are in the vector:

pop <- murders$population
length(pop)
## [1] 51

We can inspect the vector class

class(murders$population)
## [1] "numeric"

Vectors of class character store character strings

class(murders$state)
## [1] "character"

Entries of logical vectors must be either TRUE or FALSE

z <- 3==2
z
## [1] FALSE

Factors

region is a factor in the murders dataset.

class(murders$region)
## [1] "factor"

Factors store categorical data. In this case there are 4 levels

levels(murders$region)
## [1] "Northeast"     "South"         "North Central" "West"

We can use reorder to change the order of the levels, e.g. by total murders per region.

region <- murders$region
value <- murders$total
region <- reorder(region, value, FUN = sum)
levels(region)
## [1] "Northeast"     "North Central" "West"          "South"

Matrices

Matrices are similar to data frames in that they are two-dimensional: they have rows and columns. However, like numeric, character and logical vectors, entries in matrices have to be all the same type.

We can create a matrix with the function matrix, whose first three arguments are the values, the number of rows and the number of columns. The matrix is filled column by column.

mat <- matrix(1:12,4,3)
mat
##      [,1] [,2] [,3]
## [1,]    1    5    9
## [2,]    2    6   10
## [3,]    3    7   11
## [4,]    4    8   12
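
To fill the matrix row by row instead, matrix accepts the argument byrow = TRUE. A short sketch (mat_by_row is an illustrative name):

```r
# matrix() fills column by column by default; byrow = TRUE fills row by row
mat_by_row <- matrix(1:12, nrow = 4, ncol = 3, byrow = TRUE)
mat_by_row
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
## [3,]    7    8    9
## [4,]   10   11   12
```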

To access specific entries in a matrix

mat[2,3]
## [1] 10

If you want the entire second row

mat[2,]
## [1]  2  6 10

Similar procedure if we want the entire first column

mat[,1]
## [1] 1 2 3 4

To access a subset of the matrix

mat[3:4,2:3]
##      [,1] [,2]
## [1,]    7   11
## [2,]    8   12

We can also use brackets to access subsets of data frames

data("murders")
murders[25,1]
## [1] "Mississippi"

Using murders[1:6,] would be the same as using head(murders)

murders[1:6,]
##        state abb region population total
## 1    Alabama  AL  South    4779736   135
## 2     Alaska  AK   West     710231    19
## 3    Arizona  AZ   West    6392017   232
## 4   Arkansas  AR  South    2915918    93
## 5 California  CA   West   37253956  1257
## 6   Colorado  CO   West    5029196    65

     

1.2 Vectors and Sorting

Vectors are the most basic objects available to store data

Vectors

Creating Vectors

We use the function c, which stands for concatenate, to create vectors

codes <- c(380, 124, 818)
codes
## [1] 380 124 818
countries <- c("USA","Canada","Mexico")
countries
## [1] "USA"    "Canada" "Mexico"

 

Names

We can name the entries of a vector

country_codes <- c(USA=43, Canada=4, Mexico=33)
country_codes
##    USA Canada Mexico 
##     43      4     33
class(country_codes)
## [1] "numeric"
names(country_codes)
## [1] "USA"    "Canada" "Mexico"

 

Sequences

The function seq creates vectors containing sequences of numbers

ten_numbers <- seq(1,10)
ten_numbers
##  [1]  1  2  3  4  5  6  7  8  9 10
cerofive <- seq(-10,10,0.5)
cerofive
##  [1] -10.0  -9.5  -9.0  -8.5  -8.0  -7.5  -7.0  -6.5  -6.0  -5.5  -5.0  -4.5
## [13]  -4.0  -3.5  -3.0  -2.5  -2.0  -1.5  -1.0  -0.5   0.0   0.5   1.0   1.5
## [25]   2.0   2.5   3.0   3.5   4.0   4.5   5.0   5.5   6.0   6.5   7.0   7.5
## [37]   8.0   8.5   9.0   9.5  10.0
class(cerofive)
## [1] "numeric"
class(1:10)
## [1] "integer"

 

Subsetting

To access specific values of a vector

cerofive[3]
## [1] -9
cerofive[c(3,4)]
## [1] -9.0 -8.5

 

Coercion

R is flexible with data types. When a vector mixes types, R coerces the entries to a common type instead of raising an error

mix <- c(3, "Canada", TRUE)
mix
## [1] "3"      "Canada" "TRUE"
class(mix)
## [1] "character"

We can convert between data types

numbers <- c(1,3)
str(numbers)
##  num [1:2] 1 3
numchar <- as.character(numbers)
str(numchar)
##  chr [1:2] "1" "3"

Warning: when R cannot coerce an entry, it returns NA

string <- c("1", "Canada", "45")
as.numeric(string)
## Warning: NAs introduced by coercion
## [1]  1 NA 45
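
To find out which entries could not be converted, we can combine the result with is.na. This is only a sketch; converted is an illustrative name, and suppressWarnings is used here on the assumption that the warning is already expected.

```r
string <- c("1", "Canada", "45")
# suppressWarnings() silences the coercion warning once we know to expect it
converted <- suppressWarnings(as.numeric(string))
# is.na() flags the entries that could not be converted
string[is.na(converted)]
## [1] "Canada"
```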

 

Sorting

sort

Say we want to rank the states from least to most gun murders. The function sort sorts a vector in increasing order

library(dslabs)
data("murders")
sort(murders$total)
##  [1]    2    4    5    5    7    8   11   12   12   16   19   21   22   27   32
## [16]   36   38   53   63   65   67   84   93   93   97   97   99  111  116  118
## [31]  120  135  142  207  219  232  246  250  286  293  310  321  351  364  376
## [46]  413  457  517  669  805 1257

 

order

It takes a vector as input and returns the vector of indexes that sorts the input vector.

x <- c(31, 4, 15, 92, 65)
sort(x)
## [1]  4 15 31 65 92

Rather than sort the input vector, the function order returns the index that sorts input vector:

index <- order(x)
x[index]
## [1]  4 15 31 65 92
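
The function order also accepts decreasing = TRUE, which gives the indexes from the largest entry to the smallest. A brief sketch with the same x (index_desc is an illustrative name):

```r
x <- c(31, 4, 15, 92, 65)
# Indexes of x from largest to smallest value
index_desc <- order(x, decreasing = TRUE)
index_desc
## [1] 4 5 1 3 2
x[index_desc]
## [1] 92 65 31 15  4
```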

We can order the state abbreviations by total murders. The last entry shows that California had the most murders

index <- order(murders$total)
murders$abb[index]
##  [1] "VT" "ND" "NH" "WY" "HI" "SD" "ME" "ID" "MT" "RI" "AK" "IA" "UT" "WV" "NE"
## [16] "OR" "DE" "MN" "KS" "CO" "NM" "NV" "AR" "WA" "CT" "WI" "DC" "OK" "KY" "MA"
## [31] "MS" "AL" "IN" "SC" "TN" "AZ" "NJ" "VA" "NC" "MD" "OH" "MO" "LA" "IL" "GA"
## [46] "MI" "PA" "NY" "FL" "TX" "CA"

 

max and which.max

max returns the largest value in a vector

max(murders$total)
## [1] 1257

which.max will return the index of the entry with largest value

which.max(murders$total)
## [1] 5

To return the state name

murders$state[which.max(murders$total)]
## [1] "California"

or

maxmurder <- which.max(murders$total)
murders$state[maxmurder]
## [1] "California"

 

rank

rank returns a vector with the rank of the first entry, second entry, etc. of the input vector

rank(x)
## [1] 3 1 2 5 4

 

rank, sort and order

##   original order sort rank
## 1       31     2    4    3
## 2        4     3   15    1
## 3       15     1   31    2
## 4       92     5   65    5
## 5       65     4   92    4
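
A comparison table like the one above can be built by putting the three results side by side in a data frame. A minimal sketch (comparison is an illustrative name):

```r
x <- c(31, 4, 15, 92, 65)
# One column per function, so the results can be compared row by row
comparison <- data.frame(original = x,
                         order = order(x),
                         sort = sort(x),
                         rank = rank(x))
comparison
```

Reading the first row: the original entry is 31, the smallest value sits at index 2, the smallest value is 4, and 31 is the third smallest entry.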

 

Vector arithmetics

Arithmetic operations on vectors are applied element by element

murder_rate <- murders$total/murders$population*10^5
murder_rate
##  [1]  2.8244238  2.6751860  3.6295273  3.1893901  3.3741383  1.2924531
##  [7]  2.7139722  4.2319369 16.4527532  3.3980688  3.7903226  0.5145920
## [13]  0.7655102  2.8369608  2.1900730  0.6893484  2.2081106  2.6732010
## [19]  7.7425810  0.8280881  5.0748655  1.8021791  4.1786225  0.9992600
## [25]  4.0440846  5.3598917  1.2128379  1.7521372  3.1104763  0.3798036
## [31]  2.7980319  3.2537239  2.6679599  2.9993237  0.5947151  2.6871225
## [37]  2.9589340  0.9396843  3.5977513  1.5200933  4.4753235  0.9825837
## [43]  3.4509357  3.2013603  0.7959810  0.3196211  3.1246001  1.3829942
## [49]  1.4571013  1.7056487  0.8871131

Converting a vector of distances from inches to cm

c <- c(50, 65, 70, 95)
c_cm <- c*2.54
c_cm
## [1] 127.0 165.1 177.8 241.3
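
When both operands are vectors of the same length, the operation pairs up entries by position, which is what the murder rate calculation relies on. A small sketch with made-up numbers (totals and populations are hypothetical values, not taken from the dataset):

```r
# Hypothetical values, for illustration only
totals <- c(10, 20, 30)
populations <- c(1000, 4000, 3000)
# Division pairs up entries by position: 10/1000, 20/4000, 30/3000
per_thousand <- totals / populations * 1000
per_thousand
## [1] 10  5 10
```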

     

1.3 Indexing, Data Wrangling and Plots

Indexing

Subsetting with logicals

Suppose we want to know which states have a murder rate lower than or equal to 0.71

rate071 <- murder_rate <= 0.71
murders$state[rate071]
## [1] "Hawaii"        "Iowa"          "New Hampshire" "North Dakota" 
## [5] "Vermont"

To count how many entries are TRUE, we can use sum: it returns the sum of the entries of a vector, and logical vectors get coerced to numeric with TRUE coded as 1 and FALSE as 0. Thus we can count the states using:

sum(rate071)
## [1] 5
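
The same coercion means that mean gives the proportion of TRUE entries. A quick sketch with made-up numbers (rates and low are hypothetical):

```r
# Hypothetical rates, for illustration only
rates <- c(0.5, 2.1, 0.7, 3.4)
low <- rates <= 0.71
sum(low)    # number of TRUE entries
## [1] 2
mean(low)   # proportion of TRUE entries
## [1] 0.5
```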

 

Logical operators

TRUE & TRUE
## [1] TRUE
TRUE & FALSE
## [1] FALSE
FALSE & FALSE
## [1] FALSE
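
Besides & (AND), R also provides | (OR) and ! (NOT). A short sketch of how they behave:

```r
TRUE | FALSE    # OR: TRUE if at least one side is TRUE
## [1] TRUE
FALSE | FALSE
## [1] FALSE
!TRUE           # NOT: flips the value
## [1] FALSE
```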

Creating the two logical vectors representing our conditions

west <- murders$region == "West"
safe <- murder_rate <= 1

Defining an index and identifying states with both conditions true

index <- safe & west
murders$state[index]
## [1] "Hawaii"  "Idaho"   "Oregon"  "Utah"    "Wyoming"

 

which

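The function which returns the indexes of the entries of a logical vector that are TRUE. We can use it to look up California's murder rate, for example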

index <- which(murders$state=="California")
murder_rate[index]
## [1] 3.374138

 

match

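The function match returns the indexes in the second vector that match each entry of the first. We can use it to find the murder rates for several states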

ind <- match(c("New York", "Florida", "Texas"), murders$state)
murder_rate[ind]
## [1] 2.667960 3.398069 3.201360

 

%in%

Use %in% to check whether each element of a first vector is in a second vector

c("Boston", "Dakota", "Washington") %in% murders$state
## [1] FALSE FALSE  TRUE
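
The difference between %in% and match shows up when an entry is absent: %in% answers TRUE or FALSE, while match returns a position, or NA if there is none. A minimal sketch with a made-up lookup vector (known and queries are hypothetical names):

```r
# Hypothetical lookup vector, for illustration only
known <- c("Alabama", "Alaska", "Washington")
queries <- c("Boston", "Dakota", "Washington")
queries %in% known      # is each query present?
## [1] FALSE FALSE  TRUE
match(queries, known)   # where is each query? NA when absent
## [1] NA NA  3
```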

 

Exercises

# Store the 5 abbreviations in `abbs`. (remember that they are character vectors)
abbs <- c("MA","ME","MI","MO","MU")

# Use the %in% operator to check if the entries of abbs are abbreviations in the murders data frame
abbs%in%murders$abb
## [1]  TRUE  TRUE  TRUE  TRUE FALSE
# Use the `which` function and the `!` operator to find the indexes of the abbreviations that are not part of the dataset and store them in `ind`
ind <- which(!abbs%in%murders$abb)

# Entries of `abbs` at the indexes in `ind`
abbs[ind]
## [1] "MU"

 

Data wrangling

# installing (only needed once) and loading the dplyr package
install.packages("dplyr")
library(dplyr)
library(dslabs)
data("murders")

Now we add a column with the rate. The mutate function knows to look for the variables total and population in the data frame murders

murders_with_rate <- mutate(murders,rate=total/population*100000)
str(murders_with_rate)
## 'data.frame':    51 obs. of  6 variables:
##  $ state     : chr  "Alabama" "Alaska" "Arizona" "Arkansas" ...
##  $ abb       : chr  "AL" "AK" "AZ" "AR" ...
##  $ region    : Factor w/ 4 levels "Northeast","South",..: 2 4 4 2 4 4 1 2 2 2 ...
##  $ population: num  4779736 710231 6392017 2915918 37253956 ...
##  $ total     : num  135 19 232 93 1257 ...
##  $ rate      : num  2.82 2.68 3.63 3.19 3.37 ...

We can use filter to show only the rows of the table that meet a condition, for example states with a murder rate lower than or equal to 0.71

filter(murders_with_rate,rate<=0.71)
##           state abb        region population total      rate
## 1        Hawaii  HI          West    1360301     7 0.5145920
## 2          Iowa  IA North Central    3046355    21 0.6893484
## 3 New Hampshire  NH     Northeast    1316470     5 0.3798036
## 4  North Dakota  ND North Central     672591     4 0.5947151
## 5       Vermont  VT     Northeast     625741     2 0.3196211

select lets us work with just some columns of a table. For example, we can create a new table with only state, region and rate, and then filter it with the previous condition

new_table <- select(murders_with_rate,state,region,rate)
filter(new_table,rate<=0.71)
##           state        region      rate
## 1        Hawaii          West 0.5145920
## 2          Iowa North Central 0.6893484
## 3 New Hampshire     Northeast 0.3798036
## 4  North Dakota North Central 0.5947151
## 5       Vermont     Northeast 0.3196211

We can go from the original table to select to filter in a single line of code using the pipe operator %>%: we take murders_with_rate, apply select, then apply filter

murders_with_rate %>% select(state,region,rate) %>% filter(rate<=0.71)
##           state        region      rate
## 1        Hawaii          West 0.5145920
## 2          Iowa North Central 0.6893484
## 3 New Hampshire     Northeast 0.3798036
## 4  North Dakota North Central 0.5947151
## 5       Vermont     Northeast 0.3196211

 

Creating data frames

Setting stringsAsFactors = FALSE when creating a data frame keeps strings as characters instead of converting them to factors. Converting strings to factors was the default before R 4.0.0; since R 4.0.0 the default is FALSE

grades <- data.frame(names = c("John", "Juan", "Jean", "Yao"), 
                     exam_1 = c(95, 80, 90, 85), 
                     exam_2 = c(90, 85, 85, 90),
                     stringsAsFactors = FALSE)
grades
##   names exam_1 exam_2
## 1  John     95     90
## 2  Juan     80     85
## 3  Jean     90     85
## 4   Yao     85     90
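
We can confirm how the names column is stored with class. The sketch below repeats the grades definition so that it runs on its own; grades_f is an illustrative name for the same names stored as a factor.

```r
grades <- data.frame(names = c("John", "Juan", "Jean", "Yao"),
                     exam_1 = c(95, 80, 90, 85),
                     exam_2 = c(90, 85, 85, 90),
                     stringsAsFactors = FALSE)
class(grades$names)
## [1] "character"
# With stringsAsFactors = TRUE the same column is stored as a factor
grades_f <- data.frame(names = c("John", "Juan", "Jean", "Yao"),
                       stringsAsFactors = TRUE)
class(grades_f$names)
## [1] "factor"
```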

Redefine murders to include a column named rank with the ranks of rate from highest to lowest. Passing -rate to rank makes the highest rate rank 1

rate <- murders$total/murders$population*10^5
murders <- mutate(murders,rank=rank(-rate))

head(murders)
##        state abb region population total rank
## 1    Alabama  AL  South    4779736   135   23
## 2     Alaska  AK   West     710231    19   27
## 3    Arizona  AZ   West    6392017   232   10
## 4   Arkansas  AR  South    2915918    93   17
## 5 California  CA   West   37253956  1257   14
## 6   Colorado  CO   West    5029196    65   38

Filter to show the top 5 states with the highest murder rates

filter(murders,rank<=5)
##                  state abb        region population total rank
## 1 District of Columbia  DC         South     601723    99    1
## 2            Louisiana  LA         South    4533372   351    2
## 3             Maryland  MD         South    5773552   293    4
## 4             Missouri  MO North Central    5988927   321    3
## 5       South Carolina  SC         South    4625364   207    5

Use filter to create a new data frame no_south

no_south <- filter(murders,region!="South")

Use nrow() to calculate the number of rows

nrow(no_south)
## [1] 34
head(no_south)
##         state abb    region population total rank
## 1      Alaska  AK      West     710231    19   27
## 2     Arizona  AZ      West    6392017   232   10
## 3  California  CA      West   37253956  1257   14
## 4    Colorado  CO      West    5029196    65   38
## 5 Connecticut  CT Northeast    3574097    97   25
## 6      Hawaii  HI      West    1360301     7   49

To filter a table with 2 conditions, for example region is South and total murders is greater than 300, we combine them with &

filter(murders,region=="South"&total>300)
##       state abb region population total rank
## 1   Florida  FL  South   19687653   669   13
## 2   Georgia  GA  South    9920000   376    9
## 3 Louisiana  LA  South    4533372   351    2
## 4     Texas  TX  South   25145561   805   16

Filter to states in the Northeast and West regions with a rate lower than 1, select state, rate and rank, and order by rank

murders <- murders %>% mutate(rate=total/population*10^5)

filter(murders, region %in% c("Northeast","West") & rate < 1) %>% select(state,rate,rank) %>% arrange(rank)
##           state      rate rank
## 1        Oregon 0.9396843   42
## 2       Wyoming 0.8871131   43
## 3         Maine 0.8280881   44
## 4          Utah 0.7959810   45
## 5         Idaho 0.7655102   46
## 6        Hawaii 0.5145920   49
## 7 New Hampshire 0.3798036   50
## 8       Vermont 0.3196211   51

 

Basic plots

plot

The plot function can be used to make scatterplots. Here is a plot of total murders versus population.

x <- murders$population / 10^6
y <- murders$total
plot(x, y)

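A similar plot (with population not rescaled to millions) can be made using with, which lets us refer to the columns of murders directly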

with(murders, plot(population, total))

 

hist

x <- with(murders, total/population*10^5)
hist(x)

 

boxplot

boxplot(rate~region, data = murders)

 

image

The image function displays the values in a matrix using color. Here is a quick example:

x <- matrix(1:120, 12, 10)
image(x)

     

1.4 Programming Basics

 

Conditional expressions

if-else statement

a <- 0

if(a!=0){
  print(1/a)
} else{
  print("No reciprocal for 0.")
}
## [1] "No reciprocal for 0."

ifelse function

The first argument is the condition, the second is the value returned where the condition is met, and the third is the value returned where it is not. ifelse works entry by entry on vectors

a <- c(0, 1, 2, -4, 5)
result <- ifelse(a > 0, 1/a, NA)
result
## [1]  NA 1.0 0.5  NA 0.2

Example of using ifelse to replace missing values (NAs) in a vector with zeros

data(na_example)
sum(is.na(na_example))
## [1] 145
no_nas <- ifelse(is.na(na_example), 0, na_example) 
sum(is.na(no_nas))
## [1] 0

The any() and all() functions evaluate logical vectors

# any() will return TRUE if any of the conditions is met
# all() will return TRUE if all conditions are met
z <- c(TRUE, TRUE, FALSE)
any(z)
## [1] TRUE
all(z)
## [1] FALSE
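
A common use of any and all is to summarize a check over a whole vector, for example whether it contains missing values. A small sketch with a made-up vector (v is hypothetical):

```r
# Hypothetical vector, for illustration only
v <- c(4, NA, 7)
any(is.na(v))   # is at least one entry missing?
## [1] TRUE
all(is.na(v))   # are all entries missing?
## [1] FALSE
```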

 

Functions

Example of defining a function to compute the average of a vector

x <- c(34,2,56,28)
avg <- function(x){   # define avg as a function of x
  s <- sum(x)         # s is the sum of all values
  n <- length(x)      # n is the number of values
  s/n                 # the last expression is the value returned
}
}
avg(x)
## [1] 30

s and n are not objects defined in the workspace; they only exist within the function
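
One way to see this is to define an object named s in the workspace and check that calling avg leaves it untouched. A minimal sketch, repeating the definition of avg so that it runs on its own:

```r
avg <- function(x) {
  s <- sum(x)
  n <- length(x)
  s / n
}
s <- 3                   # an object named s in the workspace
avg(c(34, 2, 56, 28))
## [1] 30
s                        # unchanged by the s created inside avg
## [1] 3
```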

 

for loops

Creating a function that computes the sum of integers 1 through n: 1 + 2 + 3 + … + n

sum_formula <- function(n) {
  x <- 1:n
  sum(x)
  }

sum_formula(100)
## [1] 5050

A very simple for-loop

for (i in 1:5) {
  print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5

We can use a for loop to create a vector that contains sum_formula(n) for the numbers 1 to 25

m <- 25

#then we create an empty vector where values will be stored

s_n <- vector(length=m)

#then we run the formula for m values

for (n in 1:m) {
  s_n[n] <- sum_formula(n)
}

# creating a plot for our summation function
n <- 1:m
plot(n, s_n)

# a table of values comparing our function to the summation formula
head(data.frame(s_n = s_n, formula = n*(n+1)/2))
##   s_n formula
## 1   1       1
## 2   3       3
## 3   6       6
## 4  10      10
## 5  15      15
## 6  21      21
# overlaying our function with the summation formula
plot(n, s_n)
lines(n, n*(n+1)/2)
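
As an alternative to the explicit loop, the base R function sapply applies a function to each element of a vector and collects the results. This is a sketch of that alternative, not a replacement for the loop above; s_n_apply is an illustrative name, and sum_formula is repeated so that the block runs on its own.

```r
sum_formula <- function(n) {
  x <- 1:n
  sum(x)
}
m <- 25
# sapply() calls sum_formula on each element of 1:m and collects the results
s_n_apply <- sapply(1:m, sum_formula)
head(s_n_apply)
## [1]  1  3  6 10 15 21
```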