Create objects and store values
a <- 3
b <- 4
c <- a+b
Return stored values
a
## [1] 3
With the print function
print(c)
## [1] 7
We can use pre-built functions like install.packages, sqrt or ls
log(8)
## [1] 2.079442
log(a)
## [1] 1.098612
Call help files with the help function help(log) or ?log
To know the arguments needed in a function without using the help docs:
args(log)
## function (x, base = exp(1))
## NULL
We can change the default values by assigning another object
log(8,base=2)
## [1] 3
If no argument name is included R will understand it in the same order as the args() specs
log(8,2)
## [1] 3
To solve an equation like 3x^2 + 2x -1
a <- 3
b <- 2
c <- -1
(-b + sqrt(b^2 - 4*a*c)) / (2*a)
(-b - sqrt(b^2 - 4*a*c)) / (2*a)
## [1] 0.3333333
## [1] -1
Function class help us determine what type we have
a <- 3
class(a)
## [1] "numeric"
Conceptually, we can think of a data frame as a table with rows representing observations and the different variables reported for each observation defining the columns. Data frames are particularly useful for datasets because we can combine different data types into one object.
library(dslabs)
data("murders")
class(murders)
## [1] "data.frame"
We can use the function str() to know more about the structure of an object. In the example below we can see it has 51 rows and 5 variables, with the data type of each. E.g. state is character and region is Factor
str(murders)
## 'data.frame': 51 obs. of 5 variables:
## $ state : chr "Alabama" "Alaska" "Arizona" "Arkansas" ...
## $ abb : chr "AL" "AK" "AZ" "AR" ...
## $ region : Factor w/ 4 levels "Northeast","South",..: 2 4 4 2 4 4 1 2 2 2 ...
## $ population: num 4779736 710231 6392017 2915918 37253956 ...
## $ total : num 135 19 232 93 1257 ...
To inspect the first 6 lines we can use the function head()
head(murders)
## state abb region population total
## 1 Alabama AL South 4779736 135
## 2 Alaska AK West 710231 19
## 3 Arizona AZ West 6392017 232
## 4 Arkansas AR South 2915918 93
## 5 California CA West 37253956 1257
## 6 Colorado CO West 5029196 65
$We can use the accessor to access variables within a data frame
murders$population
## [1] 4779736 710231 6392017 2915918 37253956 5029196 3574097 897934
## [9] 601723 19687653 9920000 1360301 1567582 12830632 6483802 3046355
## [17] 2853118 4339367 4533372 1328361 5773552 6547629 9883640 5303925
## [25] 2967297 5988927 989415 1826341 2700551 1316470 8791894 2059179
## [33] 19378102 9535483 672591 11536504 3751351 3831074 12702379 1052567
## [41] 4625364 814180 6346105 25145561 2763885 625741 8001024 6724540
## [49] 1852994 5686986 563626
We can quickly access the variable names with the function names()
names(murders)
## [1] "state" "abb" "region" "population" "total"
The object murders$population is not one number but several. We call these types of objects vectors. A single number is technically a vector of length 1, but in general we use the term vectors to refer to objects with several entries. The function length tells you how many entries are in the vector:
pop <- murders$population
length(pop)
## [1] 51
We can inspect the vector class
class(murders$population)
## [1] "numeric"
Vectors of class character store character strings
class(murders$state)
## [1] "character"
Logical vectors must be either true or false
z <- 3==2
z
## [1] FALSE
region is a factor in the murders dataset.
class(murders$region)
## [1] "factor"
Factors store categorical data, in this case there are 4 levels
levels(murders$region)
## [1] "Northeast" "South" "North Central" "West"
We can use reorder to change the order of labels, e.g. by total murders.
region <- murders$region
value <- murders$total
region <- reorder(region, value, FUN = sum)
levels(region)
## [1] "Northeast" "North Central" "West" "South"
Matrices are similar to data frames in that they are two-dimensional: they have rows and columns. However, like numeric, character and logical vectors, entries in matrices have to be all the same type.
We can use the function matrix with arguments (values, number of rows, number of columns)
mat <- matrix(1:12,4,3)
mat
## [,1] [,2] [,3]
## [1,] 1 5 9
## [2,] 2 6 10
## [3,] 3 7 11
## [4,] 4 8 12
To access specific entries in a matrix
mat[2,3]
## [1] 10
If you want the entire second row
mat[2,]
## [1] 2 6 10
Similar procedure if we want the entire first column
mat[,1]
## [1] 1 2 3 4
To access a subset of the matrix
mat[3:4,2:3]
## [,1] [,2]
## [1,] 7 11
## [2,] 8 12
We can also use brackets to access subsets of data frames
data("murders")
murders[25,1]
## [1] "Mississippi"
Using murders[1:6,] would be the same as using head(murders)
murders[1:6,]
## state abb region population total
## 1 Alabama AL South 4779736 135
## 2 Alaska AK West 710231 19
## 3 Arizona AZ West 6392017 232
## 4 Arkansas AR South 2915918 93
## 5 California CA West 37253956 1257
## 6 Colorado CO West 5029196 65
  Â
Most basic objects available to store data are vectors
We use concatenate c to create vectors
codes <- c(380, 124, 818)
codes
## [1] 380 124 818
countries <- c("USA","Canada","Mexico")
countries
## [1] "USA" "Canada" "Mexico"
Â
We can name the entries of a vector
country_codes <- c(USA=43, Canada=4, Mexico=33)
country_codes
## USA Canada Mexico
## 43 4 33
class(country_codes)
## [1] "numeric"
names(country_codes)
## [1] "USA" "Canada" "Mexico"
Â
Will create vectors
ten_numbers <- seq(1,10)
ten_numbers
## [1] 1 2 3 4 5 6 7 8 9 10
cerofive <- seq(-10,10,0.5)
cerofive
## [1] -10.0 -9.5 -9.0 -8.5 -8.0 -7.5 -7.0 -6.5 -6.0 -5.5 -5.0 -4.5
## [13] -4.0 -3.5 -3.0 -2.5 -2.0 -1.5 -1.0 -0.5 0.0 0.5 1.0 1.5
## [25] 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5
## [37] 8.0 8.5 9.0 9.5 10.0
class(cerofive)
## [1] "numeric"
class(1:10)
## [1] "integer"
Â
To access specific values of a vector
cerofive[3]
## [1] -9
cerofive[c(3,4)]
## [1] -9.0 -8.5
Â
Flexibility of R with data types
mix <- c(3, "Canada", TRUE)
mix
## [1] "3" "Canada" "TRUE"
class(mix)
## [1] "character"
We can convert between data types
numbers <- c(1,3)
str(numbers)
## num [1:2] 1 3
numchar <- as.character(numbers)
str(numchar)
## chr [1:2] "1" "3"
Warning R can return NAs
string <- c("1", "Canada", "45")
as.numeric(string)
## Warning: NAs introduced by coercion
## [1] 1 NA 45
Â
sortSay we want to rank the states from least to most gun murders. The function sort sorts a vector in increasing order
library(dslabs)
data("murders")
sort(murders$total)
## [1] 2 4 5 5 7 8 11 12 12 16 19 21 22 27 32
## [16] 36 38 53 63 65 67 84 93 93 97 97 99 111 116 118
## [31] 120 135 142 207 219 232 246 250 286 293 310 321 351 364 376
## [46] 413 457 517 669 805 1257
Â
orderIt takes a vector as input and returns the vector of indexes that sorts the input vector.
x <- c(31, 4, 15, 92, 65)
sort(x)
## [1] 4 15 31 65 92
Rather than sort the input vector, the function order returns the index that sorts input vector:
index <- order(x)
x[index]
## [1] 4 15 31 65 92
We can order the states names by total murders. According to R California had the most murders
index <- order(murders$total)
murders$abb[index]
## [1] "VT" "ND" "NH" "WY" "HI" "SD" "ME" "ID" "MT" "RI" "AK" "IA" "UT" "WV" "NE"
## [16] "OR" "DE" "MN" "KS" "CO" "NM" "NV" "AR" "WA" "CT" "WI" "DC" "OK" "KY" "MA"
## [31] "MS" "AL" "IN" "SC" "TN" "AZ" "NJ" "VA" "NC" "MD" "OH" "MO" "LA" "IL" "GA"
## [46] "MI" "PA" "NY" "FL" "TX" "CA"
Â
max and which.maxEntry with the largest value
max(murders$total)
## [1] 1257
which.max will return the index of the entry with largest value
which.max(murders$total)
## [1] 5
to return the state name
murders$state[which.max(murders$total)]
## [1] "California"
or
maxmurder <- which.max(murders$total)
murders$state[maxmurder]
## [1] "California"
Â
rankReturns a vector with the rank of the first entry, second entry, etc
rank(x)
## [1] 3 1 2 5 4
Â
rank, sort and order
## original order sort rank
## 1 31 2 4 3
## 2 4 3 15 1
## 3 15 1 31 2
## 4 92 5 65 5
## 5 65 4 92 4
Â
We can apply arithmetics to vectors
murder_rate <- murders$total/murders$population*10^5
murder_rate
## [1] 2.8244238 2.6751860 3.6295273 3.1893901 3.3741383 1.2924531
## [7] 2.7139722 4.2319369 16.4527532 3.3980688 3.7903226 0.5145920
## [13] 0.7655102 2.8369608 2.1900730 0.6893484 2.2081106 2.6732010
## [19] 7.7425810 0.8280881 5.0748655 1.8021791 4.1786225 0.9992600
## [25] 4.0440846 5.3598917 1.2128379 1.7521372 3.1104763 0.3798036
## [31] 2.7980319 3.2537239 2.6679599 2.9993237 0.5947151 2.6871225
## [37] 2.9589340 0.9396843 3.5977513 1.5200933 4.4753235 0.9825837
## [43] 3.4509357 3.2013603 0.7959810 0.3196211 3.1246001 1.3829942
## [49] 1.4571013 1.7056487 0.8871131
Converting a vector of distances from inches to cm
c <- c(50, 65, 70, 95)
c_cm <- c*2.54
c_cm
## [1] 127.0 165.1 177.8 241.3
  Â
If we wanted to know states with murder rate lower or equal to 0.71
rate071 <- murder_rate <= 0.71
murders$state[rate071]
## [1] "Hawaii" "Iowa" "New Hampshire" "North Dakota"
## [5] "Vermont"
In order to count how many are TRUE, the function sum returns the sum of the entries of a vector and logical vectors get coerced to numeric with TRUE coded as 1 and FALSE as 0. Thus we can count the states using:
sum(rate071)
## [1] 5
Â
TRUE & TRUE
## [1] TRUE
TRUE & FALSE
## [1] FALSE
FALSE & FALSE
## [1] FALSE
Creating the two logical vectors representing our conditions
west <- murders$region == "West"
safe <- murder_rate <= 1
Defining an index and identifying states with both conditions true
index <- safe & west
murders$state[index]
## [1] "Hawaii" "Idaho" "Oregon" "Utah" "Wyoming"
Â
whichTo look up California’s murder rate for example
index <- which(murders$state=="California")
murder_rate[index]
## [1] 3.374138
Â
matchFind out the murder rates for several states
ind <- match(c("New York", "Florida", "Texas"), murders$state)
murder_rate[ind]
## [1] 2.667960 3.398069 3.201360
Â
%in%Check if a vector is within another vector with %in%
c("Boston", "Dakota", "Washington") %in% murders$state
## [1] FALSE FALSE TRUE
Â
# Store the 5 abbreviations in `abbs`. (remember that they are character vectors)
abbs <- c("MA","ME","MI","MO","MU")
# Use the %in% command to check if the entries of abbs are abbreviations in the the murders data frame
abbs%in%murders$abb
## [1] TRUE TRUE TRUE TRUE FALSE
# Use the `which` command and `!` operator to find out which index abbreviations are not actually part of the dataset and store in `ind`
ind <- which(!abbs%in%murders$abb)
# Names of abbreviations in `ind`
abbs[ind]
## [1] "MU"
Â
# installing and loading the dplyr package
install.packages("dplyr")
library(dplyr)
library(dslabs)
data("murders")
Now we are adding a column with rate. The mutate function knows how to look for the variables total and population that belong to the data frame murders
murders_with_rate <- mutate(murders,rate=total/population*100000)
str(murders_with_rate)
## 'data.frame': 51 obs. of 6 variables:
## $ state : chr "Alabama" "Alaska" "Arizona" "Arkansas" ...
## $ abb : chr "AL" "AK" "AZ" "AR" ...
## $ region : Factor w/ 4 levels "Northeast","South",..: 2 4 4 2 4 4 1 2 2 2 ...
## $ population: num 4779736 710231 6392017 2915918 37253956 ...
## $ total : num 135 19 232 93 1257 ...
## $ rate : num 2.82 2.68 3.63 3.19 3.37 ...
We can filter and show only rows of the table that meet a condition, for example states with murder rate equal or lower than 0.71
filter(murders_with_rate,rate<=0.71)
## state abb region population total rate
## 1 Hawaii HI West 1360301 7 0.5145920
## 2 Iowa IA North Central 3046355 21 0.6893484
## 3 New Hampshire NH Northeast 1316470 5 0.3798036
## 4 North Dakota ND North Central 672591 4 0.5947151
## 5 Vermont VT Northeast 625741 2 0.3196211
select allows to work with just some columns from a table, for example we can create a new table that shows only state, region and rate, that meet the previous condition
new_table <- select(murders_with_rate,state,region,rate)
filter(new_table,rate<=0.71)
## state region rate
## 1 Hawaii West 0.5145920
## 2 Iowa North Central 0.6893484
## 3 New Hampshire Northeast 0.3798036
## 4 North Dakota North Central 0.5947151
## 5 Vermont Northeast 0.3196211
We can do the above from the original table > select > filter with only line of code using the pipe function.we take the original murders > apply select > apply filter
murders_with_rate %>% select(state,region,rate) %>% filter(rate<=0.71)
## state region rate
## 1 Hawaii West 0.5145920
## 2 Iowa North Central 0.6893484
## 3 New Hampshire Northeast 0.3798036
## 4 North Dakota North Central 0.5947151
## 5 Vermont Northeast 0.3196211
Â
Creating a data frame with stringAsFactors = FALSE, will keep names as characters instead of defining them as factors, which is the default of strings when creating data frames
grades <- data.frame(names = c("John", "Juan", "Jean", "Yao"),
exam_1 = c(95, 80, 90, 85),
exam_2 = c(90, 85, 85, 90),
stringsAsFactors = FALSE)
grades
## names exam_1 exam_2
## 1 John 95 90
## 2 Juan 80 85
## 3 Jean 90 85
## 4 Yao 85 90
Redefine murders to include a column named rank with the ranks of rate from highest to lowest
rate <- murders$total/murders$population/10^5
murders <- mutate(murders,rank=rank(-rate))
head(murders)
## state abb region population total rank
## 1 Alabama AL South 4779736 135 23
## 2 Alaska AK West 710231 19 27
## 3 Arizona AZ West 6392017 232 10
## 4 Arkansas AR South 2915918 93 17
## 5 California CA West 37253956 1257 14
## 6 Colorado CO West 5029196 65 38
Filter to show the top 5 states with the highest murder rates
filter(murders,rank<=5)
## state abb region population total rank
## 1 District of Columbia DC South 601723 99 1
## 2 Louisiana LA South 4533372 351 2
## 3 Maryland MD South 5773552 293 4
## 4 Missouri MO North Central 5988927 321 3
## 5 South Carolina SC South 4625364 207 5
Use filter to create a new data frame no_south
no_south <- filter(murders,region!="South")
Use nrow() to calculate the number of rows
nrow(no_south)
## [1] 34
head(no_south)
## state abb region population total rank
## 1 Alaska AK West 710231 19 27
## 2 Arizona AZ West 6392017 232 10
## 3 California CA West 37253956 1257 14
## 4 Colorado CO West 5029196 65 38
## 5 Connecticut CT Northeast 3574097 97 25
## 6 Hawaii HI West 1360301 7 49
If we want to filter table with 2 conditions, example region is South and total murders greater than 300
filter(murders,region=="South"&total>300)
## state abb region population total rank
## 1 Florida FL South 19687653 669 13
## 2 Georgia GA South 9920000 376 9
## 3 Louisiana LA South 4533372 351 2
## 4 Texas TX South 25145561 805 16
Filter regions in NE and W, with rate lower than 1, and select state, rate, rank, order by rank
murders <- murders %>% mutate(rate=total/population*10^5)
filter(murders, region %in% c("Northeast","West") & rate < 1) %>% select(state,rate,rank) %>% arrange(rank)
## state rate rank
## 1 Oregon 0.9396843 42
## 2 Wyoming 0.8871131 43
## 3 Maine 0.8280881 44
## 4 Utah 0.7959810 45
## 5 Idaho 0.7655102 46
## 6 Hawaii 0.5145920 49
## 7 New Hampshire 0.3798036 50
## 8 Vermont 0.3196211 51
Â
plotThe plot function can be used to make scatterplots. Here is a plot of total murders versus population.
x <- murders$population / 10^6
y <- murders$total
plot(x, y)
It can also be plotted using with
with(murders, plot(population, total))
Â
histx <- with(murders, total/population*10^5)
hist(x)
Â
boxplotboxplot(rate~region, data = murders)
Â
imageThe image function displays the values in a matrix using color. Here is a quick example:
x <- matrix(1:120, 12, 10)
image(x)
  Â
Â
ifstatementa <- 0
if(a!=0){
print(1/a)
} else{
print("No reciprocal for 0.")
}
## [1] "No reciprocal for 0."
ifelsestatementFirst argument is the condition, second argument execute if met, third argument if condition is not met
a <- c(0, 1, 2, -4, 5)
result <- ifelse(a > 0, 1/a, NA)
result
## [1] NA 1.0 0.5 NA 0.2
Example of a function to replace missing spaces in vector (NAs) with ceros
data(na_example)
sum(is.na(na_example))
## [1] 145
no_nas <- ifelse(is.na(na_example), 0, na_example)
sum(is.na(no_nas))
## [1] 0
The any() and all() functions evaluate logical vectors
# any() will return TRUE if any of the conditions is met
# all() will return TRUE if all conditions are met
z <- c(TRUE, TRUE, FALSE)
any(z)
## [1] TRUE
all(z)
## [1] FALSE
Â
Example of defining a function to compute the average of a vector
x <- c(34,2,56,28)
avg <- function(x){ #define avg as a function
s <- sum(x) # s is the sum of all values
n <- length(x) # n is the length
s/n # last line is what it will return
}
avg(x)
## [1] 30
s and n are not objects defined in the workspace, they only exist within the function
Â
for loopsCreating a function that computes the sum of integers 1 through n: 1 + 2 + 3 + … + n
sum_formula <- function(n) {
x <- 1:n
sum(x)
}
sum_formula(100)
## [1] 5050
A very simple for-loop
sum_formula <- function(n) {
x <- 1:n
sum(x)
}
sum_formula(100)
## [1] 5050
We can use a for loop to create a vector that contains the sum_formula for the the numbers 1 to 25
m <- 25
#then we create an empty vector where values will be stored
s_n <- vector(length=m)
#then we run the formula for m values
for (n in 1:m) {
s_n[n] <- sum_formula(n)
}
# creating a plot for our summation function
n <- 1:m
plot(n, s_n)
# a table of values comparing our function to the summation formula
head(data.frame(s_n = s_n, formula = n*(n+1)/2))
## s_n formula
## 1 1 1
## 2 3 3
## 3 6 6
## 4 10 10
## 5 15 15
## 6 21 21
# overlaying our function with the summation formula
plot(n, s_n)
lines(n, n*(n+1)/2)