This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document.
R is a free, open source software environment for statistical computing and graphics.
Download R: https://www.r-project.org/
R Studio is an Integrated Development Environment, commonly known as an IDE, that aims to make R easier to use by including a code editor, debugging, and visualization tools.
R Studio allows users to engage with R in a user-friendly environment.
Download R Studio: https://www.rstudio.com/
R is open source software created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and was released in the early 1990s.
Before R’s creation, the S language was developed by John Chambers at Bell Laboratories (then part of AT&T) in New Jersey during the 1970s.
R is very similar in appearance to S; the main difference is that R is open source, while S and its commercial product S-PLUS are not.
Open source software is computer software whose source code is released under a license, often the GNU General Public License (GPL), which grants users the rights to study, change, and distribute the software to anyone for any purpose.
This allows for wider participation in the development and maintenance of R.
And most importantly it makes R free!
By default the R prompt that indicates R is ready and awaiting a command is the “>” symbol.
To avoid confusion with the mathematical symbol for greater than “>” you may modify this.
options(prompt = "R> ")
Within R the # sign allows you to annotate comments.
Anything that comes after the hash mark will be ignored by the interpreter as a comment.
# This is a comment in R
1 + 4 #Comments also work after valid commands
## [1] 5
Many packages are included with R, but thousands more are not.
To use them in R you must first download and install them.
Loading packages (also referred to as libraries) is done by calling library.
#Install the popular data science package "tidyverse".
install.packages("tidyverse")
#Load (call) the package "tidyverse".
library(tidyverse)
#an advanced option to install only if necessary and call the library with two lines of code
if (!require(tidyverse)) install.packages("tidyverse")
library(tidyverse)
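A slightly more defensive variant of the install-if-needed pattern is sketched below. The helper name ensure_package is hypothetical (it is not part of R or any package); it relies on requireNamespace(), which reports whether a package is available without attaching it. The call at the end uses "stats", which ships with R, so nothing is downloaded.

```r
# Hypothetical helper: install a package only when it is missing, then attach it
ensure_package <- function(pkg) {
  # requireNamespace() returns TRUE or FALSE instead of stopping with an error
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # character.only = TRUE lets library() accept the name stored in 'pkg'
  library(pkg, character.only = TRUE)
  invisible(TRUE)
}

# "stats" is bundled with R, so this call never triggers an installation
ensure_package("stats")
```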
After you have installed and loaded your desired packages, you will need to provide them with maintenance.
This maintenance means updating your packages to fix bugs and improve functionality.
#Update the popular data science package "tidyverse"
update.packages(oldPkgs = "tidyverse")
Asking for help is easy within R, because R is built with a suite of help files.
These help files come in handy and can be used to search for functionality, check a function's use, or look up its arguments.
Let us assume that we have forgotten the description, usage, and arguments for the arithmetic mean.
?mean
If you are unsure of the precise name of the desired function, you can search the documentation across all installed packages by passing a character string (text in double quotes) to help.search, or you can use “??”.
help.search("mean")
??"mean"
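Two other base R helpers can complement the help files when you only remember part of a name. The sketch below is illustrative; the variable name mean_like is an assumption made for this example.

```r
# apropos() lists objects on the search path whose names contain the pattern
mean_like <- apropos("mean")
mean_like

# args() shows a function's arguments without opening its help page
args(mean)
```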
Saving and exiting R is uncomplicated.
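You can save your workspace as a .RData file, for example by calling save.image() at the prompt (in R Studio, Ctrl + S saves the current script rather than the workspace).
The .RData file will allow you to pick up where you left off. It will include all objects you have created and stored (in other words, assigned) within the session.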
At any point during an R session you can execute the command ls() at the prompt, which will then list all objects, variables, and user-defined functions currently present in the active workspace.
ls()
Note: The quickest way to exit the software is to enter q() at the prompt.
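The save and restore cycle described above can be sketched with save() and load(). The object names and the temporary file path are assumptions chosen for this example; save.image() would write the entire workspace instead of selected objects.

```r
# Create two objects, then save only those to a temporary .RData file
scores <- c(90, 85, 77)
label <- "midterm"
rdata_path <- tempfile(fileext = ".RData")  # hypothetical file location
save(scores, label, file = rdata_path)

# Remove the objects, then restore them from the file
rm(scores, label)
load(rdata_path)
ls()
```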
Let's begin with some common operations and functionality using the prompt in R Studio.
R is a reliable calculator so below we will cover basic arithmetic operations (+, -, *, / ), logarithms, exponentials, and e-notation.
# Note that print
2 + 5
## [1] 7
10 + 2 * 5
## [1] 20
10 + (2 * 5)
## [1] 20
14 / 5
## [1] 2.8
14 / 5 + 2
## [1] 4.8
14 / (5 + 2)
## [1] 2
Logarithms are useful for a host of reasons, including log transformations, which allow you to rescale numbers according to the logarithm.
If given a number x and a value referred to as the base, the logarithm calculates the power to which you must raise the base to get to x.
Using R you can complete a log transformation using the log() function.
Also worthy of consideration:
Both x and the base must be positive.
The log of any number x when the base is equal to x is 1.
The log of x = 1 is always 0, regardless of the base.
log(x = 243, base = 3)
## [1] 5
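The properties listed above can be checked directly. This is a small sketch; the variable names are illustrative, and the last line uses the change-of-base identity, which gives the same value as log(x = 243, base = 3).

```r
same_base <- log(x = 7, base = 7)            # base equals x, so the result is 1
log_of_one <- log(x = 1, base = 10)          # the log of 1 is 0 for any base
change_of_base <- log(x = 243) / log(x = 3)  # same value as log(243, base = 3)

same_base
log_of_one
change_of_base
```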
There is a distinct log transformation called the natural log.
The natural log fixes the base of the log at a particular mathematical number: Euler's number. Euler's number is more commonly written as e.
e is a famous irrational number and its first few digits are 2.7182818284590452353602874713527.
Euler's number allows users to make use of the exponential function, defined as e raised to the power of x, where x can be any number (negative, zero, or positive).
The exponential function, $f(x) = e^x$, is often written as exp(x).
The R command for the exponential function is exp:
exp(x = 3)
## [1] 20.08554
#The default behavior of the log is to assume the natural log:
log(x = 20.08554)
## [1] 3
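Because log() with its default base and exp() are inverses, each undoes the other. A brief sketch (the variable names are illustrative):

```r
e <- exp(x = 1)                    # Euler's number, about 2.718282
round_trip <- log(x = exp(x = 3))  # log() undoes exp(), giving 3
back_again <- exp(x = log(x = 10)) # exp() undoes log(), giving 10 up to rounding

e
round_trip
back_again
print(e, digits = 15)              # show more digits than the default 7
```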
When R prints large or small numbers beyond its default setting of 7 significant digits, the numbers are displayed using scientific e-notation.
E-notation is common in most programming languages and makes it significantly easier to interpret extreme values.
In e-notation, any number can be expressed as xey, which is equivalent to $x \times 10^y$.
Let's consider 2,342,151,012,900.
2.3421510129e12 is equivalent to $2.3421510129 \times 10^{12}$
234.21510129e10 is equivalent to $234.21510129 \times 10^{10}$
For a positive power +y the e-notation can be interpreted as “move the decimal point y positions to the right.”
For a negative power -y the e-notation can be interpreted as “move the decimal point y positions to the left.”
#Note that regardless of the output R displays, any extra digits are stored by R and included in any calculations, even if R has hidden the digits.
2342151012900
## [1] 2.342151e+12
0.0000002533
## [1] 2.533e-07
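If you would rather see the full digits, format() with scientific = FALSE is one option. The sketch below also checks that the two e-notation spellings from the text describe the same number; the variable names are illustrative.

```r
big <- 2342151012900
big_text <- format(big, scientific = FALSE)  # character string without e-notation
big_text

# Both spellings represent the same value (compared with a small tolerance)
same_number <- isTRUE(all.equal(2.3421510129e12, 234.21510129e10))
same_number

small <- 2.533e-07
small_text <- format(small, scientific = FALSE)
small_text
```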
So far we have performed simple mathematical scripts and printed them to the console.
However, if we want to save the results and perform further operations, we need to assign the result of a computation to an object in the current workspace.
You can specify an assignment in R in two ways:
Using arrow notation (<-); and
x <- 5
x
## [1] 5
Using a single equal sign (=).
x = 10
x
## [1] 10
x = x + 1
this_number = 100
this_number
## [1] 100
this_number - x
## [1] 89
y = this_number - x
y
## [1] 89
#remember that at any point during an R session you can execute the command ls() at the prompt, which will then list all objects, variables, and user-defined functions currently present in the active workspace.
ls()
## [1] "this_number" "x" "y"
For tidiness, it is important to be consistent in your code.
Choose either the equal sign (=) or the arrow notation (<-) and stick with it.
For those users of R whose goal is to write production level code, Google’s [R Style Guide](https://google.github.io/styleguide/Rguide.xml#assignment) suggests that R users use arrow notation (<-) and shy away from using the equal sign (=).
x <- 3^2 * 4^(1/8)
x / 2.33
## [1] 4.593504
y <- -8.2 * 10^-13
x * y
## [1] -8.776349e-12
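One practical difference between the two notations appears inside function calls, where = names an argument rather than creating an object. A minimal sketch, using illustrative names:

```r
# Inside a function call, = names an argument; it does not create an object
avg_1 <- mean(x = c(2, 4, 6))

# <- inside a call performs an assignment first, then passes the value on
avg_2 <- mean(vals <- c(2, 4, 6))

avg_1
avg_2
vals  # 'vals' now exists in the workspace as a side effect
```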
A vector is an essential building block for handling multiple items in R.
Extremely complicated data structures may consist of several vectors.
Numerically, you should think of a vector as a collection of observations or measurements concerning a single variable, for example, the number of purchases you make each day or the heights of every NBA basketball player.
The function for creating a vector is the single letter c, with the desired entries in parentheses, separated by commas.
vector_1 <- c(1, 3, 4, 9)
vector_1
## [1] 1 3 4 9
Vector entries can also include calculations and previously stored items (including other vectors).
z <- 10
vector_2 <- c(3, 6 / 3, 2*3, 1e+03, 3^3, 5 + (45 + 5), 5 + (15-5) / z, z)
vector_2
## [1] 3 2 6 1000 27 55 6 10
Let's create vector_3, which will contain the entries of vector_1 and vector_2 appended together, in that order.
vector_3 <- c(vector_1, vector_2)
vector_3
## [1] 1 3 4 9 3 2 6 1000 27 55 6 10
Let's work through the basics of sequencing, which will come in handy as you become more advanced with R and begin writing loops and plotting data.
The colon operator (:) offers the easiest way to create a sequence.
Numeric sequences are separated by intervals of 1 in R’s default settings.
As an example, 1:10 should be read as “from 1 to 10 (by an interval of 1)”
1:10
## [1] 1 2 3 4 5 6 7 8 9 10
15:50
## [1] 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37
## [24] 38 39 40 41 42 43 44 45 46 47 48 49 50
-10:10
## [1] -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6
## [18] 7 8 9 10
Sequences also work with previously stored values or strictly parenthesized calculations.
jim <- 10
fred <- jim:(100 + 25)
fred
## [1] 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
## [18] 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43
## [35] 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
## [52] 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77
## [69] 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94
## [86] 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111
## [103] 112 113 114 115 116 117 118 119 120 121 122 123 124 125
While sequences are by default separated by intervals of 1, R will allow you to determine your desired interval by adding a from, to, and by value to the seq() function.
It should be noted that these sequences will always begin with the from number but may not always end with the to number.
For example, if you start at an even number, step by 2, and set to to an odd number, the sequence stops at the last even number below to, and to itself is not included.
seq(from = 0,to = 10,by = 2)
## [1] 0 2 4 6 8 10
seq(from = 100,to = 150,by = 5)
## [1] 100 105 110 115 120 125 130 135 140 145 150
seq(from = 0,to = 31,by = 2)
## [1] 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30
length.out is an alternative to the by value that allows you to specifically generate a vector with that many numbers, evenly spaced between from and to values.
By setting length.out to 10 you make the program print exactly 10 evenly spaced numbers from 1 to 50.
seq(from = 1, to = 50, length.out = 10)
## [1] 1.000000 6.444444 11.888889 17.333333 22.777778 28.222222 33.666667
## [8] 39.111111 44.555556 50.000000
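The spacing that length.out produces can be worked out by hand as (to - from) / (length.out - 1). The sketch below, with illustrative variable names, confirms that passing this step as by gives the same sequence, and that a by sequence never passes its to value.

```r
from_val <- 1
to_val <- 50
n <- 10

step <- (to_val - from_val) / (n - 1)  # 49 / 9, about 5.444444
by_length <- seq(from = from_val, to = to_val, length.out = n)
by_step <- seq(from = from_val, to = to_val, by = step)
all.equal(by_length, by_step)

# With by, the final value never passes 'to': this ends at 30, not 31
last_value <- tail(seq(from = 0, to = 31, by = 2), n = 1)
last_value
```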
For decreasing sequences, the by value must be negative.
jim
## [1] 10
seq(from = jim, to = (-50 + 10), by = -2.5)
## [1] 10.0 7.5 5.0 2.5 0.0 -2.5 -5.0 -7.5 -10.0 -12.5 -15.0
## [12] -17.5 -20.0 -22.5 -25.0 -27.5 -30.0 -32.5 -35.0 -37.5 -40.0
Let's use length.out to create a decreasing sequence.
By setting length.out to 20 you make the program print exactly 20 evenly spaced numbers between this_number and -100 + -200.
this_number
## [1] 100
seq(from = this_number, to = -100 + -200, length.out = 20)
## [1] 100.000000 78.947368 57.894737 36.842105 15.789474
## [6] -5.263158 -26.315789 -47.368421 -68.421053 -89.473684
## [11] -110.526316 -131.578947 -152.631579 -173.684211 -194.736842
## [16] -215.789474 -236.842105 -257.894737 -278.947368 -300.000000
Repetition lets you repeat a given value or vector.
The rep function takes a single value or a vector of values as its argument x, along with optional values for the arguments times and each.
The value of times provides the number of times to repeat x.
The value of each provides the number of times to repeat each element of x.
rep(x = 1, times = 4)
## [1] 1 1 1 1
rep(x = c(1, 2, 3, 4), times = 3)
## [1] 1 2 3 4 1 2 3 4 1 2 3 4
rep(x = c(1, 2, 3, 4), each = 2)
## [1] 1 1 2 2 3 3 4 4
rep(x = c(1, 2, 3, 4), times = 3, each = 2)
## [1] 1 1 2 2 3 3 4 4 1 1 2 2 3 3 4 4 1 1 2 2 3 3 4 4
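When times and each are combined, each is applied first and the result is then repeated times times, so the output length is length(x) * each * times. The sketch below also shows that times may be a vector giving a separate count per element; the variable names are illustrative.

```r
x <- c(1, 2, 3, 4)
combined <- rep(x = x, times = 3, each = 2)
length(combined)  # 4 * 2 * 3 = 24

# times can be a vector with one count for each element of x
uneven <- rep(x = c("a", "b"), times = c(3, 1))
uneven
```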
Another useful operation in R is sorting a vector in increasing or decreasing order of its elements.
The sort() function helps to do this.
You will encounter logical values in the examples below.
Logical values can be only one of two specific, case-sensitive values: TRUE or FALSE.
Typically, logicals indicate whether a certain condition is satisfied, and they are an integral part of all programming languages.
sort(x = c(4, 1, 2, 3), decreasing = FALSE)
## [1] 1 2 3 4
sort(x = c(-100, 32, 99, 12, -324, 87, 64, 1, -7), decreasing = TRUE)
## [1] 99 87 64 32 12 1 -7 -100 -324
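A related base R function, order(), returns the positions that would sort a vector rather than the sorted values themselves. It is not covered in the text above; the sketch is included only to show how it relates to sort().

```r
values <- c(4, 1, 2, 3)

positions <- order(values)  # where each sorted value sits in the original vector
positions
values[positions]           # same result as sort(x = values)

descending <- rev(sort(x = values))  # same as sort(x = values, decreasing = TRUE)
descending
```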
Let's use the length function, which determines how many entries exist in the vector given as the argument x.
length(x = c(1, 2, 3, 4))
## [1] 4
length(x = 2:20)
## [1] 19
You may have noticed the [1] immediately to the left of the outputs in R.
The number in these brackets is the index of the entry directly to its right.
The index corresponds to the position of a value within a vector.
These indexes allow you to retrieve specific elements from a vector, which is known as subsetting.
You can access individual elements by asking R to return the values of vector_4 at specific locations.
This is done by entering the name of the vector followed by the position in square brackets.
Finally, a - sign in the [] reads “all of the elements except”.
vector_4 <- c(1, 200, 100, -5, 233, 2.2)
length(x = vector_4)
## [1] 6
vector_4[2]
## [1] 200
vector_4[-c(2, 4)]
## [1] 1.0 100.0 233.0 2.2
b <- vector_4[2]
b
## [1] 200
vector_4[length(x = b)]
## [1] 1
You can also delete individual elements by using the negative versions of the indexes supplied in square brackets.
vector_4
## [1] 1.0 200.0 100.0 -5.0 233.0 2.2
vector_4[-2]
## [1] 1.0 100.0 -5.0 233.0 2.2
vector_4[-6]
## [1] 1 200 100 -5 233
vector_5 <- vector_4[-6]
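Subsetting also accepts several positions at once, and it returns a copy, so the original vector is unchanged unless you assign back into it. A short sketch that redefines vector_4 so it runs on its own (the name shorter is illustrative):

```r
vector_4 <- c(1, 200, 100, -5, 233, 2.2)

vector_4[c(1, 3)]               # first and third elements
vector_4[2:4]                   # a run of consecutive positions
vector_4[length(x = vector_4)]  # the last element

# Negative indexes return a shortened copy; vector_4 keeps all six entries
shorter <- vector_4[-6]
length(x = vector_4)

# Assigning into a position overwrites that element
vector_4[2] <- 0
vector_4
```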
A matrix is several vectors stored together.
The size of a matrix is described by its number of rows and columns m x n.
m = rows.
n = columns.
A matrix A will be a m x n matrix:
A will have exactly m rows;
and A will have exactly n columns.
So A will have a total of mn entries.
To create a matrix in R we will make use of the matrix command.
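You must make sure that the length of the data vector equals the number of desired rows (nrow) multiplied by the number of desired columns (ncol).
You may elect not to supply nrow and ncol, in which case R returns a single-column matrix.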
A <- matrix(data = c(-10, 5, 500, 0.50), nrow = 2, ncol = 2)
A
## [,1] [,2]
## [1,] -10 500.0
## [2,] 5 0.5
G <- matrix(1:8, nrow = 4, ncol = 2)
G
## [,1] [,2]
## [1,] 1 5
## [2,] 2 6
## [3,] 3 7
## [4,] 4 8
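By default, matrix fills its entries column by column, as both outputs above show. The byrow argument switches to row-wise filling. A small sketch with illustrative object names:

```r
by_col <- matrix(data = 1:6, nrow = 2, ncol = 3)                # default column-wise fill
by_row <- matrix(data = 1:6, nrow = 2, ncol = 3, byrow = TRUE)  # row-wise fill
by_col
by_row

dim(by_row)  # number of rows, then number of columns

# Without nrow and ncol the data becomes a single column
single_col <- matrix(data = 1:4)
dim(single_col)
```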
To extract an entire row or column from a matrix, you simply specify the desired row or column number and leave the other blank.
Note that you must include the comma that separates the row and column numbers.
Otherwise R will be unable to distinguish between a row and a column.
H <- matrix(c(0, 5, 10, 15, 20, 80, -4, -10, -87), nrow = 3, ncol = 3)
H
## [,1] [,2] [,3]
## [1,] 0 15 -4
## [2,] 5 20 -10
## [3,] 10 80 -87
H[, 2]
## [1] 15 20 80
H[1, ]
## [1] 0 15 -4
H[2:3, ]
## [,1] [,2] [,3]
## [1,] 5 20 -10
## [2,] 10 80 -87
H[, c(3,1)]
## [,1] [,2]
## [1,] -4 0
## [2,] -10 5
## [3,] -87 10
H[c(3,1), 2:3]
## [,1] [,2]
## [1,] 80 -87
## [2,] 15 -4
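Note that extracting a single row or column returns a plain vector. If you need the result to remain a matrix, the drop argument of the square brackets can be set to FALSE. A brief sketch that redefines H so it runs on its own:

```r
H <- matrix(c(0, 5, 10, 15, 20, 80, -4, -10, -87), nrow = 3, ncol = 3)

single_entry <- H[2, 3]          # row 2, column 3
single_entry

col_vec <- H[, 2]                # a plain vector of length 3
col_mat <- H[, 2, drop = FALSE]  # stays a 3 x 1 matrix
dim(col_vec)                     # NULL, because vectors have no dimensions
dim(col_mat)

H[-1, ]                          # negative indexes work too: all rows except the first
```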
Statistical programming also requires non-numeric values.
Three important non-numeric values include: logicals, characters, and factors.
Logical values, which are also referred to as logicals, are based on a simple premise: a logical valued object can either be TRUE or FALSE.
Common interpretations of this are found in yes/no, one/zero, and satisfied/not satisfied responses.
Logical values in R are written fully as TRUE or FALSE but are commonly abbreviated as T or F.
The abbreviated versions have no effect on the execution of the code, but T and F are ordinary objects that can be overwritten, so avoid creating objects named T or F.
Assigning logical values to an object is the same as assigning numeric values.
cindy <- TRUE
cindy
## [1] TRUE
kris <- F
kris
## [1] FALSE
kyle <- c(T, F, F, F, T, F, T, T, T, F, T, F)
kyle
## [1] TRUE FALSE FALSE FALSE TRUE FALSE TRUE TRUE TRUE FALSE TRUE
## [12] FALSE
length(x = kyle)
## [1] 12
juice <- matrix(data = kyle, nrow = 3, ncol = 4, byrow = cindy)
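The matrix juice is created above but not printed. The sketch below rebuilds it so the example runs on its own, and also shows that R treats TRUE as 1 and FALSE as 0 in arithmetic, which makes counting easy.

```r
kyle <- c(TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE)
juice <- matrix(data = kyle, nrow = 3, ncol = 4, byrow = TRUE)
juice

true_count <- sum(kyle)   # TRUE counts as 1, FALSE as 0
true_share <- mean(kyle)  # proportion of TRUE values
per_row <- rowSums(juice) # number of TRUE values in each row

true_count
true_share
per_row
```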
Logicals are often used to check relationships between values.
You might want to know if some number a is greater or less than b.
Standard logical operators are used to produce logical values as results.
Below are some common logical operators:
== Equal to
!= Not equal to
> Greater than
< Less than
>= Greater than or equal to
<= Less than or equal to
1 == 2
## [1] FALSE
1 > 2
## [1] FALSE
2 < 1
## [1] FALSE
(2-1) <= 2
## [1] TRUE
1 != (2 + 10)
## [1] TRUE
dirt <- c(1, 10, 110, -3, 67, 400)
house <- c(3, 9, 56, -40, 300, 2)
dirt == house
## [1] FALSE FALSE FALSE FALSE FALSE FALSE
dirt < house
## [1] TRUE FALSE FALSE FALSE TRUE FALSE
house <= (dirt + 10)
## [1] TRUE TRUE TRUE TRUE FALSE TRUE
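A vector of comparison results can be summarized or used for subsetting. The sketch below uses the base R functions which(), any(), and all(), and redefines dirt and house so it runs on its own; the name smaller is illustrative.

```r
dirt <- c(1, 10, 110, -3, 67, 400)
house <- c(3, 9, 56, -40, 300, 2)

smaller <- dirt < house
which(smaller)  # positions where the condition holds
dirt[smaller]   # the matching values of dirt

any_equal <- any(dirt == house)             # is at least one pair equal?
all_within <- all(house <= (dirt + 10))     # does every pair satisfy the condition?
any_equal
all_within
```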
Logicals are especially useful when you want to examine whether multiple conditions are satisfied.
Let's view logical operators for use with two TRUE or FALSE objects.
The logical operators used to compare two TRUE or FALSE objects include AND, OR, and NOT.
& AND (element wise)
&& AND (single comparison)
| OR (element wise)
|| OR (single comparison)
! NOT
Note that there is an order of importance to logical operations in R.
AND statements have higher precedence than OR statements.
Place comparative pairs in parentheses to preserve their correct order.
FALSE || ((T && TRUE) | FALSE)
## [1] TRUE
! TRUE && TRUE
## [1] FALSE
(T && (TRUE || F)) && FALSE
## [1] FALSE
(6 < 4) || (3 != 1)
## [1] TRUE
mike <- c(T, F, F, F, T, F, T, T, T, F, T, F)
eric <- c(F, T, F, T, F, F, F, F, T, T, T, T)
mike
## [1] TRUE FALSE FALSE FALSE TRUE FALSE TRUE TRUE TRUE FALSE TRUE
## [12] FALSE
eric
## [1] FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE TRUE TRUE TRUE
## [12] TRUE
mike & eric
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE
## [12] FALSE
mike | eric
## [1] TRUE TRUE FALSE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE
## [12] TRUE
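# && and || compare single values only; since R 4.3.0, vectors longer than one element raise an error
mike[1] && eric[1]
## [1] FALSE
mike[1] || eric[1]
## [1] TRUE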
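The precedence rule stated earlier (AND before OR, with NOT applied first) can be seen by comparing expressions with and without parentheses. A minimal sketch with illustrative names:

```r
# && is evaluated before ||, so this reads TRUE || (FALSE && FALSE)
no_parens <- TRUE || FALSE && FALSE
# Parentheses force the OR to be evaluated first
with_parens <- (TRUE || FALSE) && FALSE

# ! applies before && and ||, so this reads (!TRUE) || TRUE
not_first <- !TRUE || TRUE
not_last <- !(TRUE || TRUE)

no_parens
with_parens
not_first
not_last
```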
Vectors, matrices, and arrays can only store one type of data.
Lists and data frames can store several types of values at once.
You create a list by using the list function and supplying the elements you wish to include in the list, separated by commas.
The list below list_1 contains a 2x2 matrix, a logical vector, and a character string.
list_1 <- list(matrix(data = 1:4, nrow = 2, ncol = 2), c(T, F, T, T), "hello")
list_1
## [[1]]
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
##
## [[2]]
## [1] TRUE FALSE TRUE TRUE
##
## [[3]]
## [1] "hello"
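List members are retrieved with double square brackets, after which the usual subsetting applies. The sketch below rebuilds list_1 so it runs on its own, and contrasts it with c(), which converts mixed entries to a single type; the name mixed is illustrative.

```r
list_1 <- list(matrix(data = 1:4, nrow = 2, ncol = 2), c(TRUE, FALSE, TRUE, TRUE), "hello")

list_1[[3]]        # double brackets return the member itself
list_1[[1]][2, 1]  # then subset as usual: row 2, column 1 of the matrix
class(list_1[3])   # single brackets return a smaller list

# c() coerces mixed entries to one type (here, character)
mixed <- c(1, TRUE, "hello")
mixed
class(mixed)
```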
A data frame is the most natural way of presenting a data set with a collection of recorded observations for one or more variables using R.
Much like lists, data frames have no restriction on the data types of the variables they can store.
The data frame is one of the most important and frequently used tools in R.
The difference between a data frame and a list is that in a data frame (unlike a list), the members must all be vectors of equal length.
To create a data frame from scratch we will use the data.frame function.
Note: the dollar operator ($) is used to get the named member or variable.
df1 <- data.frame(people = c("Peter", "Lois", "Meg", "Chris", "Stewie"), age = c(42, 40, 18, 23, 1), sex = factor(c("M", "F", "F", "M", "M")), stringsAsFactors = TRUE)
df1
## people age sex
## 1 Peter 42 M
## 2 Lois 40 F
## 3 Meg 18 F
## 4 Chris 23 M
## 5 Stewie 1 M
df1[2, 2]
## [1] 40
df1[3:5, 3]
## [1] F M M
## Levels: F M
df1[, c(3, 1)]
## sex people
## 1 M Peter
## 2 F Lois
## 3 F Meg
## 4 M Chris
## 5 M Stewie
df1$age
## [1] 42 40 18 23 1
df1$age[2]
## [1] 40
nrow(df1)
## [1] 5
ncol(df1)
## [1] 3
dim(df1)
## [1] 5 3
df1$people
## [1] Peter Lois Meg Chris Stewie
## Levels: Chris Lois Meg Peter Stewie
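Logical conditions can be placed in the row position to filter a data frame. The sketch below rebuilds df1 so it runs on its own; the names over_20 and females are illustrative.

```r
df1 <- data.frame(
  people = c("Peter", "Lois", "Meg", "Chris", "Stewie"),
  age = c(42, 40, 18, 23, 1),
  sex = factor(c("M", "F", "F", "M", "M"))
)

over_20 <- df1[df1$age > 20, ]  # keep rows where the condition is TRUE
over_20

females <- df1[df1$sex == "F", "people"]  # one column of the matching rows
females

mean_age <- mean(df1$age)
mean_age
str(df1)  # compact summary of each column's type
```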
If you want to add data to an existing data frame, you can use the rbind and cbind functions to append rows and columns.
new_rec <- data.frame(people = "Brian", age = 7, sex = factor("M", levels = levels(df1$sex)))
new_rec
## people age sex
## 1 Brian 7 M
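Appending the new record with rbind might look like the sketch below. rbind matches data frame columns by name, so this example assumes the first column is called people in both data frames; stringsAsFactors = FALSE is set explicitly so the result does not depend on the R version, and the name df2 is illustrative.

```r
df1 <- data.frame(
  people = c("Peter", "Lois", "Meg", "Chris", "Stewie"),
  age = c(42, 40, 18, 23, 1),
  sex = factor(c("M", "F", "F", "M", "M")),
  stringsAsFactors = FALSE
)
new_rec <- data.frame(
  people = "Brian",
  age = 7,
  sex = factor("M", levels = levels(df1$sex)),
  stringsAsFactors = FALSE
)

# Column names must match for rbind to line the records up
df2 <- rbind(df1, new_rec)
df2
nrow(df2)
```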
Adding a variable (a new column) to a data frame is fairly straightforward.
Let's say we are given data on the classification of how funny six individuals are.
Each of these six individuals is assigned a “degree of funniness”.
There are three values for the degree of funniness: Low, Med (Medium), and High.
Let's assume the following: Low: Meg; Med: Chris and Brian; High: Peter, Lois, and Stewie.
#create the basic character vector as funny
funny <- c("High", "High", "Low", "Med", "High", "Med")
#overwrite the basic character vector by turning it into a factor
funny <- factor(x = funny, levels = c("Low", "Med", "High"))
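One way to attach the factor as a new column is with the dollar operator (cbind would work as well). The sketch below uses a small stand-in data frame, df_funny, which is an assumption made for this example, with the six names in the same order as the funny vector.

```r
people <- c("Peter", "Lois", "Meg", "Chris", "Stewie", "Brian")
funny <- factor(
  x = c("High", "High", "Low", "Med", "High", "Med"),
  levels = c("Low", "Med", "High")
)

df_funny <- data.frame(people = people, stringsAsFactors = FALSE)
df_funny$funny <- funny  # the new column must have one entry per row
df_funny

funny_counts <- table(df_funny$funny)  # counts follow the level order Low, Med, High
funny_counts

most_funny <- df_funny[df_funny$funny == "High", "people"]
most_funny
```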
Davies, Tilman M. The Book of R. San Francisco, CA. No Starch Press, Inc. 2016.