Latest Versions & Updates: This markdown document was built using the following versions of R and RStudio:
We’ve seen a few snippets of code in Intro to R: Nuts & Bolts. Now, we’re going to start seeing some more expressions “in the field”. Thanks to the implementation of literate programming, I’m able to weave machine-readable R expressions into this work while explaining those expressions in human-readable language.
Note that unformatted font, for example, this, that, and the other thing, is used to indicate machine-readable language, even if it’s used in-line, like the example. It’s a simple and unobtrusive way to differentiate human-readable language from expressions intended for machine consumption. Note this particular formatting, or lack thereof, when you see it - typically, it’s used to flag datasets, variables, entire expressions, function and package names, and operators.
So-called code chunks, unlike unformatted font, are much easier to discern. “Code chunks” allow literate programming authors to insert machine-readable code in human-readable text. Behind the scenes, “code chunks” are often executed, without alerting the reader, in order to produce tables, visualizations, interactive tools, and more. In instructional materials, e.g. the present work, code chunks are used for demonstrative purposes, such as how to use a particular function. The following is an example of two “code chunks”, the first of which executes without output, and the second of which will both execute the expression and print the resulting output.
my_example <- "This is an example of a code chunk."
Now, we’ll both execute and print the results.
print(my_example)
## [1] "This is an example of a code chunk."
When I first began studying R, one of my more regrettable mistakes - apart from not learning R earlier in life - was that I’d read literature on R and simply look at the coding examples. This was an error. If possible, try running every bit of ostensibly non-malicious code you find. There’s a reason most literature on R takes advantage of literate programming via code chunks, so read with RStudio open, and experiment with new expressions in the R console often.
Using Local Data: Where appropriate, we’ll either demonstrate using or practice with squeaky clean, local data from CNY Vitals Pro. These data are invariably well-formatted, small in size, and excellent for instruction. As other sources are introduced, don’t just use R’s built-in data, use the data from the world around you. It’s a bit more motivating, and you’ll hone your domain expertise and hacking skills simultaneously.
No introduction to R would be complete without an introductory example of how R functions like a scientific calculator - a very powerful calculator, but a calculator nonetheless. Understanding this, however, is the foundation on which rests the architecture of your hacking skills.
Data are comprised of values. Though there are many kinds of data, typically the most common are numeric values, which work just like numbers in a basic calculator. We can perform operations on numeric data using arithmetic operators, for example:
- `+` for addition
- `-` for subtraction
- `/` for division
- `*` for multiplication
- `^` for exponents
- `()` for parentheses

These arithmetic operators may be used in expressions to perform arithmetic calculations, like addition:
2 + 2
## [1] 4
Likewise, there’s subtraction:
5 - 1
## [1] 4
Let’s not forget multiplication or division:
(3 * 4) / 3
## [1] 4
And, of course, exponents, like 2 squared:
2^2
## [1] 4
Do you recall the “order of operations”, sometimes referred to as “operator precedence”, you learned back in grade school? Me neither. But I do remember “Please Excuse My Dear Aunt Sally” (or “PEMDAS”), i.e. (1) Parentheses, (2) Exponents, (3) Multiplication, (4) Division, (5) Addition, and (6) Subtraction. Strictly speaking, multiplication and division share one rank and are evaluated left to right, as do addition and subtraction.
This holds true in R, as well. Let’s look at a more complex expression:
2 + (6 * 2) / ((3^2) / 3) - 2
## [1] 4
Here, R evaluates the expressions in the parentheses first (“Please” or “P”), i.e. (6 * 2) and ((3^2) / 3), respectively. Because (3^2) is a set of parentheses nested inside another set, it’s evaluated before all others. It’s like the film Inception, except it makes sense.
(3^2)
## [1] 9
(9 / 3)
## [1] 3
That was the second instance of () in the expression, albeit broken down into smaller pieces. Let’s see if R calculates the entire contents within the () in the same manner:
((3^2) / 3)
## [1] 3
Sweet, it seems so. Let’s look at all the operations within (), i.e. (6 * 2) / ((3^2) / 3). Here, R follows “PEMDAS” to the letter (heh). It begins by evaluating the contents of the () (“Please” or “P”), followed by evaluation of the / (“Dear” or “D”).
(6 * 2) / ((3^2) / 3)
## [1] 4
The 2 + and - 2 cancel each other out, but would be evaluated last, per “PEMDAS”, resulting in 4.
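The mnemonic hides a few wrinkles, so here is a quick sketch of my own (not part of the walkthrough above) that you can run in the console. It shows how R treats operators of equal rank, a leading minus sign, and stacked exponents. The numbers are arbitrary.

```r
# Division and multiplication share a rank, so R works left to right
8 / 4 * 2
## [1] 4

# Exponents are evaluated before a leading minus sign: this is -(2^2)
-2^2
## [1] -4

# Parentheses change that
(-2)^2
## [1] 4

# Stacked exponents are evaluated right to left: this is 2^(3^2)
2^3^2
## [1] 512
```

When in doubt, add parentheses. They cost nothing and make your intent obvious to the next reader.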
R supports object-oriented programming, or OOP. While explaining OOP falls outside the scope of this introduction, it’s critical to understand the importance of objects. In fact, you may hear the word “object” quite a bit, as objects are essentially devices that store information. In OOP, objects are self-contained and fiercely guarded, and may only be acted on or changed through the express use of functions (sometimes referred to as “methods”). The curious learner may wish to learn about this OOP property, called “encapsulation”.
Just about everything, apart from bare values, is an object. Data are stored in various ways within objects, from a small collection of values to massive, tabular datasets. A single value (i.e. a datum) may be stored in an object. Functions are stored as objects. A string, or a sequence of letters or numbers, may be stored in an object. Even arithmetic operators, which are actually functions, are also objects, albeit “primitive” ones.
R contains many built-in objects, from functions to datasets. By way of example, let’s look at two objects: letters and LETTERS, which contain all 26 letters in the English alphabet in lower and upper case, respectively. By simply typing the name of the object, letters, R uses an auto-printing mechanism to automatically print the contents of the object.
letters
## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q"
## [18] "r" "s" "t" "u" "v" "w" "x" "y" "z"
Though it’s not often necessary outside of creating new functions or making your code more readable, we can explicitly command R to print the object’s contents using the print() function. Let’s try this with the object LETTERS.
print(LETTERS)
## [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q"
## [18] "R" "S" "T" "U" "V" "W" "X" "Y" "Z"
This is also demonstrative of how R is case sensitive. For example, LetteRs is not recognized as a built-in object.
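If you’d like to check case sensitivity without triggering an error, here is a small sketch using exists(), a base R function not covered in this tutorial. It reports whether an object with a given name can be found. Note that the name is passed in quotes.

```r
# Both built-in objects are found under their exact names
exists("letters")
## [1] TRUE
exists("LETTERS")
## [1] TRUE

# Mixed case matches neither, so R finds nothing
exists("LetteRs")
## [1] FALSE
```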
During an R session, objects are typically stored locally in your workspace. We can easily store individual values, datasets, functions, and even entire expressions by using the assignment operator, or <-. The object to the left of the assignment operator is assigned the information to the right of the assignment operator.
Let’s see what this looks like in action. We’ll store the numeric value 7 in the object named lucky_number:
lucky_number <- 7
Now, let’s print the contents of the object, lucky_number, using R’s auto-print mechanism. That is, we simply type and run the object name:
lucky_number
## [1] 7
Like singular values, we can assign entire expressions to an object. Let’s use the same expression on which we practiced the order of operations, 2 + (6 * 2) / ((3^2) / 3) - 2. We’ll name the object my_equation. Note that the entirety of the expression to the right of the assignment operator (<-) will be evaluated and stored in the object to the left of the operator.
my_equation <- 2 + (6 * 2) / ((3^2) / 3) - 2
Recall that all of the arithmetic expressions in the above examples evaluated to 4. Let’s call the object my_equation to see what happens.
my_equation
## [1] 4
Egad! The object, my_equation, now stores a single value: 4.
In the above example, the object, my_equation, and the value, 4, are interchangeable. Let’s see what that means by way of example. First, we’ll use the object lucky_number in an arithmetic operation. Recall that the value in lucky_number is 7:
lucky_number - 2
## [1] 5
By subtracting 2 from lucky_number, the expression then evaluates to 5, i.e. 7 - 2.
What about my_equation, our object that stored an expression that evaluated to a single value: 4?
1 + my_equation
## [1] 5
Again, the object acts interchangeably with the value, resulting in 5. For the grand finale, let’s find the sum of objects lucky_number and my_equation, equal to 7 and 4, respectively:
lucky_number + my_equation
## [1] 11
As one might expect, both objects are evaluated arithmetically to 7 + 4, respectively, summing to 11. This has enormous implications.
Do not use a “space” when naming an object. An error message will be thrown, as R will fail to recognize what may be perceived as two individual objects. This returns us to the conventions discussed in Intro to R: Nuts & Bolts, and especially case. When naming objects, you can use periods (.), underscores (_), or CamelCaps to create compound object names of more than one word.
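To see the naming rule in action without cluttering your workspace, here is a hedged sketch using three base R functions this tutorial doesn’t otherwise cover: make.names(), which converts text into a syntactically valid name, plus parse() and tryCatch(), which let us ask R to read a line of code and catch the resulting error. The message text inside tryCatch() is my own wording, not R’s.

```r
# A space is not allowed, so make.names() swaps it for a period
make.names("lucky number")
## [1] "lucky.number"

# An underscore is fine, so the name comes back unchanged
make.names("lucky_number")
## [1] "lucky_number"

# Asking R to read an assignment with a space in the name fails
tryCatch(
  parse(text = "lucky number <- 7"),
  error = function(e) "R cannot parse a name containing a space"
)
## [1] "R cannot parse a name containing a space"
```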
What’s more, R only recognizes objects when they are “bare”. That is, when they are not in quotes. Observe:
lucky_number
## [1] 7
Compare this to:
"lucky_number"
## [1] "lucky_number"
In the first scenario, R is able to recognize that lucky_number is an object, and correctly prints its contents. In the second scenario, the quotations ("") signal to R that "lucky_number" is not an object, but a string (a sequence of characters). In effect, it simply prints the sequence as output. Keep this in mind going forward - sometimes you may need to add quotes, other times you may need to omit them, depending on your intention. Like with case sensitivity, R is also sensitive to quotation, and this may be a minor source of frustration for new R users.
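Here is one more way to feel the difference, sketched with two base R functions that are my additions rather than part of this tutorial: nchar(), which counts the characters in a string, and get(), which looks up an object from its quoted name.

```r
lucky_number <- 7

# Bare: R finds the object and uses its value
lucky_number + 1
## [1] 8

# Quoted: R sees 12 characters of text, not the number 7
nchar("lucky_number")
## [1] 12

# get() bridges the two by fetching the object named in the string
get("lucky_number")
## [1] 7
```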
Once you’ve initialized an object, whether it contains information or is entirely empty (which is possible), RStudio neatly lists stored objects in the upper-right “Environment” panel, as well as displays the first few values stored, if possible. You can print all stored objects to the R console or use them in your code. If you happen to have many objects stored, you can easily print those objects with the function ls(), the “List Objects” function. Note that ls() requires no additional inputs.
ls()
## [1] "lucky_number" "my_equation" "my_example"
RStudio neatly arranges and labels your objects in the “Environment” panel.
Lastly, we can easily remove an object from our workspace or other environments using the function rm() and inputting the name of the object to be removed. Here, we’ll remove lucky_number using rm() and then inspect our remaining objects using function ls().
rm(lucky_number)
ls()
## [1] "my_equation" "my_example"
Function rm() may not seem immediately useful, but it can be very beneficial if you “clean as you go”, similar to preparing a meal at home. As soon as an object is obsolete, remove it with rm() to keep your workspace organized.
You can combine the “List Objects” function ls() and “Remove Objects” function rm() to remove all objects present in your workspace. To do so, rm() requires an argument, or a specified input to a function: list =, to which we pass the output of ls(). We’ll learn more about arguments when we discuss the anatomy of a function, but for now, it’s useful to know how to do this both in your script and in the console.
rm(list = ls())
Instead of using the above code, you can simply click on the broom icon located in the upper-right “Environment” panel in RStudio.
You can remove all objects by clicking the broom icon in RStudio’s “Environment” panel.
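The same list = argument works for partial clean-ups, too. As a minimal sketch, assuming you prefix throwaway objects with tmp_ (a convention I’m inventing for this example), ls() accepts a pattern argument that narrows the names it returns. The object names below are hypothetical.

```r
# Three practice objects; only two are throwaways
tmp_a <- 1
tmp_b <- 2
keep_me <- 3

# List only the names that begin with "tmp_"
ls(pattern = "^tmp_")
## [1] "tmp_a" "tmp_b"

# Remove just those, leaving everything else alone
rm(list = ls(pattern = "^tmp_"))
exists("tmp_a")
## [1] FALSE
exists("keep_me")
## [1] TRUE
```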
Aside from arithmetic operators, R also evaluates relational operators, sometimes called “comparators”. In human-readable format, relational operators are often described as:
Unlike the output of arithmetic operators, which is generally numeric, the output of relational operators is logical; that is, it takes one of two values: TRUE or FALSE. As an aside, you can determine whether a value or set of values (including objects) is numeric or logical by calling the class() function.
class(15)
## [1] "numeric"
class(TRUE)
## [1] "logical"
We’ll learn more about classes in later tutorials.
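As a small sketch of my own, class() also works on whole expressions and on quoted text. The values here are arbitrary.

```r
# The result of a comparison is a logical value
class(5 > 10)
## [1] "logical"

# Quoted text has its own class
class("below")
## [1] "character"
```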
Relational operators or logical values are also referred to as “binary” or “Boolean”. For an in-depth treatment of logical operators, the curious learner may read the official CRAN documentation.
You can probably think of a few different relational operators now that we’ve defined them. To wit, they include:
- `>` for “greater than”
- `<` for “less than”
- `>=` for “greater than or equal to”
- `<=` for “less than or equal to”
- `==` for “exactly equal to”
- `!=` for “not equal to”

Again we may fuzzily rely on our grade school takeaways. Invariably, logic statements which satisfy the criteria to be true will always result in TRUE, while statements which do not satisfy the expression’s criteria are evaluated as FALSE.
Let’s begin with a simple example: 5 == 5, or “5 is equal to 5”. Hopefully, we can evaluate a priori that this statement is, in fact, TRUE. Let’s run it in the R console to find out:
5 == 5
## [1] TRUE
As suspected, 5 == 5 evaluates to TRUE. Let’s observe another statement, 5 > 10, or “5 is greater than 10”, which we suspect will evaluate to FALSE:
5 > 10
## [1] FALSE
Again, we’re right on the money.
Interestingly, the ! operator negates a statement. When it is present in a logical statement, it flips the logical value evaluated, i.e. TRUE becomes FALSE, and vice versa. Observe the statement 5 != 3, or “5 is not equal to 3”, which we expect to evaluate to TRUE:
5 != 3
## [1] TRUE
When a logical statement is wrapped in parentheses, (), and preceded by !, the entirety of the statement is negated. Let’s negate the evaluation of 10 < 20, or “10 is less than 20”:
!(10 < 20)
## [1] FALSE
While 10 < 20 evaluates to TRUE, we negate the entire logical statement using the ! operator, instead evaluating to FALSE.
Logical operators combine logical statements in a manner which evaluates to TRUE if either one, more than one, or all logical statements evaluate to TRUE. Let’s further define these:
- `&` or `&&` for “and”: all statements must evaluate to TRUE
- `|` or `||` for “or”: at least one statement must evaluate to TRUE

Let’s take a gander at an example to better understand how logical operators work. First, the & operator evaluates two or more statements and, if all evaluate to TRUE, the entire expression also evaluates to TRUE. We’ll use two simple statements (sometimes referred to in this context as “operands”): 1 == 1 or “1 equals 1” and 10 < 5, or “10 is less than 5”. Since the first statement is TRUE and the second is FALSE, the entire expression evaluates to FALSE, since not all criteria for TRUE are met in the expression:
1 == 1 & 10 < 5
## [1] FALSE
Simple enough. Again, & requires that both operands be TRUE to evaluate the entire expression as TRUE.
We can make this expression evaluate to TRUE by using negation with the ! operator. Since 10 < 5 evaluates to FALSE, wrapping the statement in parentheses and preceding it with ! will coerce R to evaluate it as TRUE, i.e. TRUE & TRUE.
1 == 1 & !(10 < 5)
## [1] TRUE
Cool!
Related to the & (“and”) operator is | (“or”), which only requires 1 of 2 operands to be TRUE. Let’s evaluate our original example, 1 == 1 & 10 < 5. Recall that the former operand is TRUE, while the latter is FALSE. Since we’ll instead use | (“or”), only 1 operand needs to be TRUE. Therefore, the following should evaluate to TRUE:
1 == 1 | 10 < 5
## [1] TRUE
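For reference, here is a compact sketch of my own that uses bare logical values rather than comparisons. The last two lines illustrate a detail not covered above: when & and | appear in the same expression, R evaluates & first, much as multiplication comes before addition.

```r
# "and" needs every operand to be TRUE
TRUE & TRUE
## [1] TRUE
TRUE & FALSE
## [1] FALSE

# "or" needs at least one operand to be TRUE
TRUE | FALSE
## [1] TRUE
FALSE | FALSE
## [1] FALSE

# & is evaluated before |, so this reads as TRUE | (FALSE & FALSE)
TRUE | FALSE & FALSE
## [1] TRUE

# Parentheses override that order
(TRUE | FALSE) & FALSE
## [1] FALSE
```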
Takeaway: Relational and logical operators can create very complex logical statements, but logic is a pillar of coding, and the more you hone your abilities in logic, the better you’ll be able to parse code into individual elements and understand how they interact. Just as important, relational operators are key to filtering data based on criteria, as well as control flow structures - i.e. if \(x\) is greater than \(y\), perform \(z\).
We’ll practice more with logic at the end of this tutorial.
Recall that objects may store one or more values and may be used as variables in arithmetic operations. The same is true in logical statements. Here, we’ll assign the value 25 to object: five_squared and evaluate a simple logical statement. First, let’s perform an assignment statement with the assignment operator, or <-:
five_squared <- 25
Now, let’s exponentiate the value 5 using the arithmetic operator for exponents, ^, and see if it equals 25, the value stored in five_squared:
five_squared == 5^2
## [1] TRUE
Again, this has broad implications in Object-Oriented Programming (OOP).
We’ve seen how to store a single value in an object, as we did with five_squared (25) and lucky_number (7) before that. One of the simplest data structures in R for storing multiple values of the same class is known as a vector, and may be created using the concatenate function, or c(). Function c() takes any number of values of the same class, separated by commas (,) as input. Observe:
c(2, 4, 6, 8, 10)
## [1] 2 4 6 8 10
Some R users refer to this as “combining” values, though “concatenating” describes the behavior more precisely. In concatenation, using c(), values are stored as distinct elements. You can think of “combining” values as a “melting pot”, where multiple values are combined to create a single value, while “concatenating” values is more of a “salad bowl”. Concatenated values preserve the distinct elements as separate from one another, allowing you to extract a crouton or a cherry tomato, while keeping the values organized in a container of sorts.
We can assign concatenated values of the same class to an object using function c() and assignment operator <-:
even_numbers <- c(2, 4, 6, 8, 10)
Print the contents of the vector even_numbers using R’s auto-print mechanism by running the name of the object:
even_numbers
## [1] 2 4 6 8 10
The number of elements in a vector is measured in length using function length(). We can determine how “long” a vector is, i.e. how many elements it contains, by passing the object name as an argument to length(). For example, recall that letters is a built-in object containing all the lowercase letters of the English alphabet. How many elements do you believe exist in this object? Let’s find out:
length(letters)
## [1] 26
What about our newly-created vector, even_numbers?
length(even_numbers)
## [1] 5
You’ll work with vectors frequently, so it’s important to understand how they’re created, measured, and used.
You should think of a single value, e.g. 29, as a vector of length 1. That is, it is a vector with only one element!
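Two behaviors of c() are worth seeing for yourself. The sketch below is my own, and the object names nested and mixed are made up for illustration. It shows that c() flattens vectors placed inside it, and that mixing classes makes R quietly convert every element to a single class.

```r
# A vector inside c() is flattened into one longer vector
nested <- c(c(1, 2), 3)
length(nested)
## [1] 3

# Mixed classes are converted to one class, here character
mixed <- c(1, "a", TRUE)
mixed
## [1] "1"    "a"    "TRUE"
class(mixed)
## [1] "character"
```

This is why vectors are described as holding values of the same class: if you supply different classes, R picks one for you.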
R has a special property known as recycling. Vectors, when interacting with one another, will evaluate on an element-by-element basis (for better or worse). Let’s create a new vector, odd_numbers, with the same length as even_numbers.
odd_numbers <- c(1, 3, 5, 7, 9)
Again, we’ll verify the lengths of even_numbers and odd_numbers using function length() and use a logical statement to confirm that their lengths are equal:
length(even_numbers)
## [1] 5
length(odd_numbers)
## [1] 5
length(even_numbers) == length(odd_numbers)
## [1] TRUE
Awesome! Now let’s create a vector of length 1, stored in an object named one_element. The value stored will be 5. We’ll confirm the length with length():
one_element <- 5
length(one_element)
## [1] 1
Since even_numbers is of length 5 and one_element is of length 1, using a relational operator between the two vectors will force R to recycle the single-element vector iteratively over each of the 5 elements in even_numbers. For example, let’s evaluate one_element < even_numbers.
one_element < even_numbers
## [1] FALSE FALSE TRUE TRUE TRUE
What happened? R recycles the value 5 in one_element iteratively, in 5 separate comparisons, for each element of even_numbers. That is:
- `5 < 2`
- `5 < 4`
- `5 < 6`
- `5 < 8`
- `5 < 10`

In effect, these return a new vector comprised of 5 logical values, one for each evaluation!
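Recycling applies to arithmetic operators as well. Here is a minimal sketch of my own, with even_numbers re-created so the chunk runs by itself and a second pair of vectors made up for illustration.

```r
even_numbers <- c(2, 4, 6, 8, 10)

# The single value 10 is recycled across all 5 elements
even_numbers * 10
## [1]  20  40  60  80 100

# A length-2 vector is recycled twice across a length-4 vector:
# 1 + 10, 2 + 20, 3 + 10, 4 + 20
c(1, 2, 3, 4) + c(10, 20)
## [1] 11 22 13 24
```

If the longer length is not a multiple of the shorter one, R still recycles but issues a warning, which is usually a sign that something is off.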
Let’s apply these new techniques to a local dataset from CNY Vitals Pro. Specifically, we’ll look at “Poverty Over Time” in Syracuse. Run the following code to store these data as object: poverty. These data originate in Census ACS, which you may explore here.
url <- "https://tinyurl.com/ybbpdc9q"
poverty <- read.csv(url, stringsAsFactors = FALSE)
rm(url)
We’ll also install the dplyr package to keep things simple. Don’t mind the code for now, but do run it in your console!
if(!require(dplyr)){install.packages("dplyr")}
library(dplyr)
We can use logical statements to filter the rows of a dataset using package dplyr and function filter(). The first argument is the name of the dataset, poverty. The second argument is a logical statement: the name of one of the dataset’s variables, followed by a relational operator and the value to compare against.
First, take a look at the contents of poverty with dplyr function glimpse():
glimpse(poverty)
## Observations: 416
## Variables: 9
## $ ID.Year <int> 2009, 2009, 2009, 2009, 2009, 2009, 200...
## $ Year <int> 2009, 2009, 2009, 2009, 2009, 2009, 200...
## $ ID.Sex <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Sex <chr> "male", "male", "male", "male", "male",...
## $ ID.Age.Group <int> 0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, ...
## $ Age.Group <chr> "under5", "under5", "5", "5", "6to11", ...
## $ ID.Poverty.Line.Status <int> 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, ...
## $ Poverty.Line.Status <chr> "below", "above", "below", "above", "be...
## $ Population.Sum <dbl> 2443, 2301, 408, 699, 2456, 2892, 842, ...
Note that there are 416 observations (rows). Let’s create a statement which filters variable Poverty.Line.Status by below, so we only view data on demographics below the poverty line. We’ll assign the results to a new object: poverty_below:
poverty_below <- filter(poverty, Poverty.Line.Status == "below")
Note that instead of == applied to a numeric value, we’ve applied it to a character value: below. Let’s inspect our new dataset, poverty_below with function glimpse().
glimpse(poverty_below)
## Observations: 208
## Variables: 9
## $ ID.Year <int> 2009, 2009, 2009, 2009, 2009, 2009, 200...
## $ Year <int> 2009, 2009, 2009, 2009, 2009, 2009, 200...
## $ ID.Sex <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Sex <chr> "male", "male", "male", "male", "male",...
## $ ID.Age.Group <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 1...
## $ Age.Group <chr> "under5", "5", "6to11", "12to14", "15",...
## $ ID.Poverty.Line.Status <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Poverty.Line.Status <chr> "below", "below", "below", "below", "be...
## $ Population.Sum <dbl> 2443, 408, 2456, 842, 311, 570, 3357, 1...
We’ve effectively split the data in half! Now there are 208 observations. Say we’re interested in only analyzing data from 2014 or later. We can create a new statement with relational operator >= (“greater than or equal to”) on variable Year. We’ll store this in object poverty_below_2014_up:
poverty_below_2014_up <- filter(poverty_below, Year >= 2014)
Notice how we used the newly created object poverty_below as the input to filter(). Let’s quickly check the new number of rows and columns in poverty_below_2014_up using function dim(), i.e. “dimensions”:
dim(poverty_below_2014_up)
## [1] 78 9
Great! We’ve successfully reduced the number of observations from 416 to 78. Now we can perform a more focused analysis.
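The two filtering steps can also be collapsed into one by joining the logical statements with &. The sketch below is mine, with two caveats. First, it uses a tiny made-up dataset, toy, that borrows three variable names from poverty, so it runs without a download. Second, it uses base R’s subset() rather than dplyr’s filter(), so it runs without any packages. filter() accepts the same combined statement as its second argument.

```r
# Made-up data for illustration only; not the CNY Vitals Pro figures
toy <- data.frame(
  Year = c(2013, 2014, 2015, 2016),
  Poverty.Line.Status = c("below", "above", "below", "below"),
  Population.Sum = c(100, 200, 300, 400),
  stringsAsFactors = FALSE
)

# Keep rows that are both "below" AND from 2014 or later
toy_below_2014_up <- subset(
  toy,
  Poverty.Line.Status == "below" & Year >= 2014
)

# Two rows survive (2015 and 2016), with all three columns intact
dim(toy_below_2014_up)
## [1] 2 3
```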
We may not always want to filter our dataset to explore or verify some sort of information. For example, using the poverty dataset with which we started, we can simply print a list of logical values depending on some criteria we’ve set. You can subset a variable from a dataset using the name of the dataset, the $ operator, and the name of the variable. For example:
poverty$Year
## [1] 2009 2009 2009 2009 2009 2009 2009 2009 2009 2009 2009 2009 2009 2009
## [15] 2009 2009 2009 2009 2009 2009 2009 2009 2009 2009 2009 2009 2009 2009
## [29] 2009 2009 2009 2009 2009 2009 2009 2009 2009 2009 2009 2009 2009 2009
## [43] 2009 2009 2009 2009 2009 2009 2009 2009 2009 2009 2010 2010 2010 2010
## [57] 2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 2010
## [71] 2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 2010
## [85] 2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 2010
## [99] 2010 2010 2010 2010 2010 2010 2011 2011 2011 2011 2011 2011 2011 2011
## [113] 2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 2011
## [127] 2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 2011
## [141] 2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 2011
## [155] 2011 2011 2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 2012
## [169] 2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 2012
## [183] 2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 2012
## [197] 2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 2013 2013
## [211] 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013
## [225] 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013
## [239] 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013
## [253] 2013 2013 2013 2013 2013 2013 2013 2013 2014 2014 2014 2014 2014 2014
## [267] 2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 2014
## [281] 2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 2014
## [295] 2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 2014
## [309] 2014 2014 2014 2014 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015
## [323] 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015
## [337] 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015
## [351] 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015
## [365] 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016
## [379] 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016
## [393] 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016
## [407] 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016
This actually prints a vector comprised of the value of variable Year for every single observation (\(n\) = 416). What happens if we use a relational operator on this vector? Say we’re only interested in data from 2011 and want to know how many observations match this criterion. Observe:
poverty$Year == 2011
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [45] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [56] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [67] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [89] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [100] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
## [111] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [122] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [133] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [144] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [155] TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [166] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [177] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [188] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [199] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [210] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [221] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [232] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [243] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [254] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [265] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [276] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [287] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [298] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [309] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [320] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [331] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [342] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [353] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [364] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [375] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [386] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [397] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [408] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
The expression simply prints a number of logical values, with TRUE appearing for values occurring in 2011.
We can do better. Using function which(), we can determine the row numbers for all observations which evaluate to TRUE.
which(poverty$Year == 2011)
## [1] 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121
## [18] 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138
## [35] 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155
## [52] 156
That might not be immediately useful, but it will help later, I promise. What may be more useful in the context of this tutorial is knowing the number of observations that are TRUE, i.e. how many observations occur in 2011.
Recall that logical values are also referred to as “binary” values. This is because, under the hood, a FALSE is represented numerically as 0, while a TRUE is represented numerically as 1. Observe:
TRUE == 1
## [1] TRUE
If that’s the case, in theory, the sum of all TRUE values, each representing 1, will also provide the total count of observations which satisfy the logical statement. So how many observations occur in 2011? We can determine this with function sum():
sum(poverty$Year == 2011)
## [1] 52
There are 52 observations in 2011. Pretty neat!
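Because each TRUE counts as 1 and each FALSE as 0, the same trick yields proportions. Here is a small sketch on a made-up vector, years, rather than the real dataset. It pairs sum() with mean(), which appears in the challenges that follow, and with which().

```r
# A made-up vector of four years, for illustration only
years <- c(2009, 2011, 2011, 2016)

# How many elements equal 2011?
sum(years == 2011)
## [1] 2

# What share of elements equal 2011? (2 of 4)
mean(years == 2011)
## [1] 0.5

# Which positions hold them?
which(years == 2011)
## [1] 2 3
```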
The following provides a list of challenges and instructions using logical statements. Some are thinking exercises to hone your logic, while others are applied to local data on “Educational Attainment” in Onondaga County from CNY Vitals Pro. Use the following code to read in object education:
url <- "https://tinyurl.com/ya7xccbn"
education <- read.csv(url, stringsAsFactors = FALSE)
rm(url)
Instructions: The following is a list of logical statements. Use the information provided in this tutorial and your own skill to carefully consider which of the following statements evaluate to either TRUE or FALSE. Run the code in your R console to check your answers.
- `20 > 50`
- `!(30 > 40)`
- `33 * 3 <= 100`
- `TRUE | FALSE`
- `(TRUE | 22 > 75) & !(150 != 150)`

Instructions: Use the following prompts for each challenge and the dataset education to determine an answer. Some challenges may ask you to use a new function, like mean() or median(). Recall that you can subset a variable from a dataset using dataset_name$variable_name notation.
- Consider variable Year in dataset education. Use sum() to find the number of observations after, not during, 2014.
- The following are the unique values in variable Education.Attained:

unique(education$Education.Attained)
## [1] "NoSchoolingCompleted" "Nursery"
## [3] "Kindergarten" "1st"
## [5] "2nd" "3rd"
## [7] "4th" "5th"
## [9] "6th" "7th"
## [11] "8th" "9th"
## [13] "10th" "11th"
## [15] "12thNoDiploma" "RegularHighSchool"
## [17] "GED" "SomeCollegeLessThan1yr"
## [19] "SomeCollegeMoreThan1yr" "AssociateDegree"
## [21] "BachelorDegree" "MasterDegree"
## [23] "ProfessionalSchoolDegree" "DoctorateDegree"
Use a logical statement in function filter() to keep only those observations with “MasterDegree” in variable Education.Attained. Store the results in object masters with the assignment operator (<-). It should start off with masters <- filter(education, Education.Attained ___ _______). Use function mean() on masters$Population.Sum to find the average number of county residents with a Master’s Degree.
Great job! You now have a solid grasp on arithmetic, relational, and logical operators in R! With just a few of the above examples, I hope you can see just how important these concepts are in data analysis and beyond.