1 Introduction:

This markdown document is a quick introduction to get students up to speed with basic R concepts. Many of the concepts covered here come from operations I use in everyday programming. It is meant to be quick rather than comprehensive. I am including the R Markdown script with this HTML document so that students can experiment with the code chunks and practice the concepts presented here.

This markdown document will be available to view and download from my GitHub while in production. Keep in mind that a topic not being covered in this document doesn’t mean that it won’t be covered in the future. I consider this a working document, and it likely will be for a very long time.

2 Basic Arithmetic

10 + 7
## [1] 17
10 - 7
## [1] 3
10 * 7
## [1] 70
10 / 7
## [1] 1.428571
10 ** 7
## [1] 1e+07
10 ^ 7
## [1] 1e+07
10 %/% 7 # %/% is integer (floor) division: it divides 10 by 7 and rounds down to the nearest whole number
## [1] 1
10 %% 7 # %% is the modulus operator: it returns the remainder of 10 / 7
## [1] 3
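
The two operators fit together: multiplying the result of %/% by the divisor and adding the result of %% gives back the original number. Here is a minimal sketch of that relationship. The variable names (a, b, quotient, remainder) are my own, chosen just for illustration.

```r
a <- 10
b <- 7

quotient  <- a %/% b   # integer (floor) division: 1
remainder <- a %% b    # modulus (remainder): 3

# Rebuilding a from the quotient and the remainder
quotient * b + remainder == a

# With a negative number, %/% still rounds down (toward -Inf),
# so the remainder takes the sign of the divisor
(-10) %/% 7
(-10) %% 7
```

The last two lines return -2 and 4, because -2 * 7 + 4 is -10.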

3 Storing data in variables

In R, you can use two different operators to store data in a variable. You could use the traditional = to store information, as in x = 15*25. However, the usual best practice in R is to use the assignment operator, <-, as in x <- 15*25. Intuitively, this can be read as "store the result of 15*25 in the variable x".

x = 15*25
print(x)
## [1] 375
x <- 15*25
print(x)
## [1] 375
## same result

However, variables can store much more than single numbers. Variables can store just about anything in R. Variables can store things like formulas: outcome ~ pred1 + pred2 + pred3 used for making models. You can also store strings, lists, factors, data frames, vectors, ggplots and complicated visuals. Almost anything you can imagine in R can be stored as a variable.

4 Data types

R has multiple data types. Some common ones include strings/character data ("Things surrounded by quotes"), factor/categorical data, numeric/double, integer, date, and boolean.

# examples of making different data types
string <- "simple string"
print(string)
## [1] "simple string"
numeric <- 1
print(numeric)
## [1] 1
double <- 3.141592
print(double)
## [1] 3.141592
date <- Sys.Date()
print(date)
## [1] "2021-04-21"
factor_data <- as.factor(c("King", "Queen", "Rook", "King", "Queen", "Rook", "Pawn", "Pawn", "Bishop"))
table(factor_data)
## factor_data
## Bishop   King   Pawn  Queen   Rook 
##      1      2      2      2      2
bool <- 1 == 1
bool
## [1] TRUE
otherbool <- 1 > 5
otherbool
## [1] FALSE
literalbool <- TRUE
literalbool
## [1] TRUE
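
If you are ever unsure which type a value has, class() will tell you. The sketch below uses variable names of my own (whole, decimal, text, and so on) and shows one detail that is easy to miss: a plain number is stored as a double unless you add the L suffix.

```r
whole   <- 1L          # the L suffix asks for an integer
decimal <- 1           # without it, numbers are stored as doubles
text    <- "simple string"
flag    <- TRUE
today   <- Sys.Date()
piece   <- factor(c("King", "Queen", "King"))

class(whole)    # "integer"
class(decimal)  # "numeric"
class(text)     # "character"
class(flag)     # "logical"
class(today)    # "Date"
class(piece)    # "factor"

# is.*() functions test for a type; as.*() functions convert between types
is.numeric(whole)
as.numeric("3.14") + 1
```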

4.1 Vectors

Vectors can be thought of as 1-dimensional collections of data in which every element has the same data type. By 1-dimensional, I mean that a vector can either have 1 row and many columns, or one column and many rows (it really doesn’t make a difference). There are a bunch of ways to make vectors in R, but the two most common ones are the c() function for combining data elements, and the vector() function, which actually lets you make much more than just a vector. vector() also allows you to specify which data type you will store in the vector, and is good for making placeholder vectors of a specified size, which you can populate with for or while loops.

vec1 <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
vec1
##  [1]  1  2  3  4  5  6  7  8  9 10
# you can also make this vector much more simply with the seq() function
vec2 <- seq(1, 10, by = 1)
vec2
##  [1]  1  2  3  4  5  6  7  8  9 10
vec1 == vec2
##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
vector(mode = "integer", length = 50) # this creates a vector of 50 zeros that you can populate later
##  [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [39] 0 0 0 0 0 0 0 0 0 0 0 0
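
Here is a minimal sketch of the "create first, populate later" pattern. It uses a for loop, which is covered later in this document, and the names n and squares are just for illustration.

```r
n <- 5
squares <- vector(mode = "numeric", length = n)  # starts as five zeros

for (i in seq_len(n)) {
  squares[i] <- i^2   # overwrite position i with its square
}

squares
```

Creating the vector at its final size up front avoids growing it one element at a time inside the loop.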

4.1.1 Indexing Vectors

Indexing is relatively straightforward in R. Elements in a data structure are intuitively indexed from 1 to however many data elements are in that structure. For example, a vector holding the letters of the alphabet would be indexed with “a” having an index of 1, “b” with an index of 2, and “c” having an index of 3…

We can easily index elements with square brackets placed after the variable that a vector is stored in:

vec1 <- seq(500, 1, by = -15)
print(vec1)
##  [1] 500 485 470 455 440 425 410 395 380 365 350 335 320 305 290 275 260 245 230
## [20] 215 200 185 170 155 140 125 110  95  80  65  50  35  20   5
vec1[1] # the first element
## [1] 500
vec1[length(vec1)] # the last element
## [1] 5
vec1[c(1, length(vec1))] # get multiple elements by combining indices with c()
## [1] 500   5

You can also specify a range of elements that you want to index:

vec1[5:25]
##  [1] 440 425 410 395 380 365 350 335 320 305 290 275 260 245 230 215 200 185 170
## [20] 155 140
# or we can grab every other index
vec1[seq(5, 25, by = 2)]
##  [1] 440 410 380 350 320 290 260 230 200 170 140
# what does this seq function do?
seq(5, 25, by = 2)
##  [1]  5  7  9 11 13 15 17 19 21 23 25
# here, we see it makes a sequence of numbers from 5 to 25, counting by 2

# or we can remove elements by their index, using a negative sign
vec3 <- vec1[-c(15, 25, 30)]
vec1
##  [1] 500 485 470 455 440 425 410 395 380 365 350 335 320 305 290 275 260 245 230
## [20] 215 200 185 170 155 140 125 110  95  80  65  50  35  20   5
vec3
##  [1] 500 485 470 455 440 425 410 395 380 365 350 335 320 305 275 260 245 230 215
## [20] 200 185 170 155 125 110  95  80  50  35  20   5

We can also index using boolean values (TRUE/FALSE). A boolean value is produced by a comparison that tests two pieces of data against a condition: greater than >, less than <, greater than or equal to >=, less than or equal to <=, equal to ==, and not equal to !=. Functions such as is.na() also return booleans, in this case testing whether a value is missing (NA). The exclamation point (often called a “bang”) simply negates whatever follows it, so !is.na() tests whether a value is not missing. I’ll include some simple booleans below.

x <- NA
y <- 5
z <- 10

is.na(x)
## [1] TRUE
!is.na(x)
## [1] FALSE
y < z
## [1] TRUE
y > z
## [1] FALSE
y <= z
## [1] TRUE
y >= z
## [1] FALSE
y == z
## [1] FALSE
y != z
## [1] TRUE

The cool thing about these boolean TRUE/FALSE values is that we can use them to index our data with specific conditions! If we place a boolean comparison within [], R evaluates the comparison for every element of the data structure, keeps the elements where the result is TRUE, and drops the elements where it is FALSE.

First I’ll show you what happens when we apply a boolean comparison to a vector:

vec1 < 250
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [25]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
vec1
##  [1] 500 485 470 455 440 425 410 395 380 365 350 335 320 305 290 275 260 245 230
## [20] 215 200 185 170 155 140 125 110  95  80  65  50  35  20   5

Now, if we use vec1 < 250 as an indexer, we’ll get a vector returned that should only have the values of 245 to 5

vec1[vec1 < 250]
##  [1] 245 230 215 200 185 170 155 140 125 110  95  80  65  50  35  20   5

Indexing is extremely useful, as it opens up the ability for us to select subsets of our data, clean data, and view specific portions of our data that we are interested in. We can also split our data into groups for subgroup analyses. I’ll just show a couple other indexing operations

vec1[vec1 > 250]
##  [1] 500 485 470 455 440 425 410 395 380 365 350 335 320 305 290 275 260
vec1[vec1 != 245]
##  [1] 500 485 470 455 440 425 410 395 380 365 350 335 320 305 290 275 260 230 215
## [20] 200 185 170 155 140 125 110  95  80  65  50  35  20   5
vec1[is.na(vec1)]
## numeric(0)
# you can also change the value of an element in a vector using the assignment operator

vec1[vec1 == 245] <- 246 # a fun way to mess with friends
vec1
##  [1] 500 485 470 455 440 425 410 395 380 365 350 335 320 305 290 275 260 246 230
## [20] 215 200 185 170 155 140 125 110  95  80  65  50  35  20   5

If you are ever confused about what your index will return, simply run the boolean statement by itself, and view the TRUE/FALSE values to see if your index would keep what you want. As you can see, none of our vec1 values are NA, so our last index returned an empty numeric vector.
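
Conditions can also be combined. The sketch below uses a small made-up vector (scores) to show & for "and", | for "or", and the which() function, which returns the positions of the TRUE values instead of the values themselves.

```r
scores <- c(12, 47, 88, 5, 63, 91, 30)

# & means "and": keep values that satisfy both conditions
middle <- scores[scores > 20 & scores < 80]
middle

# | means "or": keep values that satisfy at least one condition
extremes <- scores[scores < 10 | scores > 90]
extremes

# which() returns the positions of the TRUE values rather than the values
high_positions <- which(scores > 60)
high_positions
```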

4.2 Lists

Lists are interesting as a data structure, because they can be used to collect a bunch of different data types into one storage unit. Whereas the vectors I’ve made have only contained numbers, lists can contain numbers, boolean values, other lists, data frames, plots and figures, functions, formulas, and pretty much anything else you could want. Let’s make a wonky list

mylist <- list(FALSE, 100, outcome ~ first + second + third, "random string for good measure", 1:500, mean(1:500))

# you can also give names to list elements. Here, I'll make the same list but each element will have a name

mylist_named <- list(bool = FALSE, hund = 100, formula = outcome ~ first + second + third, string = "random string for good measure", range = 1:500, mean = mean(1:500))

4.2.1 Indexing lists

Indexing lists is a little bit different than with simple vectors or data frames, which we will cover next. You can pull elements from lists using a double bracket operator [[]]. This double bracket explicitly takes a single object from a list and returns it to you. You can either use the numerical index of the object you want, or you can use the explicit name that the object is given in the list. I’ll show you two ways of getting the same thing

mylist_named[[2]]
## [1] 100
mylist_named[["hund"]]
## [1] 100

In this way, you can access elements of your list either using a numerical index, or the key that is provided for that list. There is also one more way to extract elements from a list, and that is with the convenient $ operator. This operator also lets you use autocomplete if you aren’t sure of the name of the thing that you want in your list

mylist_named$bool
## [1] FALSE
mylist_named$formula[[3]][[2]]
## first + second

Lists are extremely important to know how to work with, as many functions in R return output in the form of a list. Some examples are the summary function and the lm and glm functions for making models. You can also reassign portions of lists using indexing and the assignment operator

mylist_named$formula <- outcome ~ three + two + one
mylist_named$formula
## outcome ~ three + two + one
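
To see why this matters for model output, here is a small sketch that fits a linear model on the built-in ChickWeight data set (introduced later in this document) and indexes the result like any other list. The model itself is only an illustration, not an analysis recommendation.

```r
fit <- lm(weight ~ Time, data = ChickWeight)

is.list(fit)   # model objects are built on lists
names(fit)     # the named elements you can pull out

fit$coefficients                           # same $ indexing as any other list
time_slope <- fit[["coefficients"]][["Time"]]  # or double brackets, by name
time_slope
```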

4.3 Data frames

Data frames can be conceptualized as the equivalent of an Excel spreadsheet in R. If you want a more complex explanation, data frames are a list of column vectors, where each vector has a name (this has implications for indexing data frames later). Let’s stick with the first explanation for now, as it’s less confusing and most people have worked with Excel before. Creating a data frame is simple, but often you will get a data frame from a .csv file or off of the internet rather than making one yourself

random_data <- data.frame(letters = c("a", "b", "c", "d"),
                          numbers = c(1, 2, 3, 4),
                          bools = c(TRUE, FALSE, TRUE, FALSE), # spell out TRUE/FALSE; T and F can be overwritten
                          factor = as.factor(c("High", "Medium", "Low", "Just Right")),
                          date = as.Date(c("2020-01-05", "1000-05-03", "2021-01-01", "1980-05-07")))
random_data
##   letters numbers bools     factor       date
## 1       a       1  TRUE       High 2020-01-05
## 2       b       2 FALSE     Medium 1000-05-03
## 3       c       3  TRUE        Low 2021-01-01
## 4       d       4 FALSE Just Right 1980-05-07

If we consider each column of the data frame to be a single vector that is part of a ‘list’, then we can index columns the same way that we would index elements of a list, except that now, we can also index data frames using single brackets, with [row, column] specification. Note: if you want to extract all rows from a single column, or all columns from a single row, leave the respective row or column element blank.

random_data$bools
## [1]  TRUE FALSE  TRUE FALSE
random_data[['bools']]
## [1]  TRUE FALSE  TRUE FALSE
random_data[, 3] 
## [1]  TRUE FALSE  TRUE FALSE
# note as data frames are 2-dimensional structures, you have to specify what rows you want.
# We want the entire column of bools, which is the third column, but to specify the rows, we just leave
# the rows element blank. In data frames, the structure of indexing is [rows, columns].

If we want specific rows of our data frame, we can do that as well, either using boolean indexing or numbers

random_data[1:3, ]
##   letters numbers bools factor       date
## 1       a       1  TRUE   High 2020-01-05
## 2       b       2 FALSE Medium 1000-05-03
## 3       c       3  TRUE    Low 2021-01-01
random_data[random_data$letters == "a",] # select the row where letters is equal to a, and keep all columns
##   letters numbers bools factor       date
## 1       a       1  TRUE   High 2020-01-05

We can also do some chain indexing, which allows us to first select a column, and then select an element of that column that we’re interested in

random_data
##   letters numbers bools     factor       date
## 1       a       1  TRUE       High 2020-01-05
## 2       b       2 FALSE     Medium 1000-05-03
## 3       c       3  TRUE        Low 2021-01-01
## 4       d       4 FALSE Just Right 1980-05-07
random_data$letters[2]
## [1] "b"
random_data[["numbers"]][3]
## [1] 3
# intuitively, here we are selecting a vector, called numbers, and then getting the third element from it

# or we could keep both column and row indexes in the same index
random_data[2, 3]
## [1] FALSE

4.3.1 Making new variables

We can easily make new variables for data frames with the $ operator or with double brackets. The two assignment lines below do exactly the same thing

random_data$newvar <- rep(c("silly", "billy"), 2)
random_data[['newvar']] <- rep(c("silly", "billy"), 2)
random_data
##   letters numbers bools     factor       date newvar
## 1       a       1  TRUE       High 2020-01-05  silly
## 2       b       2 FALSE     Medium 1000-05-03  billy
## 3       c       3  TRUE        Low 2021-01-01  silly
## 4       d       4 FALSE Just Right 1980-05-07  billy

We can also change variable names using the names() function

names(random_data)
## [1] "letters" "numbers" "bools"   "factor"  "date"    "newvar"
names(random_data)[3]
## [1] "bools"
names(random_data)[3] <- "TrueFalse"
random_data
##   letters numbers TrueFalse     factor       date newvar
## 1       a       1      TRUE       High 2020-01-05  silly
## 2       b       2     FALSE     Medium 1000-05-03  billy
## 3       c       3      TRUE        Low 2021-01-01  silly
## 4       d       4     FALSE Just Right 1980-05-07  billy

5 Installing and Using Packages

Packages increase the range of functionality in your R scripts. However, if there is something that is extremely simple to do, don’t rely on an external package to do it. Oftentimes packages can change, functions become deprecated, and the world moves on. If you write the code for a function yourself, you don’t have to worry about things changing. That being said, R is very good about maintaining packages and keeping them functional. Just make sure you know what a function does before you use it for getting research results

You only need two functions for package management: install.packages() does what it says it does, and library() loads an installed package into your R session. An added bonus is the require() function, which also tries to load a package but behaves differently when the package is not installed: require() returns FALSE with a warning, whereas library() throws an error. If you’re sharing scripts with colleagues, you can include the following at the top of your code for each package that is required to run your script:

if (!require("packagename")) install.packages("packagename"); library(packagename)

Read left to right in plain English the above statement says: If this package does not exist, install this package, and then activate the package from the library, and if the package already exists in the library, activate it. This line of code will only install a package if it does not exist in the current library, and prevents unwanted updates if someone already has the package installed. That being said, if you want to update a package, just run install.packages() on that package again.
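
If a script needs several packages, the same idea can be wrapped in a loop. This is only a sketch: the vector needed is a hypothetical list of packages, and I've filled it with two packages that ship with R so that nothing actually gets installed when you run it. It uses requireNamespace(), a base R function that checks whether a package is installed without attaching it.

```r
# Hypothetical list of packages a script depends on; swap in your own.
# "stats" and "utils" ship with R, so nothing is installed in this sketch.
needed <- c("stats", "utils")

for (pkg in needed) {
  # requireNamespace() returns TRUE/FALSE without attaching the package
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # character.only = TRUE lets library() accept the name stored in pkg
  library(pkg, character.only = TRUE)
}
```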

6 Functions

I’ve already used a couple in this demo, but functions are simply chunks of code, represented by a function name, that take what are called arguments and do something with them.

The general format of a function is funcName(arg1, arg2, arg3...). Arguments are things supplied to the function by the user that either change the settings of the function or give the function something to operate on.

6.1 Using built-in functions

There are too many built-in functions in R for me to go over, so I’ll just point to a few base functions that I find helpful for looking at data. I recommend looking at the documentation of some of these functions to get an idea of how they work, and maybe you could implement some of them in your scripts
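
As a sketch, here are a few base functions that are commonly used for a first look at a data set, applied to the built-in ChickWeight data. The selection is illustrative rather than a definitive list.

```r
head(ChickWeight, 3)          # first three rows
dim(ChickWeight)              # number of rows and columns
str(ChickWeight)              # compact overview of each column's type
summary(ChickWeight$weight)   # minimum, quartiles, mean, maximum
table(ChickWeight$Diet)       # counts per category
```

Typing a question mark before a function name, as in ?summary, opens its documentation.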

6.2 Creating your own functions

If you can’t find a function that suits your needs, or you want to automate a repetitive piece of your data cleaning, you can simply make your own function. The syntax is simple, as I will show below

addfunc <- function(arg1, arg2){
  output <- arg1 + arg2
  return(output)
}

addfunc(1, 5)
## [1] 6
addfunc(-5, -20)
## [1] -25

The simple addfunc that I made above is about as basic as a function can get. You can also make functions that output strings, or really anything. To end a function and be sure you get the output that you want, simply use the return() function, and whatever you put inside the parentheses of return() will be returned as output. Here’s another example of a function that will say hi to you if you give it your name

SayHi <- function(name){
  greeting <- paste0("Hello ", name, ", nice to meet you")
  return(greeting)
}

SayHi(name = "Kees")
## [1] "Hello Kees, nice to meet you"

You can also store the output of a function as a variable with the following syntax:

StoreGreeting <- SayHi(name = "Kees")
print(StoreGreeting)
## [1] "Hello Kees, nice to meet you"

Some details:

  • in the SayHi function, name is a function argument. The function takes a value for the argument of name, and evaluates any values of name inside the function as that value.
  • A simple way to visualize this is to look at the body of the SayHi code. Anywhere you see the variable name, the string “Kees” will be substituted. This is a basic overview of how arguments work inside of functions.
  • Obviously, there are more complicated applications of functions, but once you have the basic syntax and understand how arguments and the return function work, you’re at a good starting point
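
Arguments can also be given default values, which are used whenever the caller leaves that argument out. A minimal sketch, where SayHiCustom and its greeting argument are hypothetical names I made up for this example:

```r
# greeting is a hypothetical second argument with a default value
SayHiCustom <- function(name, greeting = "Hello") {
  output <- paste0(greeting, " ", name, ", nice to meet you")
  return(output)
}

SayHiCustom("Kees")                      # uses the default greeting
SayHiCustom("Kees", greeting = "Howdy")  # overrides the default
```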

6.2.1 Control flow (AKA if, else if, else)

You can use what is called “control flow” to have your script make analysis decisions for you! This is an extremely powerful capability that’s available in nearly every programming language you will come across, so once you learn how to use if-else statements in R, you can apply the concept elsewhere.

If-else statements work as follows:

  • You provide an If statement with a statement that evaluates to TRUE or FALSE
    • If the statement evaluates to TRUE, the code underneath the if statement is executed
    • If the statement evaluates to FALSE, the code underneath the if statement is ignored
  • If the first if statement evaluates to false, the code proceeds to the next chunk, or an else if statement
    • The else if statement behaves the same as the first if statement, however this statement allows you to supply a different condition for your code to execute on. If this new statement evaluates to TRUE, the code below the else if statement is executed. If FALSE, R will ignore the code within that else if statement.
  • Finally, if all of your previous if and else if statements fail, you can conclude your else if chain with an else statement.
    • Intuitively, an else statement can be interpreted as such: “Else, if all of the above conditions that you provided for me evaluate to FALSE, run the code below.”
  • A final point: in an if...else chain, R runs the code for the first condition that evaluates to TRUE and skips every statement that comes after it in that chain. Because of this, the order of your if...else statements is important to consider when you are writing code.

if...else statements are great to include in functions to give your users multiple options for how to evaluate a chunk of code. Let’s adjust our greeting function so that it greets you based on the time of day you provide.

if (!require(lubridate)) install.packages('lubridate');library(lubridate)
## Loading required package: lubridate
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
SayHiTOD <- function(name){
  current_hour <- hour(Sys.time())
  # Sys.time() provides the current date-time, while hour() extracts the hour (0-23) from it
  
  if (current_hour < 12){
    greeting <- paste0("Good Morning ", name, ", I hope you have a great day!")
  } else if (current_hour >= 12 && current_hour < 17){
    greeting <- paste0("Good afternoon ", name, ", I hope you're having a great day!")
  } else {
    greeting <- paste0("Good evening ", name, ", I hope you've had a great day!")
  }
  return(greeting)
}

SayHiTOD("Kees")
## [1] "Good evening Kees, I hope you've had a great day!"

The above function still only takes a name as an argument, but it has code in the body that looks up the time of day (with the Sys.time() function). Then, based on the hour extracted from Sys.time(), R works through the if...else chain to decide whether it is morning, afternoon, or evening, and gives you a greeting specific to the time. You could alternatively ask the user to provide the hour as an argument, but you never know if the user might be lying to you…

Later on, when we get to dplyr, I’ll show you how you can make custom functions that take a file, data frame, or other type of data structure with similar elements, and automate the data cleaning process using a function with dplyr verbs. In the end, we’ll make a function that creates summary tables based on the provided data set.
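
For comparison, here is a sketch of the alternative mentioned above, where the hour is an argument. SayHiAtHour is a hypothetical name, and the default value uses base R's format() instead of lubridate, so the sketch has no package dependencies. Passing the hour explicitly also makes every branch easy to check.

```r
# Hypothetical variant: hour defaults to the current hour but can be supplied
SayHiAtHour <- function(name, hour = as.integer(format(Sys.time(), "%H"))) {
  if (hour < 12) {
    part <- "morning"
  } else if (hour < 17) {
    part <- "afternoon"
  } else {
    part <- "evening"
  }
  return(paste0("Good ", part, " ", name))
}

SayHiAtHour("Kees", hour = 9)
SayHiAtHour("Kees", hour = 14)
SayHiAtHour("Kees", hour = 20)
```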

6.2.2 for loops and while loops

For and while loops play off of indexing that I showed you before. However, loops allow you to “loop” through multiple elements with one chunk of code. For example, say you wanted to insert a time series variable into a data frame (that would count from 1 in row 1 to the end of your data frame). You could use a for loop to generate such a variable (there are other ways, but this is for the sake of demonstration).

Loop syntax is very simple, just like function syntax. You start with a for or while statement, followed by parentheses that name the index you are going to use and the range you are going to loop through (i in 1:50), and then a set of curly braces that contains the code you are going to iterate through. I’ll show you the most basic for loop that you can probably make

for (i in 1:10){
  cat("This is iteration ", i, " of ", 10, "\n")
  Sys.sleep(0.5)
}
## This is iteration  1  of  10 
## This is iteration  2  of  10 
## This is iteration  3  of  10 
## This is iteration  4  of  10 
## This is iteration  5  of  10 
## This is iteration  6  of  10 
## This is iteration  7  of  10 
## This is iteration  8  of  10 
## This is iteration  9  of  10 
## This is iteration  10  of  10

The above for loop just prints a message as to what iteration the loop is on. The way the for loop works is that the loop runs through all of the contained code within the curly braces, and then starts on the next iteration. All the while, i starts as 1, then is 2 in the next loop, then 3, then 4, etc…

As you could imagine, we can use that i, or that index to iterate through a dataframe, list, or anything that has an inherent index to it. Let’s look at the ChickWeight dataset contained within R, which has data on a chick’s weight, the time from birth, the Chick identification number, and the Diet that the chick was placed on. I’ll help us visualize the data with a quick plot

library(ggplot2) # ggplot() comes from the ggplot2 package
ggplot(data = ChickWeight, aes(x = Time, y = weight, group = Chick, color = Diet)) +
  geom_line() +
  labs(title = "Chickweight Over Time By Diet")

There are a couple ways that we could cycle through a data frame. One way to do this is with the seq_along() function. You provide a vector, list, or other one-dimensional data structure, and seq_along() generates a sequence of indices that matches it. For example, we could do seq_along(names(ChickWeight)), and our for loop would iterate from 1 to the number of columns in our data frame ChickWeight

for (col in seq_along(names(ChickWeight))){
  cat("Column ", col, " in ChickWeight is named ", names(ChickWeight)[col], "\n")
}
## Column  1  in ChickWeight is named  weight 
## Column  2  in ChickWeight is named  Time 
## Column  3  in ChickWeight is named  Chick 
## Column  4  in ChickWeight is named  Diet
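
One reason to prefer seq_along() over writing 1:length(x) yourself is how the two behave when the thing you are looping over is empty. A small sketch, with made-up names (empty, runs):

```r
empty <- character(0)   # a vector with no elements

1:length(empty)     # 1:0 counts DOWN, giving 1 0: two unwanted iterations
seq_along(empty)    # integer(0): zero iterations

runs <- 0
for (i in seq_along(empty)) {
  runs <- runs + 1
}
runs   # the loop body never ran
```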

We could also use the functions nrow and ncol to iterate through the number of columns or the number of rows in a data frame

for (row in 1:nrow(ChickWeight)){
  
  cat(ChickWeight$Diet[row], "; ")
  
}
## 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 1 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 2 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 3 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 
4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ; 4 ;

Above, you can see all of the values in the ChickWeight column Diet, separated by semicolons. Obviously there are more useful things that you can do with this, but this is just an example of how you could access elements of a data frame using a for loop. You could also change values of a data frame using a for loop just as easily. Let’s say that all of the Diet values in ChickWeight are wrong, and values of 1 should be 4, 2 should be 3, 3 should be 2, and 4 should be 1. We could use a for loop to correct this error

for (i in 1:nrow(ChickWeight)){
  if (ChickWeight$Diet[i] == 4){
    ChickWeight$Diet[i] <- 1
  } else if (ChickWeight$Diet[i] == 3){
    ChickWeight$Diet[i] <- 2
  } else if (ChickWeight$Diet[i] == 2){
    ChickWeight$Diet[i] <- 3
  } else if (ChickWeight$Diet[i] == 1){
    ChickWeight$Diet[i] <- 4
  }
}

head(ChickWeight$Diet)
## [1] 4 4 4 4 4 4
## Levels: 1 2 3 4
tail(ChickWeight$Diet)
## [1] 1 1 1 1 1 1
## Levels: 1 2 3 4

Be careful with operations like the one above. If you wrote four separate if statements instead of chaining them with else if, the loop would change a 4 into a 1, and then the last statement would change that 1 straight back into a 4. Chaining the statements with else if guarantees that at most one branch runs for each row.
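
As mentioned earlier, a loop is not the only way to do this. Here is a sketch of a vectorized alternative that swaps all of the values in one step. It works on a fresh copy of the data (chicks, a name of my own) pulled straight from the datasets package, and it relies on the fact that the swap 1↔4, 2↔3 is the same as computing 5 minus the old value.

```r
# Work on a fresh copy of the built-in data so the original is untouched
chicks <- datasets::ChickWeight

old_diet <- as.integer(as.character(chicks$Diet))  # factor labels -> numbers
new_diet <- 5 - old_diet                            # 1<->4 and 2<->3 in one step
chicks$Diet <- factor(new_diet, levels = 1:4)

head(chicks$Diet)
tail(chicks$Diet)
```

Because every value is computed from the old column at once, there is no risk of a value being changed twice.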

I will only cover while loops briefly, as I’ve barely found the need to use a while loop in my scripts. While loops are like for loops, except that a while loop keeps looping for as long as a provided condition is TRUE, or until a break statement is reached. Let me show you what I mean

i <- 0
while (i < 15) {
  print(i)
  i <- i + 1
}
## [1] 0
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10
## [1] 11
## [1] 12
## [1] 13
## [1] 14

Here you see that our index, i, was printed repeatedly until it reached 15, at which point the condition i < 15 became FALSE and the while loop terminated. The intuitive way to read a while loop is through the following steps

  • Read the condition next to the while statement. If the condition evaluates to TRUE, the code in the while loop will execute
  • Go back to the top of the while loop. Is the condition still TRUE? If so, the code will be run again. If FALSE, the while loop will terminate.

Another way that while loops differ from for loops is that we have to explicitly create an iterator. If we hadn’t defined i above the while loop, our while loop wouldn’t know what we’re talking about when we tell it to print i.

One other thing, within the while loop, we have to do something to change the iterator. If we didn’t have the i <- i + 1 statement in our loop, our loop would never end, which is super annoying and has happened to me multiple times. Don’t write infinite while loops…please…

Another way that you could stop a while loop is with a break statement, which causes your code to terminate the while loop, regardless of whether the while loop has met its condition or not. Sometimes you don’t know how long your while loop will run, so you can use some if...else logic and a break statement to cut off your while loop whenever a condition is met. Check out this example

i <- 1
while (TRUE){
  r <- runif(1, min = 0, max = 50000)
  
  if (r > 49950){
    print(r)
    cat("We got greater than 49950 in ", i, " iterations")
    break
  }
  
  i <- i + 1
}
## [1] 49986.98
## We got greater than 49950 in  358  iterations

runif produces a random number from a uniform distribution between min and max, break stops our while loop whenever we get a number greater than 49950, and the while (TRUE) syntax just tells our while loop to run indefinitely until we hit a break.
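
If you are worried about a loop like this running forever, one option is to give it an upper limit on the number of iterations. The sketch below is a variation on the example above under a couple of assumptions of my own: max_iter is a hypothetical safety cap, and set.seed(42) is an arbitrary seed that makes the random draws repeat from run to run.

```r
set.seed(42)          # arbitrary seed so repeated runs give the same draws
max_iter <- 100000    # hypothetical safety cap on the number of iterations
i <- 0
found <- FALSE

while (i < max_iter) {
  i <- i + 1
  r <- runif(1, min = 0, max = 50000)
  if (r > 49950) {
    found <- TRUE
    break
  }
}

cat("Stopped after", i, "iterations; found =", found, "\n")
```

The loop now ends either when the break is hit or when i reaches max_iter, whichever comes first.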

This brings up another point with if...else and while statements. If you provide just a number as the condition, any non-zero number is interpreted as TRUE, while 0 is interpreted as FALSE. This has some uses if you want to stop a loop on a zero, or just want to make sure that a value is non-zero before proceeding through the rest of your loop.
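
A quick sketch of how R converts numbers to TRUE/FALSE, followed by a countdown loop that uses a bare number as its condition (remaining is a made-up variable name):

```r
as.logical(0)     # FALSE
as.logical(1)     # TRUE
as.logical(-3)    # TRUE: negative numbers count as TRUE too
as.logical(0.5)   # TRUE: so do fractions

remaining <- 3
while (remaining) {          # stops when remaining reaches 0
  cat("remaining:", remaining, "\n")
  remaining <- remaining - 1
}
```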

6.3 Code Snippets for saving time

RStudio has what are called “snippets,” which are pre-typed sections of code syntax that you can pull up with auto-complete. Some common snippets write out if...else statements, functions, and loop syntax. All you have to do is type the first couple of letters of for, while, function, and other statements, hit the Tab key, and autocomplete will do the rest. Here’s what these snippets will give you:

for (variable in vector) {# for
  
}

while (condition) {# whil
  
}

name <- function(variables) {# func
  
}

if (condition) { # if
  
} else if (condition) { # ei
  
} else { # el
  
}

#lap
lapply(list, function)
#sap
sapply(list, function)
# lib
library(package)
# req
require(package)

# you can even make your own code snippets. For example, I made one called 'header'
# which creates a reproducible header that I can put at the top of every script

# here's my header snippet


## Script_Name.R             
## Programmer: Kees Schipper 
## Created: 2021-04-04       
## Last Updated: 2021-04-04
############################################################
## Clean up global environment
dev.off()
rm(list = ls())
############################################################
## Set up Packages
library(tidyverse)


############################################################
## Notes: Type notes for TODOs and usage of code here

############################################################
## Set your working directory
setwd()

# one more snippet I made that (1) checks if a package is in your library; (2) if the package is
# not in your library, it installs the package, and (3) activates the package in your lib
if (!require()) install.packages();library() # checkpkg

# This code is easy to find on the internet and is great if you're sharing code with
# others and they may or may not have some packages needed to run your scripts
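Filled in for a single package, the checkpkg idea might look like the sketch below. The function name load_or_install is made up for illustration, and requireNamespace() is swapped in for require() so that the check itself doesn’t attach the package. The demo call uses the stats package only because it ships with R, so nothing actually gets installed.

```r
load_or_install <- function(pkg) {
  # (1) check whether the package is available in the library
  if (!requireNamespace(pkg, quietly = TRUE)) {
    # (2) install it only if it is missing
    install.packages(pkg)
  }
  # (3) attach the package; character.only lets us pass the name as a string
  library(pkg, character.only = TRUE)
}

load_or_install("stats")
```

To cover several packages at once, you could loop over a character vector of names and call the function on each one.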

7 Base Plotting

7.1 The plot function

7.2 The boxplot function

7.3 The hist function

8 Package Spotlights

8.1 dplyr

dplyr is probably one of the most used R packages because of its ease of use in wrangling data. If you aren’t already using dplyr to do your data cleaning and management, I would highly recommend learning it. It’s much easier and more intuitive than using base R to clean large data sets.

In addition, dplyr functions act as verbs, showing that you are doing something to the data that you are working with. For example, you are either using select to select columns, filter to filter rows, mutate to create new variables, arrange to arrange your data by some order, group_by to group your analysis by a certain categorical variable, summarize to summarize your data into easy-to-digest summary tables, or rename to rename variables in your data set. We’ll go through how to use each function below, but first, we need some data, which we can get from the tidytuesdayR package.

## Loading required package: tidytuesdayR
## --- Compiling #TidyTuesday Information for 2021-01-19 ----
## --- There are 3 files available ---
## --- Starting Download ---
## 
##  Downloading file 1 of 3: `households.csv`
##  Downloading file 2 of 3: `crops.csv`
##  Downloading file 3 of 3: `gender.csv`
## --- Download complete ---
## Available datasets:
##  households 
##  crops 
##  gender 
##  

Before we start with dplyr functions (which I’ll refer to interchangeably as functions or verbs), we should know how to use what is called the “pipe” operator, or %>%. This operator allows us to connect multiple dplyr verbs together to have a streamlined process of data cleaning all done in one run of our code. This reduces the number of lines we have to write, and creates an easy-to-read list of verbs, showing exactly what we are doing to our data.

%>% is a function in and of itself. It takes what is on its left side and passes it as the first argument of the function that appears on its right side, or on the next line of code. An example is shown below:

# %>% or the pipe operator ------------------------------------------------
# f(x) is the same thing as x %>% f()

x <- c(1, 2, 3, 4, 5)
sum(x)
## [1] 15
x %>% sum()
## [1] 15

As we can see, x %>% sum() is equivalent to sum(x), because the pipe is “piping” x into the function. By default, the pipe passes what’s on its left into the first argument of the function on its right, but we can also place x, or some other variable, in a different argument. To do that, we put a dot, or ., in the function call on the right where we want our piped variable to go.

x %>% mean(na.rm = T, .)
## [1] 3
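The dot matters most when the piped value does not belong in the first argument. Here is a small sketch, assuming dplyr (or magrittr) is loaded so that %>% is available; the numbers are arbitrary:

```r
library(dplyr)

# chaining: read left to right instead of inside out
nested <- sum(sqrt(c(1, 4, 9)))
piped  <- c(1, 4, 9) %>% sqrt() %>% sum()

# placing the piped value somewhere other than the first argument:
# here 2 becomes the `by` argument, so this is seq(1, 10, by = 2)
odds <- 2 %>% seq(1, 10, by = .)

print(piped)
print(odds)
```

Both nested and piped hold the same value; the piped version just lists the steps in the order they happen.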

8.1.1 Select

Later on, I’ll show some examples where we use multiple lines to clean some data, but let’s start learning how to use the dplyr select statement.

select, as mentioned above, allows the user to select columns from a data frame. The user can select columns either by their name (not in quotes), their column index, or a mix of both. You can also select slices of columns with a structure of firstcol:lastcol. This structure will select the first column, and all columns up to the last column based on position. The same can be done with numerical indices. Let’s start by looking at the names of our House dataset

# select and helper functions ---------------------------------------------
# look at the column names in a data set
names(House)
## [1] "County"               "Population"           "NumberOfHouseholds"  
## [4] "AverageHouseholdSize"

Now let’s select some columns without the pipe

conflicted::conflict_prefer("select", winner = "dplyr")
## [conflicted] Will prefer dplyr::select over any other package
conflicted::conflict_prefer("filter", winner = "dplyr")
## [conflicted] Will prefer dplyr::filter over any other package
# basic use of the select function. Get columns for pop and households
select(.data = House, Population, NumberOfHouseholds)
## # A tibble: 48 x 2
##    Population NumberOfHouseholds
##         <dbl>              <dbl>
##  1   47213282           12143913
##  2    1190987             378422
##  3     858748             173176
##  4    1440958             298472
##  5     314710              68242
##  6     141909              37963
##  7     335747              96429
##  8     835482             141394
##  9     775302             127932
## 10     862079             125763
## # ... with 38 more rows

Now let’s make a similar selection using a pipe, this time choosing County and Population

# same idea as the above code, but with a pipe (and County instead of NumberOfHouseholds)
House %>% 
  select(County, Population)
## # A tibble: 48 x 2
##    County           Population
##    <chr>                 <dbl>
##  1 "Kenya   "         47213282
##  2 "Mombasa   "        1190987
##  3 "Kwale  "            858748
##  4 "Kilifi  "          1440958
##  5 "TanaRiver   "       314710
##  6 "Lamu   "            141909
##  7 "Taita/Taveta  "     335747
##  8 "Garissa   "         835482
##  9 "Wajir  "            775302
## 10 "Mandera   "         862079
## # ... with 38 more rows

Slicing columns

# viewing but not storing output. Using `:` to specify a range of columns
House %>%
  select(County:NumberOfHouseholds)
## # A tibble: 48 x 3
##    County           Population NumberOfHouseholds
##    <chr>                 <dbl>              <dbl>
##  1 "Kenya   "         47213282           12143913
##  2 "Mombasa   "        1190987             378422
##  3 "Kwale  "            858748             173176
##  4 "Kilifi  "          1440958             298472
##  5 "TanaRiver   "       314710              68242
##  6 "Lamu   "            141909              37963
##  7 "Taita/Taveta  "     335747              96429
##  8 "Garissa   "         835482             141394
##  9 "Wajir  "            775302             127932
## 10 "Mandera   "         862079             125763
## # ... with 38 more rows

Storing your values in a data frame with the <- operator

# storing your piped operations into a new data frame
House_cleaned <- House %>%
  select(County:NumberOfHouseholds)

You can negate columns by putting the - sign in front of your selection

# negative selection: read as "I want to remove columns 1 through 3 from my data set"
House %>%
  select(-c(1:3))
## # A tibble: 48 x 1
##    AverageHouseholdSize
##                   <dbl>
##  1                  3.9
##  2                  3.1
##  3                  5  
##  4                  4.8
##  5                  4.6
##  6                  3.7
##  7                  3.5
##  8                  5.9
##  9                  6.1
## 10                  6.9
## # ... with 38 more rows

You can also combine operations by using multiple pipes

# use pipes to string together multiple functions (rm avghousesize and rename population)
House %>%
  select(-AverageHouseholdSize) %>%
  rename(Pop = Population)
## # A tibble: 48 x 3
##    County                Pop NumberOfHouseholds
##    <chr>               <dbl>              <dbl>
##  1 "Kenya   "       47213282           12143913
##  2 "Mombasa   "      1190987             378422
##  3 "Kwale  "          858748             173176
##  4 "Kilifi  "        1440958             298472
##  5 "TanaRiver   "     314710              68242
##  6 "Lamu   "          141909              37963
##  7 "Taita/Taveta  "   335747              96429
##  8 "Garissa   "       835482             141394
##  9 "Wajir  "          775302             127932
## 10 "Mandera   "       862079             125763
## # ... with 38 more rows

You can also change the order of columns by selecting the columns you want to come first, and then selecting everything else

# reorder columns by specifying from left-to-right the order of columns that you want
House %>% select(Population, everything())
## # A tibble: 48 x 4
##    Population County           NumberOfHouseholds AverageHouseholdSize
##         <dbl> <chr>                         <dbl>                <dbl>
##  1   47213282 "Kenya   "                 12143913                  3.9
##  2    1190987 "Mombasa   "                 378422                  3.1
##  3     858748 "Kwale  "                    173176                  5  
##  4    1440958 "Kilifi  "                   298472                  4.8
##  5     314710 "TanaRiver   "                68242                  4.6
##  6     141909 "Lamu   "                     37963                  3.7
##  7     335747 "Taita/Taveta  "              96429                  3.5
##  8     835482 "Garissa   "                 141394                  5.9
##  9     775302 "Wajir  "                    127932                  6.1
## 10     862079 "Mandera   "                 125763                  6.9
## # ... with 38 more rows

The select function also has a bunch of “selection helpers” that help with more nuanced selection strategies. To read about them, just type ?select in the console, and pick documentation for the dplyr::select function. I’ll show an example below where you can select any_of the columns specified. If some columns do not exist in the data frame, no error is thrown, and the columns that do exist are selected

# selects all columns that match, but doesn't throw an error if one entry doesn't match
House %>% select(any_of(c("County", "notacolumn", "Population")))
## # A tibble: 48 x 2
##    County           Population
##    <chr>                 <dbl>
##  1 "Kenya   "         47213282
##  2 "Mombasa   "        1190987
##  3 "Kwale  "            858748
##  4 "Kilifi  "          1440958
##  5 "TanaRiver   "       314710
##  6 "Lamu   "            141909
##  7 "Taita/Taveta  "     335747
##  8 "Garissa   "         835482
##  9 "Wajir  "            775302
## 10 "Mandera   "         862079
## # ... with 38 more rows
# ?starts_with
# ?ends_with
# ?contains
# ?matches # regular expression matching...maybe a little too complicated to explain
# ?num_range
# ?everything
# ?last_col
# ?all_of
# ?any_of

# ?where # a lot of really cool use-cases with this function. where allows the user to select
#        # columns based on a function. Function must result in TRUE or FALSE outputs to be used
#
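To give a feel for two of the helpers listed above, here is a small sketch using where() and contains(). The tibble toy_house is a made-up stand-in with the same column names as House, filled with three of the rows shown earlier, so the chunk runs on its own:

```r
library(dplyr)

# toy stand-in for House (three rows copied from the output above)
toy_house <- tibble(
  County               = c("Mombasa", "Kwale", "Kilifi"),
  Population           = c(1190987, 858748, 1440958),
  NumberOfHouseholds   = c(378422, 173176, 298472),
  AverageHouseholdSize = c(3.1, 5, 4.8)
)

# where(): keep every column for which the function returns TRUE
num_cols <- toy_house %>% select(where(is.numeric))
names(num_cols)

# contains(): keep every column whose name contains the given text
house_cols <- toy_house %>% select(contains("Household"))
names(house_cols)
```

Helpers can be mixed with ordinary column names in the same select call, for example select(County, where(is.numeric)).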

8.1.2 Filter: Choosing what rows to include

Filtering is the operation you can use to decide which rows you are going to keep in your data. Filtering works by providing boolean conditions to the filter() function, which then keeps the rows in your data that return TRUE, and filters out the rows that return FALSE.

You can also string together conditions with the and (&) and or (|) operators. With |, as long as a row satisfies at least one of the provided conditions, it will be kept. With the & operator, a row must satisfy all conditions provided to return a TRUE. A filtering example with & is shown below, where we keep rows whose population is greater than a certain number and whose county name starts with K

# boolean/relational operators:
# ?`<` #less than
# ?`>` #greater than
# ?`<=` #less than or equal to
# ?`>=` #greater than or equal to
# ?`==` #is equal to
# ?`!=` #is not equal to
# ?between # if you want a faster way of specifying a range
# ?`|` # element-wise (vectorized) "or"
# ?`&` # element-wise (vectorized) "and"
# ?`||` # Or for control flow
# ?`&&` # And for control flow

House_filter <- House %>% # select rows with population greater than a value, and county starting with K
  dplyr::filter(Population > 605415 & grepl(pattern = "^K", County))
head(House_filter, 5)
## # A tibble: 5 x 4
##   County        Population NumberOfHouseholds AverageHouseholdSize
##   <chr>              <dbl>              <dbl>                <dbl>
## 1 "Kenya   "      47213282           12143913                  3.9
## 2 "Kwale  "         858748             173176                  5  
## 3 "Kilifi  "       1440958             298472                  4.8
## 4 "Kitui   "       1130134             262942                  4.3
## 5 "Kirinyaga  "     605630             204188                  3
# can also compare variables in a dataset to set values 
MedPop <- median(House_filter$Population)

House_filter %>%
  filter(Population < MedPop)
## # A tibble: 5 x 4
##   County        Population NumberOfHouseholds AverageHouseholdSize
##   <chr>              <dbl>              <dbl>                <dbl>
## 1 "Kwale  "         858748             173176                  5  
## 2 "Kitui   "       1130134             262942                  4.3
## 3 "Kirinyaga  "     605630             204188                  3  
## 4 "Kajiado    "    1107296             316179                  3.5
## 5 "Kericho   "      896863             206036                  4.4
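# for || and &&, evaluation stops as soon as the answer is known: && returns FALSE immediately
# if its first condition is FALSE, and || returns TRUE immediately if its first condition is TRUE.
# They are meant for single values (as in if statements). & and | check every element of both sides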
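The difference between these operators is easier to see side by side. Here is a small sketch on a made-up tibble (toy, with four of the counties shown earlier), so the chunk runs on its own:

```r
library(dplyr)

toy <- tibble(
  County     = c("Kwale", "Lamu", "Kilifi", "Wajir"),
  Population = c(858748, 141909, 1440958, 775302)
)

# `|`: keep rows that satisfy at least one condition
either <- toy %>% filter(Population < 200000 | grepl("^K", County))

# `&`: keep rows that satisfy every condition
both <- toy %>% filter(Population > 800000 & grepl("^K", County))

# between() is shorthand for Population >= 500000 & Population <= 1000000
mid_size <- toy %>% filter(between(Population, 500000, 1000000))

# `&` and `|` compare element by element ...
c(TRUE, FALSE) & c(TRUE, TRUE)

# ... while `&&` and `||` take single values and stop as soon as the answer
# is known, so the stop() call on the right is never reached
FALSE && stop("never evaluated")
TRUE || stop("never evaluated")
```

Inside filter() you almost always want & and |, because each condition is a whole column of TRUE and FALSE values rather than a single value.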

8.1.3 Mutate: Creating new variables

mutate is probably at the same level of ease as making new variables in base R. You simply pipe your data frame into the mutate function, put the name of the new variable on the left of an equal sign, and calculate the value of that new column on the right. You can also assign strings, factors, or change data types.

You can also use mutate to change one variable to a different data type. One common way this is applied is in changing character data into factor or categorical data with the factor function. You can also change character data to numeric with as.numeric, or vice versa with as.character. Below, we’ll make some simple new variables.

# mutate: for creating new variables --------------------------------------

# create new variables based on old ones
House %>%
  mutate(NumbHouse = Population/AverageHouseholdSize)
## # A tibble: 48 x 5
##    County           Population NumberOfHouseholds AverageHouseholdSize NumbHouse
##    <chr>                 <dbl>              <dbl>                <dbl>     <dbl>
##  1 "Kenya   "         47213282           12143913                  3.9 12105970.
##  2 "Mombasa   "        1190987             378422                  3.1   384189.
##  3 "Kwale  "            858748             173176                  5     171750.
##  4 "Kilifi  "          1440958             298472                  4.8   300200.
##  5 "TanaRiver   "       314710              68242                  4.6    68415.
##  6 "Lamu   "            141909              37963                  3.7    38354.
##  7 "Taita/Taveta  "     335747              96429                  3.5    95928.
##  8 "Garissa   "         835482             141394                  5.9   141607.
##  9 "Wajir  "            775302             127932                  6.1   127099.
## 10 "Mandera   "         862079             125763                  6.9   124939.
## # ... with 38 more rows
# base R version: less neat IMO
House$NumbHouse = House$Population/House$AverageHouseholdSize

Changing County to a factor variable

House %>%
  mutate(County = factor(County)) %>%
  str()
## spec_tbl_df [48 x 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ County              : Factor w/ 48 levels "Baringo   ","Bomet  ",..: 12 29 20 15 41 22 40 7 47 25 ...
##  $ Population          : num [1:48] 47213282 1190987 858748 1440958 314710 ...
##  $ NumberOfHouseholds  : num [1:48] 12143913 378422 173176 298472 68242 ...
##  $ AverageHouseholdSize: num [1:48] 3.9 3.1 5 4.8 4.6 3.7 3.5 5.9 6.1 6.9 ...
##  $ NumbHouse           : num [1:48] 12105970 384189 171750 300200 68415 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   County = col_character(),
##   ..   Population = col_double(),
##   ..   NumberOfHouseholds = col_double(),
##   ..   AverageHouseholdSize = col_double()
##   .. )

We could also change average household size to a factor by rounding, and doing the factor transformation

House_fac <- House %>%
  mutate(Householdfac = factor(round(AverageHouseholdSize, 0)))

str(House_fac)
## spec_tbl_df [48 x 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ County              : chr [1:48] "Kenya   " "Mombasa   " "Kwale  " "Kilifi  " ...
##  $ Population          : num [1:48] 47213282 1190987 858748 1440958 314710 ...
##  $ NumberOfHouseholds  : num [1:48] 12143913 378422 173176 298472 68242 ...
##  $ AverageHouseholdSize: num [1:48] 3.9 3.1 5 4.8 4.6 3.7 3.5 5.9 6.1 6.9 ...
##  $ NumbHouse           : num [1:48] 12105970 384189 171750 300200 68415 ...
##  $ Householdfac        : Factor w/ 5 levels "3","4","5","6",..: 2 1 3 3 3 2 2 4 4 5 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   County = col_character(),
##   ..   Population = col_double(),
##   ..   NumberOfHouseholds = col_double(),
##   ..   AverageHouseholdSize = col_double()
##   .. )
table(House_fac$Householdfac)
## 
##  3  4  5  6  7 
##  8 24 11  4  1
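The other type conversions mentioned at the start of this section (as.numeric and as.character) work the same way inside mutate. Here is a minimal sketch on a made-up tibble; the column names are only for illustration:

```r
library(dplyr)

toy <- tibble(
  id   = c("1", "2", "3"),   # numbers stored as text
  size = c(3.1, 5, 4.8)
)

converted <- toy %>%
  mutate(id_num   = as.numeric(id),        # character -> numeric
         size_chr = as.character(size),    # numeric -> character
         size_fac = factor(round(size)))   # numeric -> rounded factor

str(converted)
```

Giving the converted column a new name, as done here, keeps the original around so you can check that the conversion did what you expected.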

8.1.4 Summarise: creating aggregate summaries

The summarise function takes the data you provide, and collapses that data across multiple aggregation functions like sum, sd, median, etc… This proves extremely useful for making summary tables with a wide range of flexibility in what sorts of statistics you decide to include in your final table. Below, we’ll remove all non-numeric variables from our data frame, and calculate some basic summary statistics.

Summarise is similar to mutate, except that summarise goes one step further and aggregates, (usually) reducing the number of rows in the output compared to the input. Mutate maintains the number of rows.
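Here is a small sketch of that difference on a made-up tibble (toy, with arbitrary numbers). It also shows a standard error computed the conventional way, as the standard deviation divided by the square root of the sample size:

```r
library(dplyr)

toy <- tibble(Population = c(100, 200, 300, 400))

# mutate adds a column and keeps all four rows
with_share <- toy %>%
  mutate(share = Population / sum(Population))

# summarise collapses the four rows into one row of statistics
collapsed <- toy %>%
  summarise(N      = n(),
            sd_pop = sd(Population),
            se_pop = sd(Population) / sqrt(n()))  # standard error: sd / sqrt(n)

nrow(with_share)
nrow(collapsed)
```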

library(moments) # for skewness and kurtosis


# drop the first row, which holds the national total for Kenya rather than a county
House_new <- House[2:nrow(House), ]

House_new %>%
  summarise(sd_pop = sd(Population),
            kurt_pop = kurtosis(Population),
            med_house = median(NumbHouse),
            avg_hsize = mean(AverageHouseholdSize),
            stderr_pop = sd(Population)/n(), # note: a standard error is usually sd/sqrt(n); this divides by n
            N = n(),
            nMiss = sum(is.na(Population)),
            Ncomp = sum(!is.na(Population)))
## # A tibble: 1 x 8
##    sd_pop kurt_pop med_house avg_hsize stderr_pop     N nMiss Ncomp
##     <dbl>    <dbl>     <dbl>     <dbl>      <dbl> <int> <int> <int>
## 1 685504.     13.3   201877.      4.23     14585.    47     0    47
# ?complete.cases()
# ?min()
# ?max()
# ?quantile()

As you can see, our summary statistics don’t have to be of only one variable. Each new variable we create in the summarise function becomes a new column in our output. But what if we don’t want to summarise across our entire data frame? What if we want to compare groups of observations against each other?

8.1.5 group_by: for making group-specific summaries

For group_by, we’re going to look at the starwars dataset (which ships with dplyr), as it has some interesting variables to group by:

group_data <- starwars
head(starwars)
## # A tibble: 6 x 14
##   name     height  mass hair_color  skin_color eye_color birth_year sex   gender
##   <chr>     <int> <dbl> <chr>       <chr>      <chr>          <dbl> <chr> <chr> 
## 1 Luke Sk~    172    77 blond       fair       blue            19   male  mascu~
## 2 C-3PO       167    75 <NA>        gold       yellow         112   none  mascu~
## 3 R2-D2        96    32 <NA>        white, bl~ red             33   none  mascu~
## 4 Darth V~    202   136 none        white      yellow          41.9 male  mascu~
## 5 Leia Or~    150    49 brown       light      brown           19   fema~ femin~
## 6 Owen La~    178   120 brown, grey light      blue            52   male  mascu~
## # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

group_by doesn’t change the data itself; it primes the data for analysis. It tells R that within a data frame, there are separate groups that we want to analyse separately, while keeping them together in the same data frame. If we group by eye_color, and calculate the mean mass of individuals in the data set, R will return the mean mass of individuals belonging to each eye color. We can also group by multiple columns, like eye_color and skin_color, which would give us the mean mass for each eye and skin color combination:

# group_by creates implicit groups in your data to perform group-wise operations
# remember: group_by doesn't perform any operations on our data, but the groupings
# are applied to operations later on 

group_data %>%
  group_by(eye_color) %>%
  summarise(grp_mass = mean(mass, na.rm = T),
            count = n())
## # A tibble: 15 x 3
##    eye_color     grp_mass count
##    <chr>            <dbl> <int>
##  1 black             76.3    10
##  2 blue              86.5    19
##  3 blue-gray         77       1
##  4 brown             66.1    21
##  5 dark             NaN       1
##  6 gold             NaN       1
##  7 green, yellow    159       1
##  8 hazel             66       3
##  9 orange           282.      8
## 10 pink             NaN       1
## 11 red               81.4     5
## 12 red, blue        NaN       1
## 13 unknown           31.5     3
## 14 white             48       1
## 15 yellow            81.1    11
group_data %>%
  group_by(eye_color, skin_color) %>%
  summarise(grp_mass = mean(mass, na.rm = T),
            count = n())
## `summarise()` has grouped output by 'eye_color'. You can override using the `.groups` argument.
## # A tibble: 53 x 4
## # Groups:   eye_color [15]
##    eye_color skin_color       grp_mass count
##    <chr>     <chr>               <dbl> <int>
##  1 black     green                80.5     2
##  2 black     grey                 78.7     4
##  3 black     none                NaN       1
##  4 black     orange               80       1
##  5 black     red, blue, white     57       1
##  6 black     white, blue         NaN       1
##  7 blue      blue                NaN       1
##  8 blue      brown               136       1
##  9 blue      dark                 50       1
## 10 blue      fair                 90      10
## # ... with 43 more rows

We can also see from the number of rows in each table that there are 15 unique eye colors, and 53 unique eye and skin color combinations. The n() function counts the number of observations in each category.
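The message printed above about the `.groups` argument appears because, after summarising by two variables, the result is still grouped by the first one. Here is a minimal sketch on a made-up tibble (toy, with invented values) showing how `.groups = "drop"` returns an ungrouped table instead:

```r
library(dplyr)

toy <- tibble(
  eye  = c("blue", "blue", "brown", "brown", "brown"),
  skin = c("fair", "fair", "fair", "dark", "dark"),
  mass = c(77, NA, 49, 80, 90)
)

# one grouping variable: one row per eye color
by_eye <- toy %>%
  group_by(eye) %>%
  summarise(grp_mass = mean(mass, na.rm = TRUE),
            count = n())

# two grouping variables: one row per eye/skin combination,
# with the grouping dropped from the result
by_both <- toy %>%
  group_by(eye, skin) %>%
  summarise(grp_mass = mean(mass, na.rm = TRUE),
            count = n(),
            .groups = "drop")

by_eye
by_both
```

Dropping the groups (or calling ungroup() afterwards) is a good habit, since leftover groupings silently change how later mutate and summarise calls behave.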


That was some basic coverage of the dplyr verbs that you can use to mold your data however you see fit. Remember that best practice when adjusting your data is to make it so that each column is a unique variable, and each row is a unique observation. Rows should not repeat in a data frame. Repeated rows take up extra storage and can mess with analysis results.

As a quick recap, use select to choose your columns by numerical index, name, or select helper functions. filter can help you select rows with boolean operations. rename (which we didn’t cover a lot) simply renames your variables. mutate allows you to create new variables. summarise allows you to aggregate your data into convenient summary tables. group_by lets you take summarise one step further by allowing you to group your summaries by other variables in your data frame.

These dplyr functions are probably the best data wrangling tools that you have in R, and mastery of these tools will probably get you 80-90% of the way through any data cleaning and analysis process. I keep hearing people say that 80% of the work done in data analysis is data cleaning…

8.1.6 Practical Data wrangling example with string data

Often, when given a dataset with many names, the names may be non-standardized, include extra white spaces, or have misspellings that make working with such data difficult. I’ll take us through a practical example of how to clean such data with the rest of the Kenya data that we read in earlier. We should have three datasets named Crops, gender, and House, which should all have one observation for each county in Kenya, all containing different information such as number of men and women in a county, yield of specific crops, Population, average household size, and the number of households.

These three datasets have a primary key, which is the variable County (named SubCounty in Crops, which we need to fix…). Let’s start by making sure that all county columns have the same name with identical capitalization:

Crops_rename <- Crops %>%
  rename(County = SubCounty)
names(Crops_rename)
##  [1] "County"       "Farming"      "Tea"          "Coffee"       "Avocado"     
##  [6] "Citrus"       "Mango"        "Coconut"      "Macadamia"    "Cashew Nut"  
## [11] "Khat (Miraa)"

Now we can use some utility functions to see how our county columns look. We can see if there are differences in county names between data frames

unique(Crops_rename$County)
##  [1] "KENYA"           "MOMBASA"         "KWALE"           "KILIFI"         
##  [5] "TANA RIVER"      "LAMU"            "TAITA/TAVETA"    "GARISSA"        
##  [9] "WAJIR"           "MANDERA"         "MARSABIT"        "ISIOLO"         
## [13] "MERU"            "THARAKA-NITHI"   "EMBU"            "KITUI"          
## [17] "MACHAKOS"        "MAKUENI"         "NYANDARUA"       "NYERI"          
## [21] "KIRINYAGA"       "MURANG'A"        "KIAMBU"          "TURKANA"        
## [25] "WEST POKOT"      "SAMBURU"         "TRANS NZOIA"     "UASIN GISHU"    
## [29] "ELGEYO/MARAKWET" "NANDI"           "BARINGO"         "LAIKIPIA"       
## [33] "NAKURU"          "NAROK"           "KAJIADO"         "KERICHO"        
## [37] "BOMET"           "KAKAMEGA"        "VIHIGA"          "BUNGOMA"        
## [41] "BUSIA"           "SIAYA"           "KISUMU"          "HOMA BAY"       
## [45] "MIGORI"          "KISII"           "NYAMIRA"         "NAIROBI"
unique(gender$County)
##  [1] "Total"           "Mombasa"         "Kwale"           "Kilifi"         
##  [5] "Tana River"      "Lamu"            "Taita/Taveta"    "Garissa"        
##  [9] "Wajir"           "Mandera"         "Marsabit"        "Isiolo"         
## [13] "Meru"            "Tharaka-Nithi"   "Embu"            "Kitui"          
## [17] "Machakos"        "Makueni"         "Nyandarua"       "Nyeri"          
## [21] "Kirinyaga"       "Murang'a"        "Kiambu"          "Turkana"        
## [25] "West Pokot"      "Samburu"         "Trans Nzoia"     "Uasin Gishu"    
## [29] "Elgeyo/Marakwet" "Nandi"           "Baringo"         "Laikipia"       
## [33] "Nakuru"          "Narok"           "Kajiado"         "Kericho"        
## [37] "Bomet"           "Kakamega"        "Vihiga"          "Bungoma"        
## [41] "Busia"           "Siaya"           "Kisumu"          "Homa Bay"       
## [45] "Migori"          "Kisii"           "Nyamira"         "Nairobi City"
unique(House$County)
##  [1] "Kenya   "           "Mombasa   "         "Kwale  "           
##  [4] "Kilifi  "           "TanaRiver   "       "Lamu   "           
##  [7] "Taita/Taveta  "     "Garissa   "         "Wajir  "           
## [10] "Mandera   "         "Marsabit   "        "Isiolo   "         
## [13] "Meru   "            "Tharaka-Nithi   "   "Embu  "            
## [16] "Kitui   "           "Machakos  "         "Makueni   "        
## [19] "Nyandarua  "        "Nyeri  "            "Kirinyaga  "       
## [22] "Murang'a  "         "Kiambu   "          "Turkana   "        
## [25] "WestPokot   "       "Samburu  "          "TransNzoia   "     
## [28] "UasinGishu   "      "Elgeyo/Marakwet   " "Nandi  "           
## [31] "Baringo   "         "Laikipia   "        "Nakuru   "         
## [34] "Narok  "            "Kajiado    "        "Kericho   "        
## [37] "Bomet  "            "Kakamega   "        "Vihiga   "         
## [40] "Bungoma   "         "Busia   "           "Siaya  "           
## [43] "Kisumu   "          "HomaBay   "         "Migori   "         
## [46] "Kisii   "           "Nyamira  "          "NairobiCity   "

As we can see, our county names are somewhat messy. The Crops df has all-caps county names, the gender df looks all right, and the House df has trailing whitespace on every county entry. There are two simple things we can do to clean this up. First, we can use the function tolower() to convert the county names to lowercase, and second, we can use the function trimws() to trim the white space off the ends of the county names in the House data frame.

Because this is a procedure we want to pass to all three data frames, I’ll show you how to functionalize this data cleaning portion so that we can reduce the process to one line of code

# how we would clean the data the hard-coding way
House_lower <- House %>%
  mutate(ClCounty = tolower(County),
         ClCounty = trimws(ClCounty))
House_lower$ClCounty
##  [1] "kenya"           "mombasa"         "kwale"           "kilifi"         
##  [5] "tanariver"       "lamu"            "taita/taveta"    "garissa"        
##  [9] "wajir"           "mandera"         "marsabit"        "isiolo"         
## [13] "meru"            "tharaka-nithi"   "embu"            "kitui"          
## [17] "machakos"        "makueni"         "nyandarua"       "nyeri"          
## [21] "kirinyaga"       "murang'a"        "kiambu"          "turkana"        
## [25] "westpokot"       "samburu"         "transnzoia"      "uasingishu"     
## [29] "elgeyo/marakwet" "nandi"           "baringo"         "laikipia"       
## [33] "nakuru"          "narok"           "kajiado"         "kericho"        
## [37] "bomet"           "kakamega"        "vihiga"          "bungoma"        
## [41] "busia"           "siaya"           "kisumu"          "homabay"        
## [45] "migori"          "kisii"           "nyamira"         "nairobicity"

Because we want to have variable names be consistent across data frames, we can keep ClCounty as the consistent new variable for Clean County names. We then just need to give an option for the user to specify an input data frame, and a variable that we are going to (1) change to lower case and (2) trim the leading and trailing whitespace from. We’ll also arrange our rows in alphabetical order by county name to make comparisons easier. Here’s the function that we could construct

TrimAndLower <- function(df, var) {
  output <- df %>%
    mutate(ClCounty = tolower({{var}}),
           ClCounty = trimws(ClCounty)) %>%
    arrange(ClCounty)
  return(output)
}

The double curly braces ({{ }}) are the tidyverse’s way of saying “hold off on evaluating the var argument until you get to the braces, then evaluate it as a variable within the data frame.” If we didn’t have the double curly braces in our function, R would try to evaluate the var argument as an ordinary object and throw an error, because it wouldn’t know what we are referring to (the variable by itself isn’t defined in our global environment; it is only defined in the context of our data frame).
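To see the difference the braces make, here is a small sketch with two made-up helper functions (no_braces and with_braces) applied to a toy tibble. It assumes there is no object called County in your global environment; try() is used so the error from the first function doesn’t stop the script:

```r
library(dplyr)

toy <- tibble(County = c("  Kwale ", "LAMU"))

# without the braces, `var` is looked up as a regular object and fails
no_braces <- function(df, var) {
  df %>% mutate(ClCounty = tolower(var))
}

# with the braces, `var` is evaluated as a column of df
with_braces <- function(df, var) {
  df %>% mutate(ClCounty = trimws(tolower({{ var }})))
}

attempt <- try(no_braces(toy, County), silent = TRUE)
failed  <- inherits(attempt, "try-error")
print(failed)

cleaned <- with_braces(toy, County)
cleaned$ClCounty
```

The same pattern works for any dplyr verb inside a function: wrap the argument that names a column in double curly braces wherever it is used.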

Here is our function applied to each of the three data frames:

House_lower <- TrimAndLower(House, County)
Crops_lower <- TrimAndLower(Crops_rename, County)
gender_lower <- TrimAndLower(gender, County)
House_lower$ClCounty
##  [1] "baringo"         "bomet"           "bungoma"         "busia"          
##  [5] "elgeyo/marakwet" "embu"            "garissa"         "homabay"        
##  [9] "isiolo"          "kajiado"         "kakamega"        "kenya"          
## [13] "kericho"         "kiambu"          "kilifi"          "kirinyaga"      
## [17] "kisii"           "kisumu"          "kitui"           "kwale"          
## [21] "laikipia"        "lamu"            "machakos"        "makueni"        
## [25] "mandera"         "marsabit"        "meru"            "migori"         
## [29] "mombasa"         "murang'a"        "nairobicity"     "nakuru"         
## [33] "nandi"           "narok"           "nyamira"         "nyandarua"      
## [37] "nyeri"           "samburu"         "siaya"           "taita/taveta"   
## [41] "tanariver"       "tharaka-nithi"   "transnzoia"      "turkana"        
## [45] "uasingishu"      "vihiga"          "wajir"           "westpokot"
Crops_lower$ClCounty
##  [1] "baringo"         "bomet"           "bungoma"         "busia"          
##  [5] "elgeyo/marakwet" "embu"            "garissa"         "homa bay"       
##  [9] "isiolo"          "kajiado"         "kakamega"        "kenya"          
## [13] "kericho"         "kiambu"          "kilifi"          "kirinyaga"      
## [17] "kisii"           "kisumu"          "kitui"           "kwale"          
## [21] "laikipia"        "lamu"            "machakos"        "makueni"        
## [25] "mandera"         "marsabit"        "meru"            "migori"         
## [29] "mombasa"         "murang'a"        "nairobi"         "nakuru"         
## [33] "nandi"           "narok"           "nyamira"         "nyandarua"      
## [37] "nyeri"           "samburu"         "siaya"           "taita/taveta"   
## [41] "tana river"      "tharaka-nithi"   "trans nzoia"     "turkana"        
## [45] "uasin gishu"     "vihiga"          "wajir"           "west pokot"
gender_lower$ClCounty
##  [1] "baringo"         "bomet"           "bungoma"         "busia"          
##  [5] "elgeyo/marakwet" "embu"            "garissa"         "homa bay"       
##  [9] "isiolo"          "kajiado"         "kakamega"        "kericho"        
## [13] "kiambu"          "kilifi"          "kirinyaga"       "kisii"          
## [17] "kisumu"          "kitui"           "kwale"           "laikipia"       
## [21] "lamu"            "machakos"        "makueni"         "mandera"        
## [25] "marsabit"        "meru"            "migori"          "mombasa"        
## [29] "murang'a"        "nairobi city"    "nakuru"          "nandi"          
## [33] "narok"           "nyamira"         "nyandarua"       "nyeri"          
## [37] "samburu"         "siaya"           "taita/taveta"    "tana river"     
## [41] "tharaka-nithi"   "total"           "trans nzoia"     "turkana"        
## [45] "uasin gishu"     "vihiga"          "wajir"           "west pokot"

Now, each data frame has a column called ClCounty that is trimmed of whitespace and all lowercase. Next we need a way to check whether the names differ between data frames. We can do this with the setdiff() function, which returns the values of a vector x that are not in a vector y. Because setdiff(x, y) only works in one direction, we have to run it both ways to also see the values of y that are not in x. We can wrap this in a small function:

finddiffs <- function(x, y) {
  return(c(setdiff(x, y), setdiff(y, x)))
}

finddiffs(House_lower$ClCounty, Crops_lower$ClCounty)
##  [1] "homabay"     "nairobicity" "tanariver"   "transnzoia"  "uasingishu" 
##  [6] "westpokot"   "homa bay"    "nairobi"     "tana river"  "trans nzoia"
## [11] "uasin gishu" "west pokot"
finddiffs(House_lower$ClCounty, gender_lower$ClCounty)
##  [1] "homabay"      "kenya"        "nairobicity"  "tanariver"    "transnzoia"  
##  [6] "uasingishu"   "westpokot"    "homa bay"     "nairobi city" "tana river"  
## [11] "total"        "trans nzoia"  "uasin gishu"  "west pokot"
finddiffs(gender_lower$ClCounty, Crops_lower$ClCounty)
## [1] "nairobi city" "total"        "kenya"        "nairobi"
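As a small sketch of why the comparison has to run in both directions, here are two made-up vectors standing in for ClCounty columns. The vectors x and y and the helper finddiffs_labeled are hypothetical and not part of the census data or the original code.

```r
# Hypothetical toy vectors standing in for two ClCounty columns
x <- c("homabay", "kisumu", "nairobicity")
y <- c("homa bay", "kisumu", "nairobi")

setdiff(x, y)  # values in x that are missing from y
## [1] "homabay"     "nairobicity"
setdiff(y, x)  # values in y that are missing from x
## [1] "homa bay" "nairobi"

# A labeled variant that keeps track of which side each name came from
finddiffs_labeled <- function(x, y) {
  list(only_in_x = setdiff(x, y), only_in_y = setdiff(y, x))
}
diffs <- finddiffs_labeled(x, y)
diffs$only_in_y
## [1] "homa bay" "nairobi"
```

Returning a named list is just one possible design. It is handy when you need to know which data frame to fix, whereas the combined vector is enough when you only care whether any differences exist.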

Here we can see some common threads that are giving us problems in creating matching names. Certain counties are written as two words in one data set and as one word in another. Nairobi also appears three ways: as nairobi city, as nairobicity, and as plain nairobi. There are a few things we can do to solve these problems: (1) we can remove all remaining whitespace within the strings (between words); (2) we can remove any instance of the word “city” from our ClCounty column; and (3) we need to convert any instance of “kenya” to “total”, as rows labeled “Kenya” hold the combined statistics for the whole country, which the gender data labels “total”. Let’s do this the non-function way first:

House_clname <- House_lower %>%
  mutate(ClCounty = str_replace_all(ClCounty, pattern = c(" " = "", "city" = "")),
         ClCounty = ifelse(grepl("kenya", ClCounty, ignore.case = TRUE), "total", ClCounty))

House_clname$ClCounty
##  [1] "baringo"         "bomet"           "bungoma"         "busia"          
##  [5] "elgeyo/marakwet" "embu"            "garissa"         "homabay"        
##  [9] "isiolo"          "kajiado"         "kakamega"        "total"          
## [13] "kericho"         "kiambu"          "kilifi"          "kirinyaga"      
## [17] "kisii"           "kisumu"          "kitui"           "kwale"          
## [21] "laikipia"        "lamu"            "machakos"        "makueni"        
## [25] "mandera"         "marsabit"        "meru"            "migori"         
## [29] "mombasa"         "murang'a"        "nairobi"         "nakuru"         
## [33] "nandi"           "narok"           "nyamira"         "nyandarua"      
## [37] "nyeri"           "samburu"         "siaya"           "taita/taveta"   
## [41] "tanariver"       "tharaka-nithi"   "transnzoia"      "turkana"        
## [45] "uasingishu"      "vihiga"          "wajir"           "westpokot"
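To isolate what the named vector inside str_replace_all() is doing, here is a minimal sketch on a few made-up strings. The vector counties is hypothetical; the replacement pairs are the same ones used above.

```r
library(stringr)

# Hypothetical county names after lowercasing and trimming
counties <- c("nairobi city", "nairobicity", "homa bay", "kenya")

# In a named vector, each name is a pattern and each value is its replacement;
# the pairs are applied one after another to every string
no_space <- str_replace_all(counties, c(" " = "", "city" = ""))
no_space
## [1] "nairobi" "nairobi" "homabay" "kenya"

# Relabel the country-level row so it matches the "total" label
cleaned <- ifelse(no_space == "kenya", "total", no_space)
cleaned
## [1] "nairobi" "nairobi" "homabay" "total"
```

Keep in mind that the "city" pattern matches anywhere in a string, so this shortcut is only safe when no other name legitimately contains those letters.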

We can now turn this into a function so that we can apply it to all three data frames:

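rmSpaceCity <- function(df, var){
  output <- df %>%
    # remove spaces and the word "city" from the chosen column
    mutate(ClCounty = str_replace_all({{var}}, pattern = c(" " = "", "city" = "")),
           # work on the cleaned ClCounty here so the first step isn't overwritten
           ClCounty = ifelse(grepl("kenya", ClCounty, ignore.case = TRUE), "total", ClCounty))
  return(output)
}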

Crops_rmspace <- rmSpaceCity(Crops_lower, ClCounty)
gender_rmspace <- rmSpaceCity(gender_lower, ClCounty)
House_rmspace <- rmSpaceCity(House_lower, ClCounty)

Let’s check if there are any remaining differences in our county names

finddiffs(Crops_rmspace$ClCounty, gender_rmspace$ClCounty)
## character(0)
finddiffs(Crops_rmspace$ClCounty, House_rmspace$ClCounty)
## character(0)

We did it! Our finddiffs function didn’t return anything, so we know that the values of ClCounty are identical across all of our datasets. Now we can merge everything into one combined dataset for analysis. We can do this using some of dplyr’s nifty join functions, namely left_join(), right_join(), inner_join(), and full_join(). A left join keeps all rows of the left data frame regardless of whether they have a match in the right data frame; a right join does the reverse. An inner join keeps only rows whose keys match in both data frames, while a full join keeps all rows regardless of whether they have a match in the other data frame. Here, our key, ClCounty, is identical across all data frames, so it doesn’t really matter what type of join we use. I generally default to left_join() in this case. The cool thing about the tidyverse is that we can chain these joins together as opposed to having to write them out as separate chunks of code. I would only recommend doing this if you are absolutely sure that your key values are identical. Otherwise, if something goes wrong, it is difficult to find where it happened.
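To make the difference between the four joins concrete, here is a small sketch on two made-up tables whose keys only partly overlap. The tables pop and crop and their values are hypothetical and not taken from the census data.

```r
library(dplyr)

# Hypothetical toy tables sharing the key column ClCounty
pop  <- tibble(ClCounty = c("embu", "kisumu", "lamu"), Population = c(10, 20, 30))
crop <- tibble(ClCounty = c("kisumu", "lamu", "meru"), Tea = c(1, 2, 3))

left  <- left_join(pop, crop, by = "ClCounty")   # every row of pop
right <- right_join(pop, crop, by = "ClCounty")  # every row of crop
inner <- inner_join(pop, crop, by = "ClCounty")  # only kisumu and lamu
full  <- full_join(pop, crop, by = "ClCounty")   # embu, kisumu, lamu and meru

c(left = nrow(left), right = nrow(right), inner = nrow(inner), full = nrow(full))
##  left right inner  full 
##     3     3     2     4

# Rows without a match get NA in the columns that come from the other table
left$Tea
## [1] NA  1  2

# anti_join() lists the rows of pop that have no partner in crop
unmatched <- anti_join(pop, crop, by = "ClCounty")
unmatched$ClCounty
## [1] "embu"
```

Running anti_join() before a chain of joins is one way to catch mismatched keys early, instead of discovering unexpected NA values afterwards.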

Kenya_df <- House_rmspace %>%
  left_join(gender_rmspace, by = "ClCounty") %>%
  left_join(Crops_rmspace, by = "ClCounty")

head(Kenya_df)
## # A tibble: 6 x 22
##   County.x    Population NumberOfHousehol~ AverageHousehold~ NumbHouse ClCounty 
##   <chr>            <dbl>             <dbl>             <dbl>     <dbl> <chr>    
## 1 "Baringo  ~     662760            142518               4.7   141013. baringo  
## 2 "Bomet  "       873023            187641               4.7   185750. bomet    
## 3 "Bungoma  ~    1663898            358796               4.6   361717. bungoma  
## 4 "Busia   "      886856            198152               4.5   197079. busia    
## 5 "Elgeyo/Ma~     453403             99861               4.5   100756. elgeyo/m~
## 6 "Embu  "        604769            182743               3.3   183263. embu     
## # ... with 16 more variables: County.y <chr>, Male <dbl>, Female <dbl>,
## #   Intersex <dbl>, Total <dbl>, County <chr>, Farming <dbl>, Tea <dbl>,
## #   Coffee <dbl>, Avocado <dbl>, Citrus <dbl>, Mango <dbl>, Coconut <dbl>,
## #   Macadamia <dbl>, Cashew Nut <dbl>, Khat (Miraa) <dbl>

Now we have several leftover County columns (County.x, County.y, and County). dplyr adds the .x and .y suffixes when you join data frames that share a column name that isn’t used as a key. We can get rid of all of them quickly with the select() function and the select helper starts_with():

Kenya_short <- Kenya_df %>%
  select(-starts_with("County", ignore.case = TRUE)) %>%
  select(ClCounty, everything())
# put ClCounty as the first column, since it's an identifier
head(Kenya_short)
## # A tibble: 6 x 19
##   ClCounty  Population NumberOfHouseho~ AverageHousehol~ NumbHouse   Male Female
##   <chr>          <dbl>            <dbl>            <dbl>     <dbl>  <dbl>  <dbl>
## 1 baringo       662760           142518              4.7   141013. 336322 330428
## 2 bomet         873023           187641              4.7   185750. 434287 441379
## 3 bungoma      1663898           358796              4.6   361717. 812146 858389
## 4 busia         886856           198152              4.5   197079. 426252 467401
## 5 elgeyo/m~     453403            99861              4.5   100756. 227317 227151
## 6 embu          604769           182743              3.3   183263. 304208 304367
## # ... with 12 more variables: Intersex <dbl>, Total <dbl>, Farming <dbl>,
## #   Tea <dbl>, Coffee <dbl>, Avocado <dbl>, Citrus <dbl>, Mango <dbl>,
## #   Coconut <dbl>, Macadamia <dbl>, Cashew Nut <dbl>, Khat (Miraa) <dbl>
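Here is a minimal sketch of where the suffixes come from and how starts_with() removes them, using two made-up tables. The tables a and b are hypothetical.

```r
library(dplyr)

# Hypothetical toy tables: both carry a non-key column called County
a <- tibble(ClCounty = c("embu", "lamu"), County = c("Embu  ", "Lamu "), Male = c(5, 7))
b <- tibble(ClCounty = c("embu", "lamu"), County = c("EMBU", "LAMU"), Tea = c(1, 2))

joined <- left_join(a, b, by = "ClCounty")
names(joined)
## [1] "ClCounty" "County.x" "Male"     "County.y" "Tea"

# starts_with("County") matches County.x and County.y but not ClCounty,
# because ClCounty starts with "Cl"
tidy <- joined %>% select(-starts_with("County"))
names(tidy)
## [1] "ClCounty" "Male"     "Tea"
```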

Now we have a relatively clean dataset that we can work with! Let’s make a couple of variables that might be of interest to us: the percentage of each county’s population by gender, and each crop column as a percentage of the Farming column.

# creating a set of percent gender variables:
Kenya_pct <- Kenya_short %>%
  mutate(pct_male = (Male/Population)*100,
         pct_female = (Female/Population)*100,
         pct_intersex = (Intersex/Population)*100) %>%
  mutate(across(.cols = Tea:`Khat (Miraa)`, ~(./Farming)*100, .names = "pct_{.col}"))

Kenya_pct
## # A tibble: 48 x 31
##    ClCounty Population NumberOfHouseho~ AverageHousehol~ NumbHouse   Male Female
##    <chr>         <dbl>            <dbl>            <dbl>     <dbl>  <dbl>  <dbl>
##  1 baringo      662760           142518              4.7   141013. 336322 330428
##  2 bomet        873023           187641              4.7   185750. 434287 441379
##  3 bungoma     1663898           358796              4.6   361717. 812146 858389
##  4 busia        886856           198152              4.5   197079. 426252 467401
##  5 elgeyo/~     453403            99861              4.5   100756. 227317 227151
##  6 embu         604769           182743              3.3   183263. 304208 304367
##  7 garissa      835482           141394              5.9   141607. 458975 382344
##  8 homabay     1125823           262036              4.3   261819. 539560 592367
##  9 isiolo       267997            58072              4.6    58260. 139510 128483
## 10 kajiado     1107296           316179              3.5   316370. 557098 560704
## # ... with 38 more rows, and 24 more variables: Intersex <dbl>, Total <dbl>,
## #   Farming <dbl>, Tea <dbl>, Coffee <dbl>, Avocado <dbl>, Citrus <dbl>,
## #   Mango <dbl>, Coconut <dbl>, Macadamia <dbl>, Cashew Nut <dbl>,
## #   Khat (Miraa) <dbl>, pct_male <dbl>, pct_female <dbl>, pct_intersex <dbl>,
## #   pct_Tea <dbl>, pct_Coffee <dbl>, pct_Avocado <dbl>, pct_Citrus <dbl>,
## #   pct_Mango <dbl>, pct_Coconut <dbl>, pct_Macadamia <dbl>,
## #   pct_Cashew Nut <dbl>, pct_Khat (Miraa) <dbl>
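The across() call above does a lot in one line, so here is a small sketch of the same pattern on a made-up tibble. The tibble toy and its numbers are hypothetical.

```r
library(dplyr)

# Hypothetical toy data: a total column plus two component columns
toy <- tibble(Farming = c(100, 200), Tea = c(10, 50), Coffee = c(25, 100))

# across() applies the same formula to every column from Tea to Coffee;
# .names builds each new column name from the original one
toy_pct <- toy %>%
  mutate(across(.cols = Tea:Coffee, .fns = ~ (.x / Farming) * 100, .names = "pct_{.col}"))

names(toy_pct)
## [1] "Farming"    "Tea"        "Coffee"     "pct_Tea"    "pct_Coffee"
toy_pct$pct_Tea
## [1] 10 25
toy_pct$pct_Coffee
## [1] 25 50
```

Because .names is set, the original Tea and Coffee columns are kept and the percentages are added as new columns, which matches what happens in Kenya_pct.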

The crop percentages will not add up to 100%, as there are some miscalculations in the data, where the components of farmed crops add up to be more than the total population of farmers.

If we wanted to plot gender data, it would be helpful to change the format from wide to long, so that the gender categories are in one column and their values are in another. We can do this using tidyr’s pivot_longer() function, which takes as input (1) the columns you want to pivot; (2) the name of the column that will hold the old column names; and (3) the name of the column that will hold the values. Here’s an example below:

Kenya_long <- Kenya_pct %>%
  pivot_longer(cols = c(Male:Intersex),
               names_to = "Gender",
               values_to = "Number") %>%
  filter(ClCounty != 'total')
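As a smaller illustration of the same reshaping, here is a sketch on a made-up wide table with two counties. The tibble wide and its numbers are hypothetical.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide data: one row per county, one column per gender
wide <- tibble(ClCounty = c("embu", "lamu"), Male = c(5, 7), Female = c(6, 8))

# Each county now gets one row per gender
long <- wide %>%
  pivot_longer(cols = Male:Female, names_to = "Gender", values_to = "Number")

long$ClCounty
## [1] "embu" "embu" "lamu" "lamu"
long$Gender
## [1] "Male"   "Female" "Male"   "Female"
long$Number
## [1] 5 6 7 8
```

The long table has one row per county and gender, which is the shape ggplot needs to map Gender to a fill color.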

And now we can visualize differences in population by gender across counties!

Kenya_long %>%
  ggplot(aes(x = reorder(ClCounty, Population), y = Number, fill = Gender)) +
  geom_col(position = "stack") +
  scale_y_continuous(expand = c(0, 0)) +
  coord_flip() +
  theme(axis.text.y = element_text(size = 6)) +
  labs(title = "Population by Gender in Kenya Counties")

8.2 readr, haven, readxl, and other functions for reading data

if (!require(readr)) install.packages("readr");library(readr)
if (!require(haven)) install.packages('haven');library(haven)
## Loading required package: haven
if (!require(readxl)) install.packages('readxl');library(readxl)
## Loading required package: readxl

9 Exploratory Data Analysis-Visualizing your data

9.1 Base R

9.2 ggplot2

# TODO: clean up ggplot2 section so that it reads like the rest of the notebook
# clear environment and plotting window -----------------------------------

rm(list = ls())
dev.off()
## null device 
##           1
# set working directory to the Data Days output folder, or wherever your data is stored
setwd("C:/Users/keess/Box/ASA Share/ASA 2020/ASA Meetings/Data Days Spring 2021/Data Days output")

library(tidyverse)

# The Diamonds dataset ---------------------------------
set.seed(123)
df <- slice_sample(diamonds, n = 5000, replace = TRUE)
names(df)
##  [1] "carat"   "cut"     "color"   "clarity" "depth"   "table"   "price"  
##  [8] "x"       "y"       "z"
# simple scatterplots of diamond price by carat
df %>%
  ggplot(aes(x = carat, y = price)) +
  geom_point()

# maybe we can make points transparent so we see where they concentrate
p <- df %>%
  ggplot(aes(x = carat, y = price)) +
  geom_point(alpha = 0.1)

# there seems to be a trend. We could look at the trend with a geom_smooth
p_smooth <- p + geom_smooth(method = 'loess', se = FALSE)
p_smooth
## `geom_smooth()` using formula 'y ~ x'
# in addition, we can break this plot down into different 'facets' by another variable
p_smooth + facet_wrap(~cut)
## `geom_smooth()` using formula 'y ~ x'
# here we split our plot up by different cuts. I've also shown the layered nature of ggplot,
# where graphing elements are added one on top of the other. If we were to write that
# whole plot out using code (and adding labels) it would look like this:
df %>%
  ggplot(aes(x = carat, y = price)) +
  geom_point(alpha = 0.1) +
  geom_smooth(method = 'loess', se = FALSE) +
  facet_wrap(~cut) +
  labs(title = "Price vs. Carat, Faceted by Diamond Cut",
       x = 'Carat', y = 'Price', subtitle = "ggplot tutorial for EDA",
       caption = "By Kees Schipper")
## `geom_smooth()` using formula 'y ~ x'
# Now let's do the same layered process for a boxplot ---------------------

# start by simply looking at the distribution of price in our entire dataset
df %>%
  ggplot(aes(y = price)) +
  geom_boxplot()

# clearly not very informative. Let's split price into cut
df %>%
  ggplot(aes(x = cut, y = price)) +
  geom_boxplot()

# this gives us a little more information, but we can improve on our boxplot
# aesthetics with notches and outlier colors
df %>%
  ggplot(aes(x = cut, y = price)) +
  geom_boxplot(notch = TRUE, outlier.color = 'blue', fill = 'grey50', color = 'blue')

# we can increase the dimensions across which we visualize our data by adding
# facets again.

df %>%
  ggplot(aes(x = cut, y = price)) +
  geom_boxplot(outlier.color = 'blue') +
  facet_wrap(~clarity)

# here our notches aren't especially useful, and we also see that labels tend to overlap
# we could try flipping our axes
df %>%
  ggplot(aes(x = cut, y = price)) +
  geom_boxplot(outlier.color = 'blue') +
  facet_wrap(~clarity) +
  coord_flip()

# this is still a little confusing but at least the labels don't overlap
# finally, we can add labels. And let's keep the coordinates as they were originally,
# but we can use the ggplot theme to tilt our x labels
df %>%
  ggplot(aes(x = cut, y = price)) +
  geom_boxplot(outlier.color = 'blue') +
  facet_wrap(~clarity) +
  theme(axis.text.x = element_text(hjust = 0.5, angle = 45)) +
  labs(title = "Price of Different Diamond Cuts, Stratified by Clarity")

# I almost forgot! We can also add summary statistics to our boxplot with stat_summary()
# (geom = 'point' draws just the mean; the default pointrange geom also expects a ymin and ymax)
# we can also look at adding jitter so that we can see individual data points in our plots
df %>%
  ggplot(aes(x = cut, y = price)) +
  geom_boxplot(outlier.color = 'blue') +
  stat_summary(fun = 'mean', geom = 'point', col = 'red', size = 1) +
  geom_jitter(color = 'brown', alpha = 0.10) +
  facet_wrap(~clarity) +
  theme(axis.text.x = element_text(hjust = 1, angle = 45)) +
  labs(title = "Price of Different Diamond Cuts, Stratified by Clarity")

# Now we can look at histograms and density plots -------------------------

# histograms are somewhat easy to implement in ggplot
df %>%
  ggplot(aes(x = price)) +
  geom_histogram(bins = 200) # specify number of bins to increase detail in distribution

# like with other plots, we can color, facet, etc... and we can also stack
# different categories using the position argument
df %>%
  ggplot(aes(x = price, fill = cut)) +
  geom_histogram(bins = 200, position = 'stack')

# this visualization isn't super useful, as we can't see the proportions super
# well. We can change this by using the position = 'fill' option
df %>%
  ggplot(aes(x = price, fill = cut)) +
  geom_histogram(bins = 200, position = 'fill')
## Warning: Removed 5 rows containing missing values (geom_bar).
# now we can see the relative proportions of what diamonds make up which price
# however, the jitteriness of this plot still isn't fantastic. This is where
# density plots come in
df %>%
  ggplot(aes(x = price, fill = cut)) +
  geom_density(alpha = 0.25, color = "black")

# here we can visualize the smoothed shape of multiple distributions. However,
# it's hard to see the different distributions when they're one on top of the other
df %>%
  ggplot(aes(x = price, fill = cut)) +
  geom_density(position = 'stack', color = "black") +
  facet_wrap(~cut)

# now we can see the proportions of different diamonds making up different prices
# but there's still a better way if we're only interested in proportions
df %>%
  ggplot(aes(x = price, fill = cut)) +
  geom_density(position = 'fill', color = "black")

# Interestingly, all diamond cuts have a dip in the price range of around $4000-5000.
# Maybe this is because jewelers tend to start charging at around $5000, and don't bother
# with mid-range prices.

# A geom halfway between the histogram and the density plot is 'geom_freqpoly', which
# makes a frequency polygon with jagged edges matching a given number of bins
df %>%
  ggplot(aes(x = price, color = cut)) +
  geom_freqpoly(bins = 50, size = 1) +
  scale_color_hue()


# some other geoms that may be worth trying -------------------------------

# geom hex
df %>%
  ggplot(aes(x = carat, y = price)) +
  geom_hex() +
  scale_fill_viridis_c()

# geom_density_2d
df %>%
  ggplot(aes(x = carat, y = price)) +
  geom_density_2d()

# geom_density_2d_filled
df %>%
  ggplot(aes(x = carat, y = price)) +
  geom_density_2d_filled() +
  geom_density_2d(color = "black") +
  ylim(0, 6000) +
  xlim(0, 1)
## Warning: Removed 1686 rows containing non-finite values (stat_density2d_filled).
## Warning: Removed 1686 rows containing non-finite values (stat_density2d).
# violin plots
df %>%
  ggplot(aes(x = cut, y = price, fill = cut)) +
  geom_violin() +
  facet_wrap(~clarity)

# ?geom_hline
# ?geom_vline

# for all of the geoms, you can type geom_<TAB> and scroll through the autocomplete
# to see what is available
# In my opinion, the R package "viridis" provides some of the best color schemes that you could
# use for (1) visibility in both color and grayscale, and (2) colorblind friendliness.
# check out the documentation on the viridis color palettes' usage below:
# https://ggplot2.tidyverse.org/reference/scale_viridis.html

# read in data ------------------------------------------------------------
# load in COVID data for all counties across the united states
load('COVID_master_20200218.RData')

# select a county or state that you're interested in:
MACovid <- COVID_master %>%
  filter(state == "Massachusetts" & Population > 0) %>%
  group_by(date) %>%
  summarise(across(where(is.numeric), ~sum(.x, na.rm = TRUE))) %>%
  select(date, cases, deaths:daily_deaths_100k)
# we now have data for the entire state of Massachusetts. If you want to work with
# a different state, change Massachusetts to whatever state that interests you!


# simple visualizations of the data ---------------------------------------
# let's examine a scatterplot of cases in Massachusetts compared to deaths

# this is about as simple a ggplot as you can get. 
ggplot(data = MACovid, aes(x = daily_cases, y = daily_deaths)) +
  geom_point()

# your ggplot specifies global options, so you don't have to specify aesthetics
# in your added geometries. This has benefits and drawbacks...

# We can also store this as a ggplot object and add to it
p <- ggplot(data = MACovid, aes(x = daily_cases, y = daily_deaths))
# p for plot
p # gives us an empty plot with scales corresponding to our data


# adding to a ggplot object -----------------------------------------------

# let's add some points, and a line showing the relationship of our points:
p +
  geom_point() +
  geom_smooth(se = TRUE, span = 0.2, method = "loess", color = 'red') +
  labs(title = "COVID Deaths vs. Cases")
## `geom_smooth()` using formula 'y ~ x'
# for fewer than 1,000 observations, geom_smooth defaults to a loess smoother
# (we also request it explicitly above). Note: you don't actually need to plot
# your points to get a smoother. R knows what data you are using, so the smoother
# only depends on your data, not the geoms that you have added beforehand

# methods of geom_smooth include linear models (lm), loess, and generalized
# additive models (gam)


# boxplots ----------------------------------------------------------------
# I'm interested in differences of case and death counts by weekday. Let's get a
# variable for that

MAWeekday <- MACovid %>%
  mutate(weekday = factor(weekdays(date)))

summary(MAWeekday$weekday) # now each observation corresponds to a weekday
##    Friday    Monday  Saturday    Sunday  Thursday   Tuesday Wednesday 
##        56        56        56        56        56        56        57
# let's plot case counts by weekday
MAWeekday %>%
  ggplot(aes(x = weekday, y = daily_cases)) +
  geom_boxplot() +
  stat_summary(fun = 'mean', color = 'red', geom = 'point')
# now we have boxplots of cases by weekday. We can also add a mean value indicator
# with stat_summary()

# If we like color, we could color weekdays differently
MAWeekday %>%
  ggplot(aes(x = weekday, y = daily_cases, fill = weekday)) +
  geom_boxplot(outlier.alpha = 0) +
  geom_jitter(color = "blue", alpha = 0.2) +
  stat_summary(fun = 'mean', color = 'red', geom = 'point')

# note the difference between specifying color inside the aes() function vs. directly in the
# geometry. aes() maps a data value to an aesthetic of the plot. Therefore, each
# different value of weekday returns a different color. If you specify color or fill
# outside of aes(), that single color is applied to all of the fills and/or colors for
# that geometry.


# some other interesting geometries ---------------------------------------
# let's accumulate data by week first
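# create a count of weeks: rows 1-7 are week 1, rows 8-14 are week 2, and so on
week <- 1
daycount <- 0
MAWeekday$week <- 0
for (i in seq_len(nrow(MAWeekday))){
  MAWeekday$week[i] <- week
  daycount <- daycount + 1

  if (daycount == 7){
    # reset to 0 (not 1) so that every week contains exactly 7 days
    daycount <- 0
    week <- week + 1
  }
}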

MAWeek <- MAWeekday %>%
  select(cases, deaths, daily_cases:week) %>%
  group_by(week) %>%
  summarise(across(where(is.numeric), sum))
# line graphs
MAWeek %>%
  ggplot(aes(x = week)) +
  geom_line(aes(y = daily_cases), size = 1, col = "blue") +
  geom_line(aes(y = daily_deaths), size = 1, col = "red") +
  labs(title = "Comparing Cases and Deaths in Massachusetts")

# geom histogram
MACovid %>%
  ggplot(aes(x = daily_cases)) +
  geom_histogram(bins = 70)

# can even stack histograms 
MACovid %>%
  pivot_longer(cols = c('daily_cases', 'daily_deaths'),
               names_to = 'outcome',
               values_to = 'values') %>%
  ggplot() +
  geom_histogram(aes(x = values, fill = outcome), position = "stack", bins = 30, col = "blue") 

# not the greatest visualization. Let's use a frequency polygon to compare distributions
MACovid %>%
  pivot_longer(cols = c('daily_cases', 'daily_deaths'),
               names_to = 'outcome',
               values_to = 'values') %>%
  ggplot() +
  geom_freqpoly(aes(x = values, color = outcome), size = 1, bins = 50) +
  labs(title = "Now we can see both distributions")


# can even use a density plot as an overlay -------------------------------

MAWeekday %>%
  ggplot() +
  geom_density(aes(x = daily_cases, group = weekday, color = weekday), size = 1) +
  labs(title = "Daily case count density functions by weekday--Hard to see")


# Let's facet! ------------------------------------------------------------
# faceting lets us separate our plots by a factor or character value
MAWeekday %>%
  ggplot() +
  geom_density(aes(x = daily_cases, group = weekday, color = weekday), size = 1) +
  facet_wrap(~weekday) +
  labs(title = "Faceted case counts by weekday--much clearer!!")

# let's look at other variables
MAWeekday %>%
  ggplot() +
  geom_density(aes(x = parks, group = weekday, color = weekday), size = 1) +
  facet_wrap(~weekday) +
  labs(title = "Faceted density of parks by weekday") +
  theme(legend.position = "none") # after faceting, we don't need a legend anymore!


# Let's look at a heatmap/density map -------------------------------------

MAWeekday %>%
  ggplot(aes(x = daily_cases, y = daily_deaths)) +
  geom_density_2d_filled() +
  geom_density_2d(color = 'black') +
  geom_point(color = 'black', alpha = 0.35) +
  scale_fill_brewer(palette = "Reds") +
  ylim(0, 100) +
  xlim(0, 3000) +
  facet_wrap(~weekday)
## Warning: Removed 87 rows containing non-finite values (stat_density2d_filled).
## Warning: Removed 87 rows containing non-finite values (stat_density2d).
## Warning: Removed 87 rows containing missing values (geom_point).
# Final note on faceting --------------------------------------------------
# if you want to facet your data by a category, your data needs to be in the
# following format (but not in this order):
# x column | y column | faceting column
# because of this, often times you will have to pivot your data for faceting.
# Let's do this to look at the relationships between deaths, cases, and mobility
# over time

facet_data <- MAWeekday %>%
  pivot_longer(cols = c('retail_rec':'daily_deaths'),
               names_to = 'vars',
               values_to = 'values') %>%
  select(date, vars, values, weekday)

facet_data %>%
  ggplot(aes(x = date, y = values)) +
  geom_line(aes(color = vars))

# see...not the most visible data. Much easier to facet

facet_data %>%
  ggplot(aes(x = date, y = values)) +
  geom_line(aes(color = vars)) +
  facet_wrap(~vars)

# you can also facet by multiple variables. You just need to use `facet_grid` and
# specify the column facet, and the row facet
facet_data %>%
  ggplot(aes(x = date, y = values)) +
  geom_line(aes(color = vars)) +
  facet_grid(cols = vars(vars), rows = vars(weekday), scales = 'free_y') +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))




# Further resources -------------------------------------------------------

# for more on visualizations in R, I always point people to the R graph gallery:
# https://www.r-graph-gallery.com/
# if you have an idea for a visualization, there is likely some code already in the
# gallery that can get you started on making your final visualization. If you're stuck
# I recommend checking it out.