Module 1: Introduction to R

In Introduction to R, you will master the basics of this widely used open source language, including factors, lists, and data frames. With the knowledge gained in this course, you will be ready to undertake your first very own data analysis. Oracle estimated over 2 million R users worldwide in 2012, cementing R as a leading programming language in statistics and data science. Every year, the number of R users grows by about 40%, and an increasing number of organizations are using it in their day-to-day activities. Begin your journey to learn R with us today!

1: Intro to basics

Take your first steps with R. In this chapter, you will learn how to use the console as a calculator and how to assign variables. You will also get to know the basic data types in R. Let’s get started.

How it works

In the editor on the right you should type R code to solve the exercises. When you hit the ‘Submit Answer’ button, every line of code is interpreted and executed by R and you get a message telling you whether or not your code was correct. The output of your R code is shown in the console in the lower right corner.

R makes use of the # sign to add comments, so that you and others can understand what the R code is about. Just like Twitter! Comments are not run as R code, so they will not influence your result. For example, # Calculate 3 + 4 in the editor on the right is a comment.
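As a small illustration of the comment syntax described above (the variable name result is only a placeholder for this sketch):

```r
# Calculate 3 + 4 (this whole line is a comment and is not executed)
result <- 3 + 4  # a comment can also follow code on the same line
result
```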

You can also execute R commands straight in the console. This is a good way to experiment with R code, as your submission is not checked for correctness.

Calculate 3 + 4

3 + 4
## [1] 7

Calculate 6 + 12

6 + 12
## [1] 18

See how the console shows the result of the R code you submitted? Now that you’re familiar with the interface, let’s get down to R business!

Arithmetic with R

In its most basic form, R can be used as a simple calculator. Consider the following arithmetic operators:

Addition: +
Subtraction: -
Multiplication: *
Division: /
Exponentiation: ^
Modulo: %%

The last two might need some explaining:

The ^ operator raises the number to its left to the power of the number to its right: for example 3^2 is 9. The modulo returns the remainder of the division of the number to the left by the number on its right, for example 5 modulo 3 or 5 %% 3 is 2. With this knowledge, follow the instructions below to complete the exercise.

An addition

5 + 5
## [1] 10

A subtraction

5 - 5 
## [1] 0

A multiplication

3 * 5
## [1] 15

A division

(5 + 5) / 2 
## [1] 5

Exponentiation

2^5
## [1] 32

Modulo

28 %% 6
## [1] 4

Variable assignment

A basic concept in (statistical) programming is called a variable.

A variable allows you to store a value (e.g. 4) or an object (e.g. a function description) in R. You can then later use this variable’s name to easily access the value or the object that is stored within this variable.

You can assign a value 4 to a variable my_var with the command

my_var <- 4

Assign the value 42 to x

x <- 42

Variable assignment (2)

Suppose you have a fruit basket with five apples. As a data analyst in training, you want to store the number of apples in a variable with the name my_apples.

Assign the value 5 to the variable my_apples

my_apples <- 5

Variable assignment (3)

Every tasty fruit basket needs oranges, so you decide to add six oranges. As a data analyst, your reflex is to immediately create the variable my_oranges and assign the value 6 to it. Next, you want to calculate how many pieces of fruit you have in total. Since you have given meaningful names to these values, you can now code this in a clear way:

my_apples + my_oranges

Assign a value to the variable my_oranges

my_oranges <- 6

Add these two variables together

my_apples + my_oranges
## [1] 11

Create the variable my_fruit

my_fruit <- my_apples + my_oranges

Nice one! The great advantage of doing calculations with variables is reusability. If you just change my_apples to equal 12 instead of 5 and rerun the script, my_fruit will automatically update as well. Continue to the next exercise.
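A minimal sketch of that rerun, assuming the basket still holds six oranges:

```r
# Change the number of apples, then rerun the same calculation
my_apples <- 12
my_oranges <- 6
my_fruit <- my_apples + my_oranges
my_fruit
```

Note that my_fruit only changes once the line that computes it is run again; assigning a new value to my_apples alone does not touch it.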

Apples and oranges

Common knowledge tells you not to add apples and oranges. But hey, that is what you just did, no :-)? The my_apples and my_oranges variables both contained a number in the previous exercise. The + operator works with numeric variables in R. If you really tried to add “apples” and “oranges”, and assigned a text value to the variable my_oranges (see the editor), you would be trying to assign the addition of a numeric and a character variable to the variable my_fruit. This is not possible.
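To see what happens without stopping your script, the sketch below wraps the addition in tryCatch(). That function is not covered in this chapter and is used here only to capture the error message; the variable name attempt is a placeholder.

```r
my_apples <- 5
my_oranges <- "six"  # a character value, not a number

# Adding a numeric and a character raises an error; capture its message
attempt <- tryCatch(
  my_apples + my_oranges,
  error = function(e) conditionMessage(e)
)
attempt
```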

Basic data types in R

R works with numerous data types. Some of the most basic types to get started are:

  • Decimal values like 4.5 are examples of the data type numeric.
  • Whole numbers like 3, 4, 0 and -4 are called integers. Integers are also examples of the numeric data type.
  • Boolean values (TRUE or FALSE) are called logical.
  • Text (or string) values are examples of the data type character.

Note how the quotation marks indicate that “the text inside quotation marks” is a character.
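One detail about numerics and integers is worth checking for yourself: a whole number typed without a suffix is stored as numeric, and R only reports the integer class when you add the L suffix. The sketch below uses class(), which is introduced later in this chapter, and is.numeric(); the variable names are placeholders.

```r
whole_number <- 4    # stored as numeric (double) by default
true_integer <- 4L   # the L suffix asks R for an integer

class(whole_number)
class(true_integer)

# Integers still count as numeric values
is.numeric(true_integer)
```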

Change my_numeric to be 42

my_numeric <- 42

Change my_character to be “universe”

my_character <- "universe"

Change my_logical to be FALSE

my_logical <- FALSE

What’s the data type?

If you set the variable my_apples to have the value 5 and the variable my_oranges to have the value “six”, attempting to add them will give an error. This is due to a mismatch in data types. You can add two (or more) numerics together, but you can’t add a numeric and a character together. Avoid such embarrassing situations by checking the data type of a variable beforehand. You can do this with the class() function, as the code below shows.

Check class of my_numeric

class(my_numeric)
## [1] "numeric"

Check class of my_character

class(my_character)
## [1] "character"

Check class of my_logical

class(my_logical)
## [1] "logical"

This was the last exercise for this chapter. Head over to the next chapter to get immersed in the world of vectors!


2: Vectors

We take you on a trip to Vegas, where you will learn how to analyze your gambling results using vectors in R. After completing this chapter, you will be able to create vectors in R, name them, select elements from them, and compare different vectors.

Create a vector

Feeling lucky? You better, because this chapter takes you on a trip to the City of Sins, also known as Statisticians’ Paradise!

Thanks to R and your new data-analytical skills, you will learn how to uplift your performance at the tables and fire off your career as a professional gambler. This chapter will show how you can easily keep track of your betting progress and how you can do some simple analyses on past actions. Next stop, Vegas Baby… VEGAS!!

Define the variable vegas

vegas <- "Go!"

Create a vector (2)

Let us focus first!

On your way from rags to riches, you will make extensive use of vectors. Vectors are one-dimensional arrays that can hold numeric data, character data, or logical data. In other words, a vector is a simple tool to store data. For example, you can store your daily gains and losses in the casinos.

In R, you create a vector with the combine function c(). You place the vector elements separated by a comma between the parentheses. For example:

numeric_vector <- c(1, 2, 3)
character_vector <- c("a", "b", "c")

Once you have created these vectors in R, you can use them to do calculations.

numeric_vector <- c(1, 10, 49)
character_vector <- c("a", "b", "c")

Complete the code for boolean_vector

boolean_vector <- c(TRUE, FALSE, TRUE)

Notice that adding a space behind the commas in the c() function improves the readability of your code, but changes nothing else. Let’s practice some more with vector creation in the next exercise.

Create a vector (3)

Poker winnings from Monday to Friday

poker_vector <- c(140, -50, 20, -120, 240)

Roulette winnings from Monday to Friday

roulette_vector <- c(-24, -50, 100, -350, 10)

To check out the contents of your vectors, remember that you can always simply type the variable in the console and hit Enter.

Naming a vector

As a data analyst, it is important to have a clear view on the data that you are using. Understanding what each element refers to is therefore essential.

In the previous exercise, we created a vector with your winnings over the week. Each vector element refers to a day of the week but it is hard to tell which element belongs to which day. It would be nice if you could show that in the vector itself.

You can give a name to the elements of a vector with the names() function. Have a look at this example:

some_vector <- c("John Doe", "poker player")
names(some_vector) <- c("Name", "Profession")

This code first creates a vector some_vector and then gives the two elements a name. The first element is assigned the name Name, while the second element is labeled Profession. Printing the contents to the console yields the following output:

      Name     Profession 
"John Doe" "poker player" 

Assign days as names of poker_vector

names(poker_vector) <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")

Assign days as names of roulette_vector

names(roulette_vector) <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")

Naming a vector (2)

If you want to become a good statistician, you have to become lazy. (If you are already lazy, chances are high you are one of those exceptional, natural-born statistical talents.)

In the previous exercises you probably experienced that it is boring and frustrating to type and retype information such as the days of the week. However, when you look at it from a higher perspective, there is a more efficient way to do this, namely, to assign the days of the week vector to a variable!

Just like you did with your poker and roulette returns, you can also create a variable that contains the days of the week. This way you can use and re-use it.

The variable days_vector

days_vector <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")

Assign the names of the day to roulette_vector and poker_vector

names(roulette_vector) <- days_vector
names(poker_vector) <- days_vector

A word of advice: try to avoid code duplication at all times. Continue to the next exercise and learn how to do arithmetic with vectors!

Calculating total winnings

Now that you have the poker and roulette winnings nicely as named vectors, you can start doing some data analytical magic.

You want to find out the following type of information:

  • How much has been your overall profit or loss per day of the week?
  • Have you lost money over the week in total?
  • Are you winning/losing money on poker or on roulette?

To get the answers, you have to do arithmetic calculations on vectors.

It is important to know that if you sum two vectors in R, it takes the element-wise sum. For example, the following three statements are completely equivalent:

c(1, 2, 3) + c(4, 5, 6)
c(1 + 4, 2 + 5, 3 + 6)
c(5, 7, 9)

You can also do the calculations with variables that represent vectors:

a <- c(1, 2, 3)
b <- c(4, 5, 6)
c <- a + b

A_vector <- c(1, 2, 3)
B_vector <- c(4, 5, 6)

Take the sum of A_vector and B_vector

total_vector <- A_vector + B_vector

Calculating total winnings (2)

Now you understand how R does arithmetic with vectors, it is time to get those Ferraris in your garage! First, you need to understand what the overall profit or loss per day of the week was. The total daily profit is the sum of the profit/loss you made on poker per day, and the profit/loss you made on roulette per day.

In R, this is just the sum of roulette_vector and poker_vector.

Assign to total_daily how much you won/lost on each day

total_daily <- poker_vector + roulette_vector

Display the daily total

total_daily
##    Monday   Tuesday Wednesday  Thursday    Friday 
##       116      -100       120      -470       250

Calculating total winnings (3)

Based on the previous analysis, it looks like you had a mix of good and bad days. This is not what your ego expected, and you wonder if there may be a very tiny chance you have lost money over the week in total?

A function that helps you to answer this question is sum(). It calculates the sum of all elements of a vector. For example, to calculate the total amount of money you have lost/won with poker you do:

total_poker <- sum(poker_vector)

Total winnings with poker

total_poker <- sum(poker_vector)
total_poker
## [1] 230

Total winnings with roulette

total_roulette <- sum(roulette_vector)
total_roulette
## [1] -314

Total winnings overall

total_week <- total_poker + total_roulette

Comparing total winnings

Oops, it seems like you are losing money. Time to rethink and adapt your strategy! This will require some deeper analysis…

After a short brainstorm in your hotel’s jacuzzi, you realize that a possible explanation might be that your skills in roulette are not as well developed as your skills in poker. So maybe your total gains in poker are higher (or > ) than in roulette.

Check if you realized higher total gains in poker than in roulette

total_poker > total_roulette
## [1] TRUE

Vector selection: the good times

Your hunch seemed to be right. It appears that the poker game is more your cup of tea than roulette.

Another possible route for investigation is your performance at the beginning of the working week compared to the end of it. You did have a couple of Margarita cocktails at the end of the week…

To answer that question, you only want to focus on a selection of a vector such as poker_vector. In other words, our goal is to select specific elements of the vector. To select elements of a vector (and later matrices, data frames, …), you can use square brackets. Between the square brackets, you indicate what elements to select. For example, to select the first element of the poker vector, you type poker_vector[1]. To select the second element of the vector, you type poker_vector[2], etc. Notice that the first element in a vector in R has index 1, not 0 as in many other programming languages.

Define a new variable based on a selection

poker_wednesday <- poker_vector[3]

R also makes it possible to select multiple elements from a vector at once. Learn how in the next exercise!

Vector selection: the good times (2)

How about analyzing your midweek results?

To select multiple elements from a vector, you can add square brackets at the end of it. You can indicate between the brackets what elements should be selected. Suppose you want to select the first and the fifth day of the week: use the vector c(1, 5) between the square brackets. For example, the code below selects the first and fifth element of poker_vector:

poker_vector[c(1, 5)]

Define a new variable based on a selection

poker_midweek <- poker_vector[c(2,3,4)]

Continue to the next exercise to specialize in vector selection some more!

Vector selection: the good times (3)

Selecting multiple elements of poker_vector with c(2, 3, 4) is not very convenient. Many statisticians are lazy people by nature, so they created an easier way to do this: c(2, 3, 4) can be abbreviated to 2:4, which generates a vector with all natural numbers from 2 up to and including 4.

So, another way to find the mid-week results is poker_vector[2:4].

Define a new variable based on a selection

roulette_selection_vector <- roulette_vector[2:5]

The colon operator is extremely useful and very often used in R programming, so remember it well.
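A quick sketch to confirm that the two notations select the same elements. It rebuilds poker_vector so that it runs on its own; by_colon and by_combine are placeholder names.

```r
poker_vector <- c(140, -50, 20, -120, 240)

# 2:4 expands to the positions 2, 3 and 4
2:4

# Both selections return the same elements
by_colon <- poker_vector[2:4]
by_combine <- poker_vector[c(2, 3, 4)]
identical(by_colon, by_combine)
```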

Vector selection: the good times (4)

Another way to tackle the previous exercise is by using the names of the vector elements (Monday, Tuesday, …) instead of their numeric positions. For example,

poker_vector["Monday"]

will select the first element of poker_vector because “Monday” is the name of that first element.

Just like you did in the previous exercise with numerics, you can also use the element names to select multiple elements, for example:

poker_vector[c("Monday","Tuesday")]

Select poker results for Monday, Tuesday and Wednesday

poker_start <- poker_vector[c("Monday", "Tuesday","Wednesday")]

Calculate the average of the elements in poker_start

mean(poker_start)
## [1] 36.66667

Good job! Apart from subsetting vectors by index or by name, you can also subset vectors by comparison. The next exercises will show you how!

Selection by comparison - Step 1

By making use of comparison operators, we can approach the previous question in a more proactive way.

The (logical) comparison operators known to R are:

  • < for less than
  • > for greater than
  • <= for less than or equal to
  • >= for greater than or equal to
  • == for equal to each other
  • != not equal to each other

As seen in the previous chapter, stating 6 > 5 returns TRUE. The nice thing about R is that you can use these comparison operators also on vectors. For example:

c(4, 5, 6) > 5
## [1] FALSE FALSE  TRUE

This command tests for every element of the vector if the condition stated by the comparison operator is TRUE or FALSE.

Which days did you make money on poker?

poker_vector > 0
##    Monday   Tuesday Wednesday  Thursday    Friday 
##      TRUE     FALSE      TRUE     FALSE      TRUE
selection_vector <- poker_vector > 0

Selection by comparison - Step 2

Working with comparisons will make your data analytical life easier. Instead of selecting a subset of days to investigate yourself (like before), you can simply ask R to return only those days where you realized a positive return for poker.

In the previous exercises you used selection_vector <- poker_vector > 0 to find the days on which you had a positive poker return. Now, you would like to know not only the days on which you won, but also how much you won on those days.

You can select the desired elements, by putting selection_vector between the square brackets that follow poker_vector:

poker_vector[selection_vector]

R knows what to do when you pass a logical vector in square brackets: it will only select the elements that correspond to TRUE in selection_vector.

Select from poker_vector these days

poker_winning_days <- poker_vector[selection_vector]
poker_winning_days
##    Monday Wednesday    Friday 
##       140        20       240

I printed out selection_vector and poker_winning_days along the way, just to be sure I was doing it right.
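The two steps can also be written as one expression. A small sketch, rebuilding poker_vector so that it runs on its own:

```r
poker_vector <- c(140, -50, 20, -120, 240)
names(poker_vector) <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")

# The comparison produces a logical vector, which is used directly as the selector
poker_winning_days <- poker_vector[poker_vector > 0]
poker_winning_days

# sum() of a logical vector counts the TRUE values, i.e. the number of winning days
sum(poker_vector > 0)
```

Keeping the logical vector in its own variable, as in the exercise, is still handy when you want to inspect it or reuse it for several selections.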

Advanced selection

Just like you did for poker, you also want to know those days where you realized a positive return for roulette.

Which days did you make money on roulette?

selection_vector <- roulette_vector > 0

Notice how we’re reusing selection_vector for the second time.

Select from roulette_vector these days

roulette_winning_days <- roulette_vector[selection_vector]

This exercise concludes the chapter on vectors. The next chapter will introduce you to the two-dimensional version of vectors: matrices.


3: Matrices

In this chapter, you will learn how to work with matrices in R. By the end of the chapter, you will be able to create matrices and understand how to do basic computations with them. You will analyze the box office numbers of the Star Wars movies and learn how to use matrices in R. May the force be with you!

What’s a matrix?

In R, a matrix is a collection of elements arranged into a fixed number of rows and columns. Since you are only working with rows and columns, a matrix is called two-dimensional.

You can construct a matrix in R with the matrix() function. Consider the following example:

matrix(1:9, byrow = TRUE, nrow = 3)

In the matrix() function:

  • the first argument is the collection of elements that R will arrange into the rows and columns of the matrix. Here, we use 1:9 which, as we’ve seen before, is a shortcut for c(1, 2, 3, 4, 5, 6, 7, 8, 9).
  • the next argument, byrow, indicates that the matrix is filled by the rows. If we want the matrix to be filled by the columns, we just place byrow = FALSE.
  • the third argument, nrow, indicates that the matrix should have three rows.

Construct a matrix with 3 rows that contains the numbers 1 up to 9

matrix(1:9, byrow=TRUE, nrow = 3)
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
## [3,]    7    8    9

What happens, do you think, if you change TRUE in the above code to FALSE? Try it and see.
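If you want to check your answer, here is a sketch of the comparison; by_row and by_col are placeholder names.

```r
by_row <- matrix(1:9, byrow = TRUE, nrow = 3)
by_col <- matrix(1:9, byrow = FALSE, nrow = 3)

# With byrow = FALSE, the numbers 1, 2, 3 run down the first column
# instead of across the first row
by_col

# Each matrix is the transpose of the other
all(by_col == t(by_row))
```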

Analyze matrices, you shall

It is now time to get your hands dirty. In the following exercises you will analyze the box office numbers of the Star Wars franchise. May the force be with you!

In the editor, three vectors are defined. Each one represents the box office numbers of one of the first three Star Wars movies. The first element of each vector indicates the US box office revenue, the second element refers to the Non-US box office (source: Wikipedia).

In this exercise, you’ll combine all these figures into a single vector. Next, you’ll build a matrix from this vector.

Box office Star Wars (in millions!)

new_hope <- c(460.998, 314.4)
empire_strikes <- c(290.475, 247.900)
return_jedi <- c(309.306, 165.8)

In other words, three vectors, all of length 2, containing numerics.

Create box_office

box_office <- c(new_hope, empire_strikes, return_jedi)

In other words, one vector, called box_office, of length 6.

Construct star_wars_matrix

star_wars_matrix <- matrix(box_office, byrow = TRUE, nrow = 3)

In other words, a matrix, called star_wars_matrix, with 3 rows, which has been filled, by row, with the 6 elements of the vector box_office. This forces there to be two elements per row.

What happens if you try to fill a matrix with 3 rows using a vector containing 5 elements, or 7 elements, or 8 elements? Try it and see.

You’ll see that to fill a matrix containing n rows, you need a multiple of n elements. Draw a picture of a matrix to help you understand why.
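A sketch of the experiment suggested above. In current versions of R, a length that does not fit produces a warning rather than an error, and the values are recycled to fill the remaining cells. tryCatch() is not covered in this chapter and is used here only to capture the warning text; fill_message is a placeholder name.

```r
# Six elements fit a 3-row matrix exactly: 3 rows x 2 columns
dim(matrix(1:6, nrow = 3))

# Five elements do not fit; capture the warning R raises
fill_message <- tryCatch(
  matrix(1:5, nrow = 3),
  warning = function(w) conditionMessage(w)
)
fill_message
```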

The force is actually with you!

Naming a matrix

To help you remember what is stored in star_wars_matrix, you would like to add the names of the movies for the rows. Not only does this help you to read the data, but it is also useful to select certain elements from the matrix.

Previously we used the function names() to name the elements of a vector. As we’re dealing with two-dimensional arrays now, we need to be a little more specific. Similar to vectors, you can add names for the rows and the columns of a matrix:

rownames(my_matrix) <- row_names_vector
colnames(my_matrix) <- col_names_vector

Below we’ll create two new vectors for use in this chapter: region, and titles. You will use these vectors to name the columns and rows of star_wars_matrix, respectively.

Create vectors region and titles, used for naming

region <- c("US", "non-US")
titles <- c("A New Hope", "The Empire Strikes Back", "Return of the Jedi")

Name the columns with region

colnames(star_wars_matrix) <- region

Name the rows with titles

rownames(star_wars_matrix) <- titles

Check dimension of star_wars_matrix

dim(star_wars_matrix)
## [1] 3 2

Notice it gives you the number of rows first, then the number of columns.

How, using the vector box_office, would you create a matrix with 2 columns and 3 rows? We might look at this in one of our challenges.

Calculating the worldwide box office

The single most important thing for a movie in order to become an instant legend in Tinseltown is its worldwide box office figures.

To calculate the total box office revenue for the three Star Wars movies, you have to take the sum of the US revenue column and the non-US revenue column.

In R, when dealing with matrices, the function rowSums() conveniently calculates the totals for each row of a matrix. This function creates a new vector:

rowSums(my_matrix)

which we can assign to a variable.

Calculate worldwide box office figures

worldwide_vector <- rowSums(star_wars_matrix)
worldwide_vector
##              A New Hope The Empire Strikes Back      Return of the Jedi 
##                 775.398                 538.375                 475.106

What does this vector tell you exactly?

What about the vector colSums(star_wars_matrix)? What will this vector tell us? Have a thought then give this vector a name.

us_and_abroad_vector <- colSums(star_wars_matrix)

Using dimnames()

Hopefully you’re starting to understand (if you didn’t already) that when we refer to matrices, we always refer to their rows first, then columns. For example, a 3 x 2 matrix is one with 3 rows and 2 columns.

This is true also when we build a matrix from scratch. Below is code which builds the same matrix we’ve been building over the past few exercises:

Construct star_wars_matrix

box_office <- c(460.998, 314.4, 290.475, 247.900, 309.306, 165.8)
star_wars_matrix <- matrix(box_office, nrow = 3, byrow = TRUE,
                           dimnames = list(c("A New Hope", "The Empire Strikes Back", "Return of the Jedi"), 
                                           c("US", "non-US")))

Notice how, rather than using the functions rownames() and colnames(), we used the dimnames argument of matrix() to name the rows and columns. We wrote dimnames = list(c("row", "names"), c("column", "names")).

We also made use of the list function, which we’ll learn to use fully in a later chapter in this module.
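To convince yourself that both routes give the same names, here is a sketch that builds the matrix twice; named_at_creation and named_afterwards are placeholder names.

```r
box_office <- c(460.998, 314.4, 290.475, 247.900, 309.306, 165.8)
titles <- c("A New Hope", "The Empire Strikes Back", "Return of the Jedi")
region <- c("US", "non-US")

# Route 1: name the rows and columns while building the matrix
named_at_creation <- matrix(box_office, nrow = 3, byrow = TRUE,
                            dimnames = list(titles, region))

# Route 2: build first, then name with rownames() and colnames()
named_afterwards <- matrix(box_office, nrow = 3, byrow = TRUE)
rownames(named_afterwards) <- titles
colnames(named_afterwards) <- region

# Both routes store the same row and column names
identical(dimnames(named_at_creation), dimnames(named_afterwards))
```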

Adding a column for the Worldwide box office

In a previous exercise you calculated the vector that contained the worldwide box office receipt for each of the three Star Wars movies. However, this vector is not yet part of star_wars_matrix.

You can add a column or multiple columns to a matrix with the cbind() function, which merges matrices and/or vectors together by column. For example:

big_matrix <- cbind(matrix1, matrix2, vector1, ...)

Let’s add the worldwide totals to the matrix.

Bind the new variable worldwide_vector as a column to star_wars_matrix

all_wars_matrix <- cbind(star_wars_matrix, worldwide_vector)

This step only works cleanly if the vector has one element for each row of the matrix (or, when binding two matrices, if both have the same number of rows).
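A self-contained sketch of this step, with a check on the lengths before binding:

```r
box_office <- c(460.998, 314.4, 290.475, 247.900, 309.306, 165.8)
star_wars_matrix <- matrix(box_office, nrow = 3, byrow = TRUE,
                           dimnames = list(c("A New Hope", "The Empire Strikes Back", "Return of the Jedi"),
                                           c("US", "non-US")))
worldwide_vector <- rowSums(star_wars_matrix)

# One value per movie, so the vector lines up with the three rows
length(worldwide_vector) == nrow(star_wars_matrix)

all_wars_matrix <- cbind(star_wars_matrix, worldwide_vector)
dim(all_wars_matrix)       # still 3 rows, now 3 columns
colnames(all_wars_matrix)  # the new column takes the variable's name
```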

After adding this column to our matrix, the logical next step is to add rows. Learn how in the next exercise…

Adding a row

For this exercise, we’ll use our existing matrix star_wars_matrix, which covered the original trilogy of movies, and a second matrix star_wars_matrix2, also 3 x 2, which will contain the same data for the prequel trilogy.

By the way, if you ever want to check out the contents of the workspace you’re working in, you can type ls() in the console.

Creating the prequels matrix star_wars_matrix2

box_office <- c(474.5, 552.5, 310.7, 338.7, 380.3, 468.5)
star_wars_matrix2 <- matrix(box_office, nrow = 3, byrow = TRUE,
                            dimnames = list(c("The Phantom Menace", "Attack of the Clones", "Revenge of the Sith"),
                                            c("US", "non-US")))

Combine both Star Wars trilogies in one matrix

all_wars_matrix <- rbind(star_wars_matrix, star_wars_matrix2)

Again, we had to be sure, which we were, that the two matrices contained the same number of columns in order to be able to successfully merge them.

The total box office revenue for the entire saga

Would you use colSums() or rowSums() on the matrix all_wars_matrix to calculate the total box office revenue for the entire saga in the US versus abroad?

That’s right! It’s colSums()!

Total revenue for US and non-US

total_revenue_vector <- colSums(all_wars_matrix)

Selection of matrix elements

Similar to vectors, you can use the square brackets [ ] to select one or multiple elements from a matrix. Whereas vectors have one dimension, matrices have two dimensions. You should therefore use a comma to separate the rows you want to select from the columns. Notice again that rows go first, then columns. For example:

my_matrix[1,2] selects the element at the first row and second column.
my_matrix[1:3,2:4] results in a matrix with the data on the rows 1, 2, 3 and columns 2, 3, 4.

If you want to select all elements of a column or a row, no number is needed before or after the comma, respectively:

my_matrix[,1] selects all elements of the first column.
my_matrix[1,] selects all elements of the first row.

What will my_matrix[1] select? Does it even select anything?
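A sketch that answers the question on a small example matrix (my_matrix here is a placeholder built from 1:9). A single index without a comma treats the matrix as one long vector, read column by column.

```r
my_matrix <- matrix(1:9, byrow = TRUE, nrow = 3)

my_matrix[1, 2]   # first row, second column
my_matrix[, 1]    # all elements of the first column
my_matrix[1, ]    # all elements of the first row

# No comma: positions are counted down the columns
my_matrix[1]      # the top-left element
my_matrix[2]      # the element below it, not the one to its right
```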

Back to Star Wars with this newly acquired knowledge!

Select the non-US revenue for all movies

non_us_all <- all_wars_matrix[,2]

Average non-US revenue

mean(non_us_all)
## [1] 347.9667

Select the non-US revenue for first two movies

non_us_some <- all_wars_matrix[1:2,2]

Average non-US revenue for first two movies

mean(non_us_some)
## [1] 281.15

A little arithmetic with matrices

Similar to what you have learned with vectors, the standard operators like +, -, /, *, etc. work in an element-wise way on matrices in R.

For example, 2 * my_matrix multiplies each element of my_matrix by two.

As a newly-hired data analyst for Lucasfilm, it is your job to find out how many visitors went to each movie for each geographical area. You already have the total revenue figures (in the matrix all_wars_matrix). Assume that the price of a ticket was 5 dollars. Simply dividing the box office numbers by this ticket price gives you the number of visitors.

Estimate the visitors

visitors <- all_wars_matrix / 5

A little arithmetic with matrices (2)

After looking at the result of the previous exercise, big boss Lucas points out that the ticket prices went up over time. He asks to redo the analysis based on the prices you can find in ticket_prices_matrix (source: imagination).

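_Those who are familiar with matrices should note that operators such as * and / work element-wise on matrices. In particular, multiplying two matrices with * is not the standard matrix multiplication, for which you should use %*% in R._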
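To make the difference concrete, here is a sketch on a small 2 x 2 matrix; small_matrix, elementwise and matrix_product are placeholder names.

```r
small_matrix <- matrix(1:4, nrow = 2)

# Element-wise: each entry is multiplied by the entry in the same position
elementwise <- small_matrix * small_matrix
elementwise

# Standard matrix multiplication: rows times columns
matrix_product <- small_matrix %*% small_matrix
matrix_product
```

The division used in this exercise behaves like the first case: each revenue figure is divided by the ticket price in the same position.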

Creating ticket_prices_matrix

ticket_prices <- c(5.0, 5.0, 6.0, 6.0, 7.0, 7.0, 4.0, 4.0, 4.5, 4.5, 4.9, 4.9)
ticket_prices_matrix <- matrix(ticket_prices, nrow = 6, byrow = TRUE,
                               dimnames = list(c("A New Hope", "The Empire Strikes Back", "Return of the Jedi","The Phantom Menace", "Attack of the Clones", "Revenge of the Sith"),
                                               c("US", "non-US")))

Estimated number of visitors

visitors <- all_wars_matrix / ticket_prices_matrix

US visitors

us_visitors <- visitors[,1]

Average number of US visitors

mean(us_visitors)
## [1] 75.01339

This exercise concludes the chapter on matrices. Next stop on your journey through the R language: factors.


4: Factors

Data often falls into a limited number of categories. For example, human hair color can be categorized as black, brown, blond, red, grey, or white—and perhaps a few more options for people who color their hair. In R, categorical data is stored in factors. Factors are very important in data analysis, so start learning how to create, subset, and compare them now.

What’s a factor and why would you use it?

In this chapter you dive into the wonderful world of factors.

The term factor refers to a statistical data type used to store categorical variables. The difference between a categorical variable and a continuous variable is that a categorical variable can belong to a limited number of categories. A continuous variable, on the other hand, can correspond to an infinite number of values.

It is important that R knows whether it is dealing with a continuous or a categorical variable, as the statistical models you will develop in the future treat both types differently. (You will see later why this is the case.)

A good example of a categorical variable is sex. In many circumstances you can limit the sex categories to “Male” or “Female”. (Sometimes you may need different categories. For example, you may need to consider chromosomal variation, hermaphroditic animals, or different cultural norms, but you will always have a finite number of categories.)

Assign to the variable “theory” what this chapter is about!

theory <- "factors used for categorical variables"

What’s a factor and why would you use it? (2)

To create factors in R, you make use of the function factor(). First thing that you have to do is create a vector that contains all the observations that belong to a limited number of categories. For example, sex_vector contains the sex of 5 different individuals:

sex_vector <- c("Male","Female","Female","Male","Male")

It is clear that there are two categories, or in R-terms factor levels, at work here: “Male” and “Female”.

The function factor() will encode the vector as a factor:

factor_sex_vector <- factor(sex_vector)

Sex vector

sex_vector <- c("Male", "Female", "Female", "Male", "Male")
class(sex_vector[2])
## [1] "character"
class(sex_vector)
## [1] "character"

So currently R does not know that the values in sex_vector are (or should be treated as) categories.

Convert sex_vector to a factor

factor_sex_vector <- factor(sex_vector)
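
To see what the conversion changed, here is a minimal sketch that reuses the sex_vector from above. It shows that a factor keeps the distinct categories as levels and stores each observation as an integer code pointing into those levels. The values in the comments assume the default alphabetical ordering of levels.

```r
# Same data as in the exercise above
sex_vector <- c("Male", "Female", "Female", "Male", "Male")
factor_sex_vector <- factor(sex_vector)

# The character vector has become a factor
class(factor_sex_vector)        # "factor"

# The distinct categories, sorted alphabetically by default
levels(factor_sex_vector)       # "Female" "Male"

# Each observation is stored as an integer code into the levels
as.integer(factor_sex_vector)   # 2 1 1 2 2
```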

What’s a factor and why would you use it? (3)

There are two types of categorical variables: a nominal categorical variable and an ordinal categorical variable.

A nominal variable is a categorical variable without an implied order. This means that it is impossible to say that ‘one is worth more than the other’. For example, think of the categorical variable animals_vector with the categories “Elephant”, “Giraffe”, “Donkey” and “Horse”. Here, it is impossible to say that one stands above or below the other. (Note that some of you might disagree ;-) ).

In contrast, ordinal variables do have a natural ordering. Consider for example the categorical variable temperature_vector with the categories: “Low”, “Medium” and “High”. Here it is obvious that “Medium” stands above “Low”, and “High” stands above “Medium”.

Animals

animals_vector <- c("Elephant", "Giraffe", "Donkey", "Horse")
factor_animals_vector <- factor(animals_vector)
factor_animals_vector
## [1] Elephant Giraffe  Donkey   Horse   
## Levels: Donkey Elephant Giraffe Horse

Temperature

temperature_vector <- c("High", "Low", "High","Low", "Medium")
factor_temperature_vector <- factor(temperature_vector, ordered = TRUE, levels = c("Low", "Medium", "High"))
factor_temperature_vector
## [1] High   Low    High   Low    Medium
## Levels: Low < Medium < High

Can you already tell what’s happening in this exercise? Awesome! Continue to the next exercise and get into the details of factor levels.
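
One way to check the difference between the two factors is the base R function is.ordered(). The sketch below rebuilds both factors from this exercise; only the ordinal one supports a meaningful > comparison.

```r
# Nominal factor: no order among the levels
factor_animals_vector <- factor(c("Elephant", "Giraffe", "Donkey", "Horse"))

# Ordinal factor: levels listed from lowest to highest
factor_temperature_vector <- factor(
  c("High", "Low", "High", "Low", "Medium"),
  ordered = TRUE,
  levels = c("Low", "Medium", "High")
)

is.ordered(factor_animals_vector)       # FALSE
is.ordered(factor_temperature_vector)   # TRUE

# "High" (element 1) stands above "Low" (element 2)
high_above_low <- factor_temperature_vector[1] > factor_temperature_vector[2]
high_above_low                          # TRUE
```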

Factor levels

When you first get a data set, you will often notice that it contains factors with specific factor levels. However, sometimes you will want to change the names of these levels for clarity or other reasons. R allows you to do this with the function levels():

levels(factor_vector) <- c("name1", "name2",...)

A good illustration is the raw data that is provided to you by a survey. A common question for every questionnaire is the sex of the respondent. Here, for simplicity, just two categories were recorded, “M” and “F”. (You usually need more categories for survey data; either way, you use a factor to store the categorical data.)

survey_vector <- c("M", "F", "F", "M", "M")

Recording the sex with the abbreviations “M” and “F” can be convenient if you are collecting data with pen and paper, but it can introduce confusion when analyzing the data. At that point, you will often want to change the factor levels to “Male” and “Female” instead of “M” and “F” for clarity.

Watch out: the order in which you assign the levels is important. If you type levels(factor_survey_vector), you’ll see that it outputs [1] "F" "M". If you don’t specify the levels of the factor when creating it, R will automatically assign them alphabetically. To correctly map “F” to “Female” and “M” to “Male”, the levels should be set to c("Female", "Male"), in this order.

Code to build factor_survey_vector

survey_vector <- c("M", "F", "F", "M", "M")
factor_survey_vector <- factor(survey_vector)
levels(factor_survey_vector)
## [1] "F" "M"

Specify the levels of factor_survey_vector

levels(factor_survey_vector) <- c("Female", "Male")
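
If relying on the alphabetical default feels fragile, an alternative sketch is to state the mapping explicitly when the factor is created, using the levels and labels arguments of factor(). This is a different route from the levels() assignment above, not a replacement for it; each label is paired with the level in the same position.

```r
survey_vector <- c("M", "F", "F", "M", "M")

# levels names the raw values, labels gives the display name for each one
factor_survey_vector <- factor(
  survey_vector,
  levels = c("F", "M"),
  labels = c("Female", "Male")
)

as.character(factor_survey_vector)
# "Male" "Female" "Female" "Male" "Male"
```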

Summarizing a factor

After finishing this course, one of your favorite functions in R will be summary(). This will give you a quick overview of the contents of a variable:

summary(my_var)

Going back to our survey, you would like to know how many “Male” responses you have in your study, and how many “Female” responses. The summary() function gives you the answer to this question.

Generate summary for survey_vector

summary(survey_vector)
##    Length     Class      Mode 
##         5 character character

Generate summary for factor_survey_vector

summary(factor_survey_vector)
## Female   Male 
##      2      3

Have a look at the output. The fact that you identified “Male” and “Female” as factor levels in factor_survey_vector enables R to show the number of elements for each category.

Battle of the sexes

You might wonder what happens when you try to compare elements of a factor. In factor_survey_vector you have a factor with two levels: “Male” and “Female”. But how does R value these relative to each other?

Male

male <- factor_survey_vector[1]

Female

female <- factor_survey_vector[2]

Battle of the sexes: Male ‘larger’ than female?

male > female
## Warning in Ops.factor(male, female): '>' not meaningful for factors
## [1] NA

By default, R returns NA (with a warning) when you compare values of an unordered factor with > or <, since ranking unordered categories doesn’t make sense. Next you’ll learn about ordered factors, where more meaningful comparisons are possible.
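
Note that only the ordering operators are affected. As a small sketch (rebuilding the factor with the level names directly), testing for equality still works on an unordered factor, while > produces NA:

```r
factor_survey_vector <- factor(c("Male", "Female", "Female", "Male", "Male"))
male <- factor_survey_vector[1]
female <- factor_survey_vector[2]

# Equality and inequality are well defined for unordered factors
same <- male == female        # FALSE
differ <- male != female      # TRUE

# Ordering is not; the warning is suppressed here to keep the output clean
larger <- suppressWarnings(male > female)   # NA
```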

Ordered factors

Since “Male” and “Female” are unordered (or nominal) factor levels, R returns a warning message, telling you that the greater than operator is not meaningful. As seen before, R attaches an equal value to the levels for such factors.

But this is not always the case! Sometimes you will also deal with factors that do have a natural ordering between their categories. If this is the case, you have to make sure that you pass this information to R…

Let us say that you are leading a research team of five data analysts and that you want to evaluate their performance. To do this, you track their speed, evaluate each analyst as “slow”, “medium” or “fast”, and save the results in speed_vector.

Create speed_vector

speed_vector <- c("medium", "slow", "slow","medium", "fast")

Ordered factors (2)

speed_vector should be converted to an ordinal factor since its categories have a natural ordering. By default, the function factor() transforms speed_vector into an unordered factor. To create an ordered factor, you have to add two additional arguments: ordered and levels.

factor(some_vector,
       ordered = TRUE,
       levels = c("lev1", "lev2", ...))

By setting the argument ordered to TRUE in the function factor(), you indicate that the factor is ordered. With the argument levels you give the values of the factor in the correct order.

Convert speed_vector to ordered factor vector

factor_speed_vector <- factor(speed_vector,
                        ordered = TRUE,
                        levels = c("slow", "medium", "fast"))

Comparing ordered factors

Having a bad day at work, ‘data analyst number two’ enters your office and starts complaining that ‘data analyst number five’ is slowing down the entire project. Since you know that ‘data analyst number two’ has the reputation of being a smarty-pants, you first decide to check if his statement is true.

The fact that factor_speed_vector is now ordered enables us to compare different elements (the data analysts in this case). You can simply do this by using the well-known operators.

Factor value for second data analyst

da2 <- factor_speed_vector[2]

Factor value for fifth data analyst

da5 <- factor_speed_vector[5]

Is data analyst 2 faster than data analyst 5?

da2 > da5
## [1] FALSE
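
The same operators also work on the whole factor at once. The sketch below rebuilds factor_speed_vector and compares every analyst against the level "medium"; it also uses max(), which is defined for ordered factors.

```r
speed_vector <- c("medium", "slow", "slow", "medium", "fast")
factor_speed_vector <- factor(speed_vector,
                              ordered = TRUE,
                              levels = c("slow", "medium", "fast"))

# Which analysts are slower than "medium"? One TRUE/FALSE per analyst
slower_than_medium <- factor_speed_vector < "medium"
slower_than_medium            # FALSE TRUE TRUE FALSE FALSE

# The highest level that occurs in the data
fastest <- max(factor_speed_vector)
as.character(fastest)         # "fast"
```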

What do the results tell you? Data analyst two is complaining about data analyst five, while in fact they are the one slowing everything down! This concludes the chapter on factors. With a solid basis in vectors, matrices and factors, you’re ready to dive into the wonderful world of data frames, a very important data structure in R!


5: Data frames

Most datasets you will be working with will be stored as data frames. By the end of this chapter, you will be able to create a data frame, select interesting parts of a data frame, and order a data frame according to certain variables.

What’s a data frame?

You may remember from the chapter about matrices that all the elements that you put in a matrix should be of the same type. Back then, your data set on Star Wars only contained numeric elements.

When doing a market research survey, however, you often have questions such as:

  • ‘Are you married?’ or ‘yes/no’ questions (logical)
  • ‘How old are you?’ (numeric)
  • ‘What is your opinion on this product?’ or other ‘open-ended’ questions (character)
  • …

The output, namely the respondents’ answers to the questions formulated above, is a data set of different data types. You will often find yourself working with data sets that contain different data types instead of only one.

A data frame has the variables of a data set as columns and the observations as rows. This will be a familiar concept for those coming from different statistical software packages such as SAS or SPSS.

Quick, have a look at your data set

Wow, that is a lot of cars!

Working with large data sets is not uncommon in data analysis. When you work with (extremely) large data sets and data frames, your first task as a data analyst is to develop a clear understanding of its structure and main elements. Therefore, it is often useful to show only a small part of the entire data set.

So how do you do this in R? Well, the function head() enables you to show the first observations of a data frame. Similarly, the function tail() prints out the last observations in your data set.

Both head() and tail() print a top line called the ‘header’, which contains the names of the different variables in your data set.

Call head() on mtcars

head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
tail(mtcars)
##                 mpg cyl  disp  hp drat    wt qsec vs am gear carb
## Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.7  0  1    5    2
## Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.9  1  1    5    2
## Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.5  0  1    5    4
## Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.5  0  1    5    6
## Maserati Bora  15.0   8 301.0 335 3.54 3.570 14.6  0  1    5    8
## Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.6  1  1    4    2
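
Both functions show six observations by default. As a small sketch, the n argument changes how many rows are returned:

```r
# First three and last two observations of the built-in mtcars data set
first_three <- head(mtcars, n = 3)
last_two <- tail(mtcars, n = 2)

rownames(first_three)   # "Mazda RX4" "Mazda RX4 Wag" "Datsun 710"
rownames(last_two)      # "Maserati Bora" "Volvo 142E"
```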

So, what do we have in this data set? For example, hp represents the car’s horsepower; the Datsun has the lowest horsepower of the 6 cars displayed by head(). For a full overview of the variables’ meaning, type ?mtcars in the console and read the help page.

Having used ?mtcars to view an explanation of the variables represented in this dataframe, what does the variable am refer to? And hp?

Have a look at the structure

Another method that is often used to get a rapid overview of your data is the function str(). The function str() shows you the structure of your data set. For a data frame it tells you:

  • The total number of observations (e.g. 32 types of car were tested)
  • The total number of variables (e.g. 11 car features)
  • A full list of the variable names (e.g. mpg, cyl … )
  • The data type of each variable (e.g. num)
  • The first observations

Applying the str() function will often be the first thing that you do when receiving a new data set or data frame. It is a great way to get more insight into your data set before diving into the real analysis.

Investigate the structure of mtcars

mtcars
##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
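
If you only need one of these facts rather than the whole overview, base R has a dedicated function for each. A short sketch on mtcars:

```r
dim(mtcars)      # 32 11  (observations, variables)
nrow(mtcars)     # 32
ncol(mtcars)     # 11
names(mtcars)    # "mpg" "cyl" "disp" ... "carb"

# Data type of every column, collected into a named character vector
column_types <- sapply(mtcars, class)
all(column_types == "numeric")   # TRUE
```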

Creating a data frame

Since using built-in data sets is not even half the fun of creating your own data sets, the rest of this chapter is based on your personally developed data set. Put your jet pack on because it is time for some space exploration!

As a first goal, you want to construct a data frame that describes the main characteristics of eight planets in our solar system. According to your good friend Buzz, the main features of a planet are:

  • The type of planet (Terrestrial or Gas Giant).
  • The planet’s diameter relative to the diameter of the Earth.
  • The planet’s rotation period relative to that of the Earth (negative values indicate rotation in the opposite direction).
  • If the planet has rings or not (TRUE or FALSE).

After doing some high-quality research on Wikipedia, you feel confident enough to create the necessary vectors: name, type, diameter, rotation and rings.

Definition of vectors

name <- c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune")
type <- c("Terrestrial planet", "Terrestrial planet", "Terrestrial planet", "Terrestrial planet", "Gas giant", "Gas giant", "Gas giant", "Gas giant")
diameter <- c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883)
rotation <- c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67)
rings <- c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)

The first element in each of these vectors corresponds to the first observation.

Create a data frame from the vectors

Next we construct a data frame with the data.frame() function. As arguments, you pass the vectors from before: they will become the different columns of your data frame. Because every column has the same length, the vectors you pass should also have the same length. But don’t forget that it is possible (and likely) that they contain different types of data.

planets_df <- data.frame(name, type, diameter, rotation, rings)

The logical next step, as you know by now, is inspecting the data frame you just created.

Creating a data frame (2)

The planets_df data frame we just created should have 8 observations and 5 variables. Let’s double check!

Check the structure of planets_df

str(planets_df)
## 'data.frame':    8 obs. of  5 variables:
##  $ name    : Factor w/ 8 levels "Earth","Jupiter",..: 4 8 1 3 2 6 7 5
##  $ type    : Factor w/ 2 levels "Gas giant","Terrestrial planet": 2 2 2 2 1 1 1 1
##  $ diameter: num  0.382 0.949 1 0.532 11.209 ...
##  $ rotation: num  58.64 -243.02 1 1.03 0.41 ...
##  $ rings   : logi  FALSE FALSE FALSE FALSE TRUE TRUE ...
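
Whether name and type show up as factors depends on your R version. The output above comes from a version in which data.frame() converted character vectors to factors by default; since R 4.0.0 the default is stringsAsFactors = FALSE, so the same call reports these columns as chr. A sketch that makes the choice explicit either way (as_factors and as_strings are names chosen for this illustration):

```r
name <- c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune")
type <- c("Terrestrial planet", "Terrestrial planet", "Terrestrial planet", "Terrestrial planet",
          "Gas giant", "Gas giant", "Gas giant", "Gas giant")
diameter <- c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883)
rotation <- c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67)
rings <- c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)

# Character columns stored as factors (matches the str() output above)
as_factors <- data.frame(name, type, diameter, rotation, rings, stringsAsFactors = TRUE)
class(as_factors$type)     # "factor"
nlevels(as_factors$type)   # 2

# Character columns kept as plain character vectors
as_strings <- data.frame(name, type, diameter, rotation, rings, stringsAsFactors = FALSE)
class(as_strings$type)     # "character"
```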

Now that you have a clear understanding of the planets_df data set, it’s time to see how you can select elements from it.

Selection of data frame elements

Similar to vectors and matrices, you select elements from a data frame with the help of square brackets [ ]. By using a comma, you can indicate what to select from the rows and the columns respectively. For example:

  • my_df[1,2] selects the value at the first row and second column in my_df.
  • my_df[1:3,2:4] selects the values that appear in columns 2, 3, 4 of rows 1, 2, 3 in my_df.
  • my_df[1, ] selects all elements of the first row.
  • my_df[, 4] selects all elements of the fourth column.
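
The four patterns above can be tried on any data frame. As a sketch, my_df here is assumed to be a small slice of the built-in mtcars data set:

```r
# Hypothetical my_df: first 5 rows and first 4 columns (mpg, cyl, disp, hp) of mtcars
my_df <- mtcars[1:5, 1:4]

single_value <- my_df[1, 2]    # 6, the cyl value of the Mazda RX4
block <- my_df[1:3, 2:4]       # data frame with 3 rows and 3 columns
first_row <- my_df[1, ]        # data frame with a single row
fourth_col <- my_df[, 4]       # plain vector: 110 110 93 110 175
```

Note that selecting a single column with a comma returns a vector, while selecting a row returns a data frame.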

Selection of data frame elements (2)

Instead of using numerics to select elements of a data frame, you can also use the variable names to select columns of a data frame.

Suppose you want to select the first three elements of the type column. One way to do this is

planets_df[1:3,2]

A possible disadvantage of this approach is that you have to know (or look up) the column number of type, which gets hard if you have a lot of variables. It is often easier to just make use of the variable name:

planets_df[1:3,"type"]

Select the first 5 values of diameter column

planets_df[1:5, "diameter"]
## [1]  0.382  0.949  1.000  0.532 11.209
planets_df[, "diameter"]
## [1]  0.382  0.949  1.000  0.532 11.209  9.449  4.007  3.883

Only planets with rings

You will often want to select an entire column, namely one specific variable from a data frame. If you want to select all elements of the variable diameter, for example, both of these will do the trick:

planets_df[,3]
planets_df[,"diameter"]

However, there is a short-cut. If your columns have names, you can use the $ sign:

planets_df$diameter

View the diameter variable from planets_df

planets_df$diameter
## [1]  0.382  0.949  1.000  0.532 11.209  9.449  4.007  3.883

Select the rings variable from planets_df

rings_vector <- planets_df$rings

Only planets with rings (2)

You probably remember from high school that some planets in our solar system have rings and others do not. Unfortunately you cannot recall their names. Could R help you out?

If you type rings_vector in the console, you get:

[1] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE

This means that the first four observations (or planets) do not have a ring (FALSE), but the other four do (TRUE). However, you do not get a nice overview of the names of these planets, their diameter, etc. Let’s try to use rings_vector to select the data for the four planets with rings.

Adapt the code to select all columns for planets with rings

planets_df[rings_vector,]
##      name      type diameter rotation rings
## 5 Jupiter Gas giant   11.209     0.41  TRUE
## 6  Saturn Gas giant    9.449     0.43  TRUE
## 7  Uranus Gas giant    4.007    -0.72  TRUE
## 8 Neptune Gas giant    3.883     0.67  TRUE

Notice that we’ve put this logical vector before the comma. This tells R to display the observations associated with a TRUE value, and not those with a FALSE value.
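
The row selector does not have to be a separately stored vector. As a sketch, any expression that yields one TRUE or FALSE per row can go before the comma. The data frame below is a trimmed-down version of planets_df with only the columns needed here.

```r
# Trimmed-down planets_df: only the columns needed for this sketch
planets_df <- data.frame(
  name = c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune"),
  diameter = c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883),
  rings = c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# Use the logical column directly as the row selector
ringed_names <- planets_df[planets_df$rings, "name"]
ringed_names                 # "Jupiter" "Saturn" "Uranus" "Neptune"

# A comparison produces a logical vector too
bigger_than_earth <- planets_df[planets_df$diameter > 1, ]
nrow(bigger_than_earth)      # 4
```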

What would happen if you put rings_vector after the comma? And why? Try it and see!

Using planets_df[rings_vector,] is a rather tedious solution. The next exercise will teach you how to do it in a more concise way.

Only planets with rings but shorter

So what exactly did you learn in the previous exercises? You selected a subset from a data frame (planets_df) based on whether or not a certain condition was true (rings or no rings), and you managed to pull out all relevant data. Pretty awesome! By now, NASA is probably already flirting with your CV ;-).

Now, let us move up one level and use the function subset(). You should see the subset() function as a short-cut to do exactly the same as what you did in the previous exercises.

subset(my_df, subset = some_condition)

The first argument of subset() specifies the data set for which you want a subset. By adding the second argument, you give R the necessary information and conditions to select the correct subset.

The code below will give the exact same result as you got in the previous exercise, but this time, you didn’t need to create the rings_vector!

subset(planets_df, subset = rings)

Select planets with diameter < 1

subset(planets_df, diameter < 1)
##      name               type diameter rotation rings
## 1 Mercury Terrestrial planet    0.382    58.64 FALSE
## 2   Venus Terrestrial planet    0.949  -243.02 FALSE
## 4    Mars Terrestrial planet    0.532     1.03 FALSE
subset(planets_df, rings == TRUE)
##      name      type diameter rotation rings
## 5 Jupiter Gas giant   11.209     0.41  TRUE
## 6  Saturn Gas giant    9.449     0.43  TRUE
## 7  Uranus Gas giant    4.007    -0.72  TRUE
## 8 Neptune Gas giant    3.883     0.67  TRUE

Not only is the subset() function more concise, it is probably also more understandable for people who read your code.
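
subset() can also restrict the columns through its select argument. A minimal sketch, again on a trimmed-down planets_df (small_planets is a name chosen for this illustration):

```r
# Trimmed-down planets_df: only the columns needed for this sketch
planets_df <- data.frame(
  name = c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune"),
  diameter = c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883),
  rings = c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# Rows where diameter < 1, keeping only the name and diameter columns
small_planets <- subset(planets_df, subset = diameter < 1, select = c(name, diameter))
small_planets$name     # "Mercury" "Venus" "Mars"
names(small_planets)   # "name" "diameter"
```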

Sorting

Making and creating rankings is one of mankind’s favorite affairs. These rankings can be useful (best universities in the world), entertaining (most influential movie stars) or pointless (best 007 look-a-like).

In data analysis you can sort your data according to a certain variable in the data set. In R, this is done with the help of the function order().

order() is a function that, when applied to a variable such as a vector, returns the positions of its elements in ascending order of their values. For example:

Sort the vector a

a <- c(1000, 10, 100)
order(a)
## [1] 2 3 1
a
## [1] 1000   10  100

10, which is the second element in a, is the smallest element, so 2 comes first in the output of order(a). 100, which is the third element in a is the second smallest element, so 3 comes second in the output of order(a).

Note that order(a) has not altered a itself.

We can use the output of order(a) to reshuffle a:

Reshuffle a in ascending order

a[order(a)]
## [1]   10  100 1000

Once more, be aware that we haven’t altered a at all. We could, by assigning a[order(a)] to a (or to b if we wanted to keep a intact).
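
It can help to see order() next to two related base R functions. In the sketch below, sort() returns the sorted values themselves, rank() returns the rank of each element in its original position, and the decreasing argument of order() reverses the direction.

```r
a <- c(1000, 10, 100)

order(a)                      # 2 3 1  (positions, smallest value first)
order(a, decreasing = TRUE)   # 1 3 2  (positions, largest value first)
sort(a)                       # 10 100 1000  (the values themselves)
rank(a)                       # 3 1 2  (rank of each element where it stands)

# Indexing with order() gives the same result as sort()
identical(a[order(a)], sort(a))   # TRUE
```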

Play around with the order function in the console

Now let’s use the order() function to sort your data frame!

Sorting your data frame

Alright, now that you understand the order() function, let us do something useful with it. You would like to rearrange your data frame such that it starts with the smallest planet and ends with the largest one. This will be a sort on the diameter column.

Use order() to create positions

positions <- order(planets_df$diameter)

Use positions to sort planets_df

planets_df[positions, ]
##      name               type diameter rotation rings
## 1 Mercury Terrestrial planet    0.382    58.64 FALSE
## 4    Mars Terrestrial planet    0.532     1.03 FALSE
## 2   Venus Terrestrial planet    0.949  -243.02 FALSE
## 3   Earth Terrestrial planet    1.000     1.00 FALSE
## 8 Neptune          Gas giant    3.883     0.67  TRUE
## 7  Uranus          Gas giant    4.007    -0.72  TRUE
## 6  Saturn          Gas giant    9.449     0.43  TRUE
## 5 Jupiter          Gas giant   11.209     0.41  TRUE
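
To rank the planets from largest to smallest instead, one option is the decreasing argument of order(). A sketch on a trimmed-down planets_df (largest_first is a name chosen for this illustration):

```r
# Trimmed-down planets_df: only the columns needed for this sketch
planets_df <- data.frame(
  name = c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune"),
  diameter = c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883),
  stringsAsFactors = FALSE
)

# Row positions from largest to smallest diameter
positions <- order(planets_df$diameter, decreasing = TRUE)
largest_first <- planets_df[positions, ]

largest_first$name
# "Jupiter" "Saturn" "Uranus" "Neptune" "Earth" "Venus" "Mars" "Mercury"
```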

This exercise concludes the chapter on data frames. Remember that data frames are extremely important in R; you will need them all the time. Another frequently used data structure is the list. This will be the subject of the next chapter!


6: Lists

As opposed to vectors, lists can hold components of different types, just as your to-do lists can contain different categories of tasks. This chapter will teach you how to create, name, and subset these lists.

Lists, why would you need them?

Congratulations! At this point in the course you are already familiar with:

  • Vectors (one dimensional array): can hold numeric, character or logical values. The elements in a vector all have the same data type.

  • Matrices (two dimensional array): can hold numeric, character or logical values. The elements in a matrix all have the same data type.

  • Data frames (two-dimensional objects): can hold numeric, character or logical values. Within a column all elements have the same data type, but different columns can be of different data type.

Pretty sweet for an R newbie, right? ;-)

Lists, why would you need them? (2)

A list in R is similar to your to-do list at work or school: the different items on that list most likely differ in length, characteristic, and type of activity that has to be done.

A list in R allows you to gather a variety of objects under one name (that is, the name of the list) in an ordered way. These objects can be matrices, vectors, data frames, even other lists, etc. It is not even required that these objects are related to each other in any way.

You could say that a list is some kind of super data type: you can store practically any piece of information in it!

Cool. Let’s get our hands dirty!

Creating a list

Let us create our first list! To construct a list you use the function list():

my_list <- list(comp1, comp2 ...)

The arguments to the list function are the list components. Remember, these components can be matrices, vectors, other lists, …

Vector with numerics from 1 up to 10

my_vector <- 1:10 

Matrix with numerics from 1 up to 9

my_matrix <- matrix(1:9, ncol = 3)

First 10 rows of the built-in data frame mtcars

my_df <- mtcars[1:10,]

Construct list with these different elements:

my_list <- list(my_vector, my_matrix, my_df)
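
A quick sketch of how you might check what ended up in the list: each component keeps its own type, and a list built this way has no names yet.

```r
my_vector <- 1:10
my_matrix <- matrix(1:9, ncol = 3)
my_df <- mtcars[1:10, ]
my_list <- list(my_vector, my_matrix, my_df)

length(my_list)                # 3 components
is.matrix(my_list[[2]])        # TRUE: the matrix is stored unchanged
is.data.frame(my_list[[3]])    # TRUE: so is the data frame
is.null(names(my_list))        # TRUE: the components are unnamed so far
```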

Creating a named list

Well done, you’re on a roll!

Just like on your to-do list, you want to avoid not knowing or remembering what the components of your list stand for. That is why you should give names to them:

my_list <- list(name1 = your_comp1, name2 = your_comp2)

This creates a list with components that are named name1, name2, and so on. If you want to name the components of a list after you’ve created it, you can use the names() function as you did with vectors. The following commands are equivalent to the assignment above:

my_list <- list(your_comp1, your_comp2)
names(my_list) <- c("name1", "name2")

Adapt list() call to give the components names

my_list <- list(vec = my_vector, mat = my_matrix, df = my_df)
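
As a sketch, you can confirm that both ways of naming produce the same list. The variable names named_at_creation and named_afterwards are chosen for this illustration only.

```r
my_vector <- 1:10
my_matrix <- matrix(1:9, ncol = 3)
my_df <- mtcars[1:10, ]

# Names given inside the list() call
named_at_creation <- list(vec = my_vector, mat = my_matrix, df = my_df)

# Names attached afterwards with names()
named_afterwards <- list(my_vector, my_matrix, my_df)
names(named_afterwards) <- c("vec", "mat", "df")

names(named_at_creation)                          # "vec" "mat" "df"
identical(named_at_creation, named_afterwards)    # TRUE
```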

Creating a named list (2)

Being a huge movie fan (remember your job at LucasFilms), you decide to start storing information on good movies with the help of lists.

Start by creating a list for the movie “The Shining”.

Creating mov and act

mov <- "The Shining"
act <- c("Jack Nicholson", "Shelley Duvall", "Danny Lloyd","Scatman Crothers", "Barry Nelson")

Creating rev

scores <- c(4.5, 4, 5)
sources <- c("IMDb1", "IMDb2", "IMDb3")
comments <- c("Best Horror Film I Have Ever Seen", "A truly brilliant and scary film from Stanley Kubrick", "A masterpiece of psychological horror")
rev <- data.frame(scores, sources, comments)

Finish the code to build shining_list

shining_list <- list(moviename = mov, actors = act, reviews = rev)

Wonderful! You now know how to construct and name lists. As in the previous chapters, let’s look at how to select elements from lists. Head over to the next exercise.

Selecting elements from a list

Your list will often be built out of numerous elements and components. Therefore, getting a single element, multiple elements, or a component out of it is not always straightforward.

One way to select a component is using the numbered position of that component. For example, to “grab” the first component of shining_list you type

shining_list[[1]]

A quick way to check this out is to type it in the console. Important to remember: to select a component from a list you use double square brackets [[ ]], whereas to select elements from vectors you use single square brackets [ ]. Don’t mix them up!

You can also refer to the names of the components, with [[ ]] or with the $ sign. Two ways of selecting the data frame representing the reviews:

shining_list[["reviews"]]
shining_list$reviews

Besides selecting components, you often need to select specific elements out of these components. For example, with shining_list[[2]][1] you select from the second component, actors (shining_list[[2]]), the first element ([1]). When you type this in the console, you will see the answer is Jack Nicholson.
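
The difference between the bracket types is easiest to see side by side. The sketch below rebuilds shining_list with a shortened reviews data frame (the comments column is left out for brevity):

```r
mov <- "The Shining"
act <- c("Jack Nicholson", "Shelley Duvall", "Danny Lloyd", "Scatman Crothers", "Barry Nelson")
# Shortened reviews data frame: scores and sources only
rev <- data.frame(scores = c(4.5, 4, 5), sources = c("IMDb1", "IMDb2", "IMDb3"))
shining_list <- list(moviename = mov, actors = act, reviews = rev)

# Single brackets return a (smaller) list; double brackets return the component itself
class(shining_list[1])      # "list"
class(shining_list[[1]])    # "character"

# Component first, then an element inside it
shining_list[[2]][1]              # "Jack Nicholson"
shining_list$actors[1]            # "Jack Nicholson"
shining_list$reviews$scores[2]    # 4
```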

Creating a new list for another movie

You found reviews of another, more recent, Jack Nicholson movie: The Departed!

Scores  Comments
4.6     I would watch it again
5       Amazing!
4.8     I liked it
5       One of the best movies
4.2     Fascinating plot

It would be useful to collect together all the pieces of information about the movie, like the title, actors, and reviews into a single variable. Since these pieces of data are different shapes, it is natural to combine them in a list variable.

Create movie_title and movie_actors

movie_title <- "The Departed"
movie_actors <- c("Leonardo DiCaprio", "Matt Damon", "Jack Nicholson", "Mark Wahlberg", "Vera Farmiga", "Martin Sheen")

Use the table from the exercise to define the comments and scores vectors

scores <- c(4.6, 5, 4.8, 5, 4.2)
comments <- c("I would watch it again", "Amazing!", "I liked it", "One of the best movies", "Fascinating plot")

Save the average of the scores vector as avg_review

avg_review <- mean(scores)

Combine scores and comments into the reviews_df data frame

reviews_df <- data.frame(scores, comments)

Create and print out a list, called departed_list

departed_list <- list(movie_title, movie_actors, reviews_df, avg_review)
departed_list
## [[1]]
## [1] "The Departed"
## 
## [[2]]
## [1] "Leonardo DiCaprio" "Matt Damon"        "Jack Nicholson"   
## [4] "Mark Wahlberg"     "Vera Farmiga"      "Martin Sheen"     
## 
## [[3]]
##   scores               comments
## 1    4.6 I would watch it again
## 2    5.0               Amazing!
## 3    4.8             I liked it
## 4    5.0 One of the best movies
## 5    4.2       Fascinating plot
## 
## [[4]]
## [1] 4.72
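
The list above is unnamed, so its components can only be reached by position. As a sketch, giving them names makes later access more readable; the component names title, actors, reviews and average are choices made for this illustration.

```r
movie_title <- "The Departed"
movie_actors <- c("Leonardo DiCaprio", "Matt Damon", "Jack Nicholson",
                  "Mark Wahlberg", "Vera Farmiga", "Martin Sheen")
scores <- c(4.6, 5, 4.8, 5, 4.2)
comments <- c("I would watch it again", "Amazing!", "I liked it",
              "One of the best movies", "Fascinating plot")
avg_review <- mean(scores)
reviews_df <- data.frame(scores, comments, stringsAsFactors = FALSE)

# Same components as before, now with (illustrative) names
departed_list <- list(title = movie_title, actors = movie_actors,
                      reviews = reviews_df, average = avg_review)

departed_list$average               # 4.72
departed_list$reviews$comments[2]   # "Amazing!"
```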

Good work! You successfully created another list of movie information, and combined different components into a single list. Congratulations on finishing the course!


Module 2: Intermediate R

Intermediate R is the next stop on your journey in mastering the R programming language. In this R training, you will learn about conditional statements, loops, and functions to power your own R scripts. Next, make your R code more efficient and readable using the apply functions. Finally, the utilities chapter gets you up to speed with regular expressions in R, data structure manipulations, and times and dates. This course will allow you to take the next step in advancing your overall knowledge and capabilities while programming in R.

1: Conditionals and Control Flow

In this chapter, you’ll learn about relational operators for comparing R objects, and logical operators like “and” and “or” for combining TRUE and FALSE values. Then, you’ll use this knowledge to build conditional statements.

Video: Relational Operators

Equality

The most basic form of comparison is equality. Let’s briefly recap its syntax. The following statements all evaluate to TRUE (feel free to try them out in the console).

Playing around with equalities and inequalities

3 == (2 + 1)
## [1] TRUE
"intermediate" != "r"
## [1] TRUE
TRUE != FALSE
## [1] TRUE
"Rchitect" != "rchitect"
## [1] TRUE

Notice from the last expression that R is case sensitive: “R” is not equal to “r”. Keep this in mind when solving the exercises in this chapter!

Comparison of logicals

TRUE == FALSE
## [1] FALSE

Comparison of numerics

-6 * 14 != 17 - 101
## [1] FALSE

Comparison of character strings

"useR" == "user"
## [1] FALSE

Compare a logical with a numeric

TRUE == 1
## [1] TRUE

Awesome! Since TRUE coerces to 1 under the hood, TRUE == 1 evaluates to TRUE. Make sure not to mix up == (comparison) and = (used for setting arguments in function calls). == is what you need to check the equality of R objects.
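
As a small illustration of that difference, here is a sketch that is not part of the original exercise. The names x_vals and avg are made up for this example.

```r
# Hypothetical example values, not taken from the course exercise
x_vals <- c(4, NA, 6)

# `=` sets the na.rm argument of mean(); nothing is compared here
avg <- mean(x_vals, na.rm = TRUE)
print(avg)

# `==` compares two values and returns a logical
print(avg == 5)

# Logicals are coerced to numbers in comparisons and in arithmetic
print(TRUE == 1)
print(TRUE + TRUE)
```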

Greater and less than

Apart from equality operators, Filip also introduced the less than and greater than operators: < and >. You can also add an equal sign to express less than or equal to or greater than or equal to, respectively. Have a look at the following R expressions, that all evaluate to FALSE:

(1 + 2) > 4
"dog" < "Cats"
TRUE <= FALSE

Remember that for string comparison, R determines the greater than relationship based on alphabetical order. Also, keep in mind that TRUE is treated as 1 for arithmetic, and FALSE is treated as 0. Therefore, FALSE < TRUE is TRUE.

Comparison of numerics

-6 * 5 + 2 >= -10 + 1
## [1] FALSE

Comparison of character strings

"raining" <= "raining dogs"
## [1] TRUE

Comparison of logicals

TRUE > FALSE
## [1] TRUE

Make sure to have a look at the console output to see if R returns the results you expected.

Compare vectors

You are already aware that R is very good with vectors. Without having to change anything about the syntax, R’s relational operators also work on vectors. But be careful: the comparison is element-by-element, so the two vectors should have the same number of elements (i.e. they should be the same length); if they don’t, R recycles the shorter one.

Let’s go back to the example that was started in the video. You want to figure out whether your activity on social media platforms has paid off and decide to look at your results for LinkedIn and Facebook. The sample code in the editor initializes the vectors linkedin and facebook. Each of the vectors contains the number of profile views your LinkedIn and Facebook profiles had over the last seven days.

Create the vectors linkedin and facebook (used in the video)

linkedin <- c(16, 9, 13, 5, 2, 17, 14)
facebook <- c(17, 7, 5, 16, 8, 13, 14)

Quiet days

linkedin <= 5
## [1] FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE
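
The same element-by-element logic applies when both sides are vectors. The sketch below reuses the two vectors from above; the name li_more_popular is introduced here only for illustration.

```r
linkedin <- c(16, 9, 13, 5, 2, 17, 14)
facebook <- c(17, 7, 5, 16, 8, 13, 14)

# Each LinkedIn value is compared with the Facebook value for the same day
li_more_popular <- linkedin > facebook
print(li_more_popular)

# A single number on the right-hand side is compared with every element
print(linkedin > 15)
```

Day 7 comes out as FALSE in the first comparison because 14 is not strictly greater than 14.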

Compare matrices

R’s ability to deal with different data structures for comparisons does not stop at vectors. Matrices and relational operators also work together seamlessly!

First we’ll store the LinkedIn and Facebook data in a matrix (rather than in vectors). We’ll call this matrix views. The first row contains the LinkedIn information; the second row the Facebook information. The original vectors facebook and linkedin are still available as well.

Create social media data matrix

views <- matrix(c(linkedin, facebook), nrow = 2, byrow = TRUE)

When does views equal 13?

views == 13
##       [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]
## [1,] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
## [2,] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE

When is views less than or equal to 14?

views <= 14
##       [,1] [,2] [,3]  [,4] [,5]  [,6] [,7]
## [1,] FALSE TRUE TRUE  TRUE TRUE FALSE TRUE
## [2,] FALSE TRUE TRUE FALSE TRUE  TRUE TRUE

This exercise concludes the part on comparators. Now that you know how to query the relation between R objects, the next step will be to use the results to alter the behavior of your programs. Find out all about that in the next video!

Video: Logical Operators

& and |

Before you work your way through the next exercises, have a look at the following R expressions. All of them will evaluate to TRUE:

TRUE & TRUE
FALSE | TRUE
5 <= 5 & 2 < 3
3 < 4 | 7 < 6

Watch out: 3 < x < 7 to check if x is between 3 and 7 will not work; you’ll need 3 < x & x < 7 for that.
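
A short sketch of the two-comparison form, using made-up values for x and a hypothetical vector called views_sample:

```r
# Hypothetical values chosen for illustration
x <- 5
print(3 < x & x < 7)

x <- 8
print(3 < x & x < 7)

# The same pattern works element-by-element on a vector
views_sample <- c(2, 4, 6, 8)
between <- 3 < views_sample & views_sample < 7
print(between)
```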

In this exercise, you’ll be working with the last variable. We’ll make this variable equal the last value of the linkedin vector that you’ve worked with previously. The linkedin vector represents the number of LinkedIn views your profile had in the last seven days, remember?

Defining the last variable

last <- tail(linkedin, 1)

Is last under 5 or above 10?

last < 5 | last > 10
## [1] TRUE

Is last between 15 (exclusive) and 20 (inclusive)?

last > 15 & last <= 20
## [1] FALSE

Have one last look at the console before proceeding; do the results of the different expressions make sense?

& and | (2)

Like relational operators, logical operators work perfectly fine with vectors and matrices.

Ready for some advanced queries to gain more insights into your social outreach?

linkedin exceeds 10 but facebook below 10

linkedin > 10 & facebook < 10
## [1] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE

When were one or both visited at least 12 times?

linkedin >= 12 | facebook >= 12
## [1]  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE

When is views between 11 (exclusive) and 14 (inclusive)?

views > 11 & views <= 14
##       [,1]  [,2]  [,3]  [,4]  [,5]  [,6] [,7]
## [1,] FALSE FALSE  TRUE FALSE FALSE FALSE TRUE
## [2,] FALSE FALSE FALSE FALSE FALSE  TRUE TRUE

You’ll have noticed how easy it is to apply logical operators to vectors and matrices. What do these results tell us? The third day of the recordings was the only day where your LinkedIn profile was visited more than 10 times, while your Facebook profile wasn’t. Can you draw similar conclusions for the other results?

Question: Reverse the result: !
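
The question itself is not reproduced in these notes. As a reminder of what ! does, here is a minimal sketch that reuses the linkedin vector; the name not_quiet is made up for this example.

```r
linkedin <- c(16, 9, 13, 5, 2, 17, 14)

# ! flips a single logical value
print(!TRUE)

# Applied to a vector, ! flips every element
not_quiet <- !(linkedin <= 5)
print(not_quiet)
```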

Blend it all together

With the things you’ve learned by now, you’re able to solve pretty cool problems.

Instead of recording the number of views for your own LinkedIn profile, suppose you conducted a survey inside the company you’re working for. You’ve asked every employee with a LinkedIn profile how many visits their profile has had over the past seven days. The data will be stored in the matrix employees_views, then a data frame li_df is created from this matrix, with appropriate names for the rows. Finally, names will be added to the columns.

Creating data frame li_df

employees_views <- matrix(c(2, 3, 3, 6, 4, 2, 0, 19, 23, 18, 22, 23, 29, 25, 24, 18, 15, 19, 18, 22, 17, 22, 18, 27, 26, 19, 21, 25, 25, 25, 26, 31, 24, 36, 37, 22, 20, 29, 26, 23, 22, 29, 0, 4, 2, 2, 3, 4, 2, 12, 3, 15, 7, 1, 15, 11, 19, 22, 22, 19, 25, 24, 23, 23, 12, 19, 25, 18, 22, 22, 29, 27, 23, 25, 29, 30, 17, 13,   13, 20, 17, 12, 22, 20, 7, 17, 9, 5, 11, 9, 9, 26, 27, 28, 36, 29, 31, 30, 7, 6, 4, 11, 5, 5, 15, 32, 35, 31, 35, 24, 25, 36, 7, 17, 9, 12, 13, 6, 12, 9, 6, 3, 12, 3, 8, 6, 0, 1, 11, 6, 0, 4, 11, 9, 12, 6, 13, 12, 13, 11, 6, 15, 15, 10, 9, 7, 18, 17,   17, 12, 4, 14, 17, 7, 1, 12, 8, 2, 4, 4, 11, 5, 8, 0, 1, 6, 3, 1, 2, 7, 5, 3, 1, 5, 5, 29, 25, 32, 28,  28, 27, 27, 17, 15, 17, 23, 23, 17, 22, 26, 32, 33, 30, 33,   28, 26, 27, 29, 24, 29, 26, 31, 28, 4, 1, 1, 2, 1, 7, 4, 22, 22, 17, 20, 14, 19, 13, 9, 11, 7, 10, 8, 15, 5,  6, 5, 12, 5,   17, 17, 4, 18, 17, 12, 22, 22, 13, 12, 2, 12, 13, 7, 10, 6, 2, 32, 26, 20, 23, 24, 25, 21,  5, 13, 12, 11, 6, 5, 10,  6, 10, 11, 6, 6, 2, 5, 30, 37, 32, 35, 37, 41, 42, 34, 33, 32, 35, 33,   27,  35,  15, 19, 21, 18, 22, 26, 22, 28, 29, 30, 19, 21, 19, 26,  6, 8, 6, 7, 17, 11, 14, 17, 22, 27, 24, 18, 28, 24,  6, 10, 17, 18,  13, 10, 7, 18, 19, 22, 17, 21, 15, 23, 21, 27, 28, 28, 26, 17, 25, 10, 18, 20, 18, 12, 19, 17,  6, 15, 15, 15, 10, 14, 2, 30, 28, 29, 31, 24, 20, 25), nrow = 50, byrow = TRUE)
li_df <- data.frame(employees_views, row.names = c("employee_1", "employee_2", "employee_3", "employee_4", "employee_5", "employee_6", "employee_7", "employee_8", "employee_9", "employee_10", "employee_11", "employee_12", "employee_13", "employee_14", "employee_15", "employee_16", "employee_17", "employee_18", "employee_19", "employee_20", "employee_21", "employee_22", "employee_23", "employee_24", "employee_25", "employee_26", "employee_27", "employee_28", "employee_29", "employee_30", "employee_31", "employee_32", "employee_33", "employee_34", "employee_35", "employee_36", "employee_37", "employee_38", "employee_39", "employee_40", "employee_41", "employee_42", "employee_43", "employee_44", "employee_45", "employee_46", "employee_47", "employee_48", "employee_49", "employee_50") )
names(li_df)[1] <- "day1"
names(li_df)[2] <- "day2"
names(li_df)[3] <- "day3"
names(li_df)[4] <- "day4"
names(li_df)[5] <- "day5"
names(li_df)[6] <- "day6"
names(li_df)[7] <- "day7"

Select the second column, named day2, from li_df: second

second <- li_df[, 2]

Build a logical vector, TRUE if value in second is extreme: extremes

extremes <- second < 5 | second > 25

Count the number of TRUEs in extremes

sum(extremes)
## [1] 16

Head over to the next video and learn how relational and logical operators can be used to alter the flow of your R scripts.

Video: Conditional Statements

The if statement

Before diving into some exercises on the if statement, have another look at its syntax:

if (condition) {
  expr
}

Remember your vectors with social profile views? Let’s look at it from another angle. We create a variable called medium which gives information about the social website, and another called num_views which denotes the actual number of views that particular medium had on the last day of your recordings.

Defining these variables related to your last day of recordings

medium <- "LinkedIn"
num_views <- 14

Examine the if statement for medium

if (medium == "LinkedIn") {
  print("Showing LinkedIn information")
}
## [1] "Showing LinkedIn information"

Write the if statement for num_views

if (num_views > 15) {
  print("You are popular!")
}

Try to see what happens if you change the medium and num_views variables and run your code again. Let’s further customize these if statements in the next exercise.

Add an else

You can only use an else statement in combination with an if statement. The else statement does not require a condition; its corresponding code is simply run if all of the preceding conditions in the control structure are FALSE. Here’s a recipe for its usage:

if (condition) {
  expr1
} else {
  expr2
}

It’s important that the else keyword comes on the same line as the closing bracket of the if part!
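
One way to see why the placement matters at the top level of a script is to hand both layouts to R’s parser as text. This is only a sketch: parses_ok() is a hypothetical helper written for this note, built on the base functions parse() and tryCatch().

```r
# Both layouts are stored as text so they can be parsed without being run
good_code <- "if (TRUE) {\n  1\n} else {\n  2\n}"
bad_code  <- "if (TRUE) {\n  1\n}\nelse {\n  2\n}"

# Hypothetical helper: TRUE if the text parses, FALSE if the parser raises an error
parses_ok <- function(code) {
  tryCatch({
    parse(text = code)
    TRUE
  }, error = function(e) FALSE)
}

print(parses_ok(good_code))
print(parses_ok(bad_code))
```

In the second layout the if statement is already complete when the parser reaches the line break after the closing bracket, so the else that follows has nothing to attach to.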

We will now extend the if statements that we coded in the previous exercises with the appropriate else statements!

Control structure for medium

if (medium == "LinkedIn") {
  print("Showing LinkedIn information")
} else {
  print("Unknown medium")
}

Control structure for num_views

if (num_views > 15) {
  print("You're popular!")
} else {
  print("Try to be more visible!")
}

You also had Facebook information available, remember? Time to add some more statements to our control structures using else if!

Customize further: else if

The else if statement allows you to further customize your control structure. You can add as many else if statements as you like. Keep in mind that R ignores the remainder of the control structure once a condition has been found that is TRUE and the corresponding expressions have been executed. Here’s an overview of the syntax to freshen your memory:

if (condition1) {
  expr1
} else if (condition2) {
  expr2
} else if (condition3) {
  expr3
} else {
  expr4
}

Again, it’s important that the else if keywords come on the same line as the closing bracket of the previous part of the control construct.

Control structure for medium

if (medium == "LinkedIn") {
  print("Showing LinkedIn information")
} else if (medium == "Facebook") {
  print("Showing Facebook information")
} else {
  print("Unknown medium")
}
## [1] "Showing LinkedIn information"

Control structure for num_views

if (num_views > 15) {
  print("You're popular!")
} else if (num_views <= 15 & num_views > 10) {
  print("Your number of views is average")
} else {
  print("Try to be more visible!")
}
## [1] "Your number of views is average"

Have another look at the second control structure. Because R abandons the control flow as soon as it finds a condition that is met, you can simplify the condition for the else if part in the second construct to num_views > 10.
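
The simplified construct could look like the sketch below. Storing the message in a variable called msg is an addition made here so the result is easy to inspect; the original exercise prints directly.

```r
num_views <- 14

if (num_views > 15) {
  msg <- "You're popular!"
} else if (num_views > 10) {
  # Only reached when num_views > 15 was FALSE, so num_views <= 15 is already known
  msg <- "Your number of views is average"
} else {
  msg <- "Try to be more visible!"
}

print(msg)
```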

Question: Else if 2.0

Take control!

In this exercise, you will combine everything that you’ve learned so far: relational operators, logical operators and control constructs. You’ll need it all!

Define li and fb

li <- 15
fb <- 9

These two variables, li and fb denote the number of profile views your LinkedIn and Facebook profile had on the last day of recordings. Go through the instructions to create R code that generates a ‘social media score’, sms, based on the values of li and fb.

Code the control-flow construct

if (li >= 15 & fb >= 15) {
  sms <- 2 * (li + fb)
} else if (li < 10 & fb < 10) {
  sms <- 0.5 * (li + fb)
} else {
  sms <- li + fb
}
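
To check each branch of this construct, one option is to wrap it in a function and call it with different inputs. The function name social_score() and the extra test values are assumptions made for this sketch; only li = 15 and fb = 9 come from the exercise.

```r
# Hypothetical wrapper around the control-flow construct above
social_score <- function(li, fb) {
  if (li >= 15 & fb >= 15) {
    2 * (li + fb)
  } else if (li < 10 & fb < 10) {
    0.5 * (li + fb)
  } else {
    li + fb
  }
}

print(social_score(15, 9))   # values from the exercise: falls through to the else branch
print(social_score(20, 18))  # both at least 15: the sum is doubled
print(social_score(4, 6))    # both below 10: the sum is halved
```

With the exercise values, sms therefore ends up as 24.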

2: Loops

Loops can come in handy on numerous occasions. While loops are like repeated if statements, the for loop is designed to iterate over all elements in a sequence. Learn about them in this chapter.

Video: While loop

Write a while loop

Let’s get you started with building a while loop from the ground up. Have another look at its recipe:

while (condition) {
  expr
}

Remember that the condition part of this recipe should become FALSE at some point during the execution. Otherwise, the while loop will go on indefinitely.

If your session expires when you run your code, check the body of your while loop carefully.

Have a look at the code below; it initializes the speed variable and already provides a while loop template to get you started.

Initialize the speed variable

speed <- 64

Code the while loop

while (speed > 30) {
  print("Slow down!")
  speed <- speed - 7
}
## [1] "Slow down!"
## [1] "Slow down!"
## [1] "Slow down!"
## [1] "Slow down!"
## [1] "Slow down!"

Throw in more conditionals

In the previous exercise, you simulated the interaction between a driver and a driver’s assistant: When the speed was too high, “Slow down!” got printed out to the console, resulting in a decrease of your speed by 7 units.

There are several ways in which you could make your driver’s assistant more advanced. For example, the assistant could give you different messages based on your speed or provide you with a current speed at a given moment.

A while loop similar to the one you’ve coded in the previous exercise is already available in the editor. It prints out your current speed, but there’s no code that decreases the speed variable yet, which is pretty dangerous. Can you make the appropriate changes?

Note that we’ll need to assign the value of 64 to the variable speed, as it currently has the value 29.

Initialize the speed variable

speed <- 64

Extend/adapt the while loop

while (speed > 30) {
  print(paste("Your speed is",speed))
  if (speed > 48) {
    print("Slow down big time!")
    speed <- speed - 11
  } else {
    print("Slow down!")
    speed <- speed - 6
  }
}
## [1] "Your speed is 64"
## [1] "Slow down big time!"
## [1] "Your speed is 53"
## [1] "Slow down big time!"
## [1] "Your speed is 42"
## [1] "Slow down!"
## [1] "Your speed is 36"
## [1] "Slow down!"

To further improve our driver assistant model, head over to the next exercise!

Stop the while loop: break

There are some very rare situations in which severe speeding is necessary: what if a hurricane is approaching and you have to get away as quickly as possible? You don’t want the driver’s assistant sending you speeding notifications in that scenario, right?

This seems like a great opportunity to include the break statement in the while loop you’ve been working on. Remember that the break statement is a control statement. When R encounters it, the while loop is abandoned completely.

Once again, we begin by initialising the speed variable.

Initialize the speed variable

speed <- 88

Adding a break to our while loop

while (speed > 30) {
  print(paste("Your speed is", speed))
  if (speed > 80) {
    break
  }
  if (speed > 48) {
    print("Slow down big time!")
    speed <- speed - 11
  } else {
    print("Slow down!")
    speed <- speed - 6
  }
}
## [1] "Your speed is 88"

Now that you’ve correctly solved this exercise, feel free to play around with different values of speed to see how the while loop handles the different cases.

Build a while loop from scratch

The previous exercises guided you through developing a pretty advanced while loop, containing a break statement and different messages and updates as determined by control flow constructs. If you manage to solve this comprehensive exercise using a while loop, you’re totally ready for the next topic: the for loop.

Initialize i as 1

i <- 1

Code the while loop

while (i <= 10) {
  print(3 * i)
  if ((3 * i) %% 8 == 0) {
    break
  }
  i <- i + 1
}
## [1] 3
## [1] 6
## [1] 9
## [1] 12
## [1] 15
## [1] 18
## [1] 21
## [1] 24

Head over to the next video!

Video: For loop

Loop over a vector

In the previous video, Filip told you about two different strategies for using the for loop. To refresh your memory, consider the following loops that are equivalent in R:

primes <- c(2, 3, 5, 7, 11, 13)

# loop version 1
for (p in primes) {
  print(p)
}

# loop version 2
for (i in 1:length(primes)) {
  print(primes[i])
}

Remember our linkedin vector? It’s a vector that contains the number of views your LinkedIn profile had in the last seven days. Let’s remind ourselves of the vector.

Loop version 1

for(elements in linkedin) {
  print (elements)
}
## [1] 16
## [1] 9
## [1] 13
## [1] 5
## [1] 2
## [1] 17
## [1] 14

Loop version 2

for(i in 1:length(linkedin)) {
  print (linkedin[i])
}
## [1] 16
## [1] 9
## [1] 13
## [1] 5
## [1] 2
## [1] 17
## [1] 14
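
One caveat about loop version 2 that the exercise does not cover: 1:length(x) counts down from 1 to 0 when x is empty, so the loop body would still run twice. The base function seq_along() avoids this. The sketch below uses a hypothetical empty vector called empty and a counter called iterations.

```r
empty <- numeric(0)

# 1:length(empty) is 1:0, which is the vector 1, 0
print(1:length(empty))

# seq_along() returns an empty sequence instead
print(seq_along(empty))

# A loop over seq_along(empty) therefore never runs its body
iterations <- 0
for (i in seq_along(empty)) {
  iterations <- iterations + 1
}
print(iterations)
```

For a non-empty vector such as linkedin, seq_along(linkedin) and 1:length(linkedin) give the same indices.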

Loop over a list

Looping over a list is just as easy and convenient as looping over a vector. There are again two different approaches here:

primes_list <- list(2, 3, 5, 7, 11, 13)

# loop version 1
for (p in primes_list) {
  print(p)
}

# loop version 2
for (i in 1:length(primes_list)) {
  print(primes_list[[i]])
}

Recall from earlier that to select elements from lists, rather than single square brackets, we need double square brackets [[ ]]. You will see them again in loop version 2 above.
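
The difference between the two kinds of brackets can be checked directly. In this sketch the names single and double are made up for illustration.

```r
primes_list <- list(2, 3, 5, 7, 11, 13)

# Single brackets return a list that contains the selected element
single <- primes_list[3]
print(class(single))

# Double brackets return the element itself
double <- primes_list[[3]]
print(class(double))

# Only the extracted element can be used in arithmetic
print(double + 1)
```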

Suppose you have a list of all sorts of information on New York City: its population size, the names of the boroughs, and whether it is the capital of the United States. We first prepare a list nyc with all this information (source: Wikipedia).

Specify nyc list

nyc <- list(pop = 8405837, 
            boroughs = c("Manhattan", "Bronx", "Brooklyn", "Queens", "Staten Island"), 
            capital = FALSE)

Loop version 1

for(elements in nyc) {
  print(elements)
}
## [1] 8405837
## [1] "Manhattan"     "Bronx"         "Brooklyn"      "Queens"       
## [5] "Staten Island"
## [1] FALSE

Loop version 2

for(i in 1:length(nyc)) {
  print(nyc[[i]])
}
## [1] 8405837
## [1] "Manhattan"     "Bronx"         "Brooklyn"      "Queens"       
## [5] "Staten Island"
## [1] FALSE

Filip mentioned that for loops can also be used for matrices. Let’s put that to a test in the next exercise.

Loop over a matrix

We’ll define a matrix ttt, that represents the status of a tic-tac-toe game. It contains the values “X”, “O” and “NA”. We’ll print out ttt in the console once it’s been defined to get a closer look. On row 1 and column 1, there’s “O”, while on row 3 and column 2 there’s “NA”.

Define ttt

ttt <- matrix(c("O", NA, "X", NA, "O", "O", "X", NA, "X"), byrow = TRUE, nrow = 3)
ttt
##      [,1] [,2] [,3]
## [1,] "O"  NA   "X" 
## [2,] NA   "O"  "O" 
## [3,] "X"  NA   "X"

To solve this exercise, you’ll need a for loop inside a for loop, often called a nested loop. Doing this in R is a breeze! Simply use the following recipe:

for (var1 in seq1) {
  for (var2 in seq2) {
    expr
  }
}

Define the double for loop

for (i in 1:nrow(ttt)) {
  for (j in 1:ncol(ttt)) {
    print(paste("On row ", i, " and column ", j, " the board contains ", ttt[i,j]))
  }
}
## [1] "On row  1  and column  1  the board contains  O"
## [1] "On row  1  and column  2  the board contains  NA"
## [1] "On row  1  and column  3  the board contains  X"
## [1] "On row  2  and column  1  the board contains  NA"
## [1] "On row  2  and column  2  the board contains  O"
## [1] "On row  2  and column  3  the board contains  O"
## [1] "On row  3  and column  1  the board contains  X"
## [1] "On row  3  and column  2  the board contains  NA"
## [1] "On row  3  and column  3  the board contains  X"

Notice that this loop went through the whole of row 1 before moving on to row 2. This makes sense, as the rows are the outer loop and columns are the inner loop.
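
The double spaces in the printed messages come from paste(), which puts its default separator (a single space) between the pieces, on top of the spaces already typed inside the strings. A sketch of two ways to get single spaces, using made-up variable names:

```r
# Spaces inside the strings plus the default separator give double spaces
with_spaces <- paste("On row ", 1, " and column ", 2)
print(with_spaces)

# Option 1: drop the spaces inside the strings and rely on the separator
tidy <- paste("On row", 1, "and column", 2)
print(tidy)

# Option 2: keep the spaces and use paste0(), which adds no separator
tidy0 <- paste0("On row ", 1, " and column ", 2)
print(tidy0)
```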

You’re sufficiently comfortable with basic for looping, so it’s time to step it up a notch!

Mix it up with control flow

Let’s return to the LinkedIn profile views data, stored in a vector linkedin. In the first exercise on for loops you already did a simple printout of each element in this vector. A little more in-depth interpretation of this data wouldn’t hurt, right? Time to throw in some conditionals! As with the while loop, you can use the if and else statements inside the for loop.

Code the for loop with conditionals

for (li in linkedin) {
  if (li > 10) {
    print ("You're popular!")
  } else {
    print ("Be more visible!")
  }
  print(li)
}
## [1] "You're popular!"
## [1] 16
## [1] "Be more visible!"
## [1] 9
## [1] "You're popular!"
## [1] 13
## [1] "Be more visible!"
## [1] 5
## [1] "Be more visible!"
## [1] 2
## [1] "You're popular!"
## [1] 17
## [1] "You're popular!"
## [1] 14

In the next exercise, you’ll customize this for loop even further with break and next statements.

Next, you break it

In the editor on the right you’ll find a possible solution to the previous exercise. The code loops over the linkedin vector and prints out different messages depending on the values of li.

In this exercise, you will use the break and next statements:

The break statement abandons the active loop: the remaining code in the loop is skipped and the loop is not iterated over anymore. The next statement skips the remainder of the code in the loop, but continues the iteration.

Adapt/extend the for loop

for (li in linkedin) {
  if (li > 16) {
    print ("This is ridiculous, I'm outta here!")
    break
  }
  if (li < 5) {
    print ("This is too embarrassing!")
    next
  }
  if (li > 10) {
    print("You're popular!")
  } else {
    print("Be more visible!")
  }
  print(li)
}
## [1] "You're popular!"
## [1] 16
## [1] "Be more visible!"
## [1] 9
## [1] "You're popular!"
## [1] 13
## [1] "Be more visible!"
## [1] 5
## [1] "This is too embarrassing!"
## [1] "This is ridiculous, I'm outta here!"

for, break, next? We name it, you can do it!

Build a for loop from scratch

This exercise will not introduce any new concepts on for loops.

We first define a variable rquote, then split this variable up into a vector that contains separate letters, and store them in a vector chars using the strsplit() function.

Can you write code that counts the number of r’s that come before the first u in rquote?

Pre-defined variables

rquote <- "r's internals are irrefutably intriguing"
chars <- strsplit(rquote, split = "")[[1]]

Initialize rcount

rcount <- 0

Finish the for loop

for (char in chars) {
  if (char == "u") {
    break
  }
  if (char == "r") {
    rcount <- rcount + 1
  }
}
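
The notes stop before inspecting the result. The sketch below repeats the loop, prints rcount, and cross-checks it with a vectorized count; first_u and rcount_vec are names introduced here for illustration only.

```r
rquote <- "r's internals are irrefutably intriguing"
chars <- strsplit(rquote, split = "")[[1]]

# Loop from the exercise: stop at the first "u", count every "r" before it
rcount <- 0
for (char in chars) {
  if (char == "u") {
    break
  }
  if (char == "r") {
    rcount <- rcount + 1
  }
}
print(rcount)

# Cross-check without a loop: look only at the characters before the first "u"
first_u <- which(chars == "u")[1]
rcount_vec <- sum(chars[seq_len(first_u - 1)] == "r")
print(rcount_vec)
```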

3: Functions

Functions are an extremely important concept in almost every programming language, and R is no different. Learn what functions are and how to use them—then take charge by writing your own functions.

Video: Introduction to functions

Function documentation

Before even thinking of using an R function, you should clarify which arguments it expects. All the relevant details such as a description, usage, and arguments can be found in the documentation. To consult the documentation on the sample() function, for example, you can use one of the following R commands:

help(sample)
?sample

If you execute these commands in the console of the DataCamp interface, you’ll be redirected to www.rdocumentation.org. If you execute these commands in the console of an IDE (integrated development environment) such as RStudio, the documentation will open in the Help panel.

A quick hack to see the arguments of the sample() function is the args() function. Try it out in the console:

args(sample)

In the next exercises, you’ll be learning how to use the mean() function with increasing complexity. The first thing you’ll have to do is get acquainted with the mean() function.

Inspect the arguments of the mean() function

args(mean)
## function (x, ...) 
## NULL

That wasn’t too hard, was it? Take a look at the documentation and head over to the next exercise.

Use a function

The documentation on the mean() function gives us quite some information:

  • The mean() function computes the arithmetic mean.
  • The most general method takes multiple arguments: x and ....
  • The x argument should be a vector containing numeric, logical or time-related information. (Remember what we learnt about the numeric values of TRUE and FALSE to understand how you could take an average of logical values!)

Remember that R can match arguments both by position and by name. Can you still remember the difference? You’ll find out in this exercise!
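
As a quick refresher, here is a sketch of both matching styles. It uses mean() and also matrix(), which appeared earlier in these notes; the variable names are made up for this example.

```r
linkedin <- c(16, 9, 13, 5, 2, 17, 14)

# By position: the first argument is matched to x
by_position <- mean(linkedin)

# By name: the argument is labelled explicitly
by_name <- mean(x = linkedin)
print(identical(by_position, by_name))

# With names, the order of the arguments no longer matters
m_position <- matrix(1:6, 2)
m_name <- matrix(nrow = 2, data = 1:6)
print(identical(m_position, m_name))
```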

Once more, you’ll be working with the view counts of your social network profiles for the past 7 days.

Calculate average number of views

avg_li <- mean(linkedin)
avg_fb <- mean(facebook)

Inspect avg_li and avg_fb

avg_li
## [1] 10.85714
avg_fb
## [1] 11.42857

I’m sure you’ve already called more advanced R functions in your history as a programmer. Now you also know what actually happens under the hood ;-)

Use a function (2)

Check the documentation on the mean() function again:

?mean

The Usage section of the documentation includes two versions of the mean() function. The first usage,

mean(x, ...)

is the most general usage of the mean function. The ‘Default S3 method’, however, is:

mean(x, trim = 0, na.rm = FALSE, ...)

The ... is called the ellipsis. It is a way for R to pass arguments along without the function having to name them explicitly. The ellipsis will be treated in more detail in future courses.

For the remainder of this exercise, just work with the second usage of the mean function. Notice that both trim and na.rm have default values. This makes them optional arguments.

Calculate the mean of the sum

avg_sum <- mean(linkedin + facebook)

Calculate the trimmed mean of the sum

avg_sum_trimmed <- mean(linkedin + facebook, trim = 0.2)

Inspect both new variables

avg_sum
## [1] 22.28571
avg_sum_trimmed
## [1] 22.6

When the trim argument is not zero, it chops off a fraction (equal to trim) of the observations from each end of the sorted vector you pass as argument x before the mean is computed.
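
To make the trimming concrete, the result can be reproduced by hand. This is a sketch under the assumption that the number of values dropped at each end is the trim fraction times the length, rounded down; the helper names total, n_drop, sorted, kept and manual are made up.

```r
linkedin <- c(16, 9, 13, 5, 2, 17, 14)
facebook <- c(17, 7, 5, 16, 8, 13, 14)
total <- linkedin + facebook

trim <- 0.2
n_drop <- floor(length(total) * trim)  # 7 * 0.2 = 1.4, so one value per end

# Sort, then drop n_drop values from the bottom and from the top
sorted <- sort(total)
kept <- sorted[(n_drop + 1):(length(sorted) - n_drop)]
print(kept)

manual <- mean(kept)
print(manual)

# Compare with the built-in trimmed mean
print(all.equal(manual, mean(total, trim = 0.2)))
```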

Use a function (3)

In the video, Filip guided you through the example of specifying arguments of the sd() function. The sd() function has an optional argument, na.rm, that specifies whether or not to remove missing values from the input vector before calculating the standard deviation.

If you’ve had a good look at the documentation, you’ll know by now that the mean() function also has this argument, na.rm, and it does the exact same thing. By default, it is set to FALSE, as the Usage of the default S3 method shows:

mean(x, trim = 0, na.rm = FALSE, ...)

Let’s see what happens if your vectors linkedin and facebook contain missing values (NA).

The linkedin and facebook vectors have been amended to include some NA’s

linkedin <- c(16, 9, 13, 5, NA, 17, 14)
facebook <- c(17, NA, 5, 16, 8, 13, 14)

Basic average of linkedin

mean(linkedin)
## [1] NA

Advanced average of linkedin

mean(linkedin, na.rm = TRUE)
## [1] 12.33333

Functions inside functions

You already know that R functions return objects that you can then use somewhere else. This makes it easy to use functions inside functions, as you’ve seen before:

speed <- 31
print(paste("Your speed is", speed))

Notice that both the print() and paste() functions use the ellipsis - … - as an argument. Can you figure out how they’re used?

Calculate the mean absolute deviation

mean(abs(linkedin - facebook), na.rm = TRUE)
## [1] 4.8

Question: Required, or optional?

Using functions that are already available in R is pretty straightforward, but how about writing your own functions to supercharge your R programs? The next video will tell you how.

Video: Writing functions

Write your own function

Wow, things are getting serious… you’re about to write your own function! Before you have a go at it, have a look at the following function template:

my_fun <- function(arg1, arg2) {
  body
}

Notice that this recipe uses the assignment operator (<-) just as if you were assigning a vector to a variable for example. This is not a coincidence. Creating a function in R basically is the assignment of a function object to a variable! In the recipe above, you’re creating a new R variable my_fun, that becomes available in the workspace as soon as you execute the definition. From then on, you can use the my_fun as a function.
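
Because a function is an ordinary R object, it can be inspected and copied like any other variable. In this sketch the body of my_fun (adding the two arguments) and the name also_my_fun are placeholders chosen for illustration.

```r
# Placeholder body: the template in the text leaves it unspecified
my_fun <- function(arg1, arg2) {
  arg1 + arg2
}

# The variable my_fun holds a function object
print(class(my_fun))

# Assigning it to another name gives a second way to call the same function
also_my_fun <- my_fun
print(also_my_fun(1, 2))
```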

Create a function pow_two()

pow_two <- function(x) {
  x ^ 2
}

Use the function

pow_two(12)
## [1] 144

Create a function sum_abs()

sum_abs <- function(x, y) {
  abs(x) + abs(y)
}

Use the function

sum_abs(-2, 3)
## [1] 5

Step it up a notch in the next exercise!

Write your own function (2)

There are situations in which your function does not require an input. Let’s say you want to write a function that gives us the random outcome of throwing a fair die:

throw_die <- function() {
  number <- sample(1:6, size = 1)
  number
}

throw_die()

Up to you to code a function that doesn’t take any arguments!

Define the function hello()

hello <- function() {
  print("Hi there!")
  return(TRUE)
}

Call the function hello()

hello()
## [1] "Hi there!"
## [1] TRUE
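
The two output lines have different origins: the first comes from print() inside the function, the second is the return value that the console prints automatically. A sketch that separates the two by storing the return value in a made-up variable called result:

```r
hello <- function() {
  print("Hi there!")
  return(TRUE)
}

# The message is still printed, but the return value is stored instead of shown
result <- hello()

# Printing the stored value shows the TRUE that hello() returned
print(result)
```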

Write your own function (3)

Do you still remember the difference between an argument with and without default values? Have another look at the sd() function by typing ?sd in the console. The usage section shows the following information:

sd(x, na.rm = FALSE)

This tells us that x has to be defined for the sd() function to be called correctly; however, na.rm already has a default value. Not specifying this argument won’t cause an error.

You can define default argument values in your own R functions as well. You can use the following recipe to do so:

my_fun <- function(arg1, arg2 = val2) {
  body
}

The editor on the right already includes an extended version of the pow_two() function from before. Can you finish it?

Finish the pow_two() function

pow_two <- function(x, print_info = TRUE) {
  y <- x ^ 2
  if (print_info == TRUE) {
    print(paste(x, "to the power two equals", y))
  }
  return(y)
}

Playing around with pow_two’s new argument

pow_two(12)
## [1] "12 to the power two equals 144"
## [1] 144
pow_two(12, print_info = TRUE)
## [1] "12 to the power two equals 144"
## [1] 144
pow_two(12, print_info = FALSE)
## [1] 144

Have you tried calling this pow_two() function? Try pow_two(5), pow_two(5, TRUE) and pow_two(5, FALSE). Which ones give different results?

Question: Function scoping

Normally I don’t write the question text here. However, in the case of this question, I think it’s useful. The question goes like this…

An issue that Filip did not discuss in the video is function scoping. It implies that variables that are defined inside a function are not accessible outside that function. Try running the following code and see if you understand the results:

pow_two <- function(x) {
  y <- x ^ 2
  return(y)
}
pow_two(4)
## [1] 16

Did you try calling y and x? Did you receive an error? y was defined inside the pow_two() function and therefore it is not accessible outside of that function. This is also true for the function’s arguments of course - x in this case.
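
The same check can be done without triggering an error by asking R whether a name exists. This sketch uses a renamed copy of the function, pow_two_demo(), with the hypothetical names x_arg and squared_local, so the check does not clash with anything already in your workspace.

```r
# Renamed copy of pow_two(); the names are chosen to be unlikely to exist already
pow_two_demo <- function(x_arg) {
  squared_local <- x_arg ^ 2
  return(squared_local)
}

result <- pow_two_demo(4)
print(result)

# Neither the local variable nor the argument is visible outside the function
print(exists("squared_local"))
print(exists("x_arg"))
```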

If you’re familiar with other programming languages, you might wonder whether R passes arguments by value or by reference. Find out in the next exercise!

Question: R passes arguments by value

Once again, the text of this question is quite useful to us, so I’ll reprint it.

The title gives it away already: R passes arguments by value. What does this mean? Simply put, it means that an R function cannot change the variable that you input to that function. Let’s look at a simple example (try it in the console):

triple <- function(x) {
  x <- 3*x
  x
}
a <- 5
triple(a)
## [1] 15
a
## [1] 5

Inside the triple() function, the argument x gets overwritten with its value times three. Afterwards this new x is returned. If you call this function with a variable a set equal to 5, you obtain 15. But did the value of a change? If R were to pass a to triple() by reference, the override of the x inside the function would ripple through to the variable a, outside the function. However, R passes by value, so the R objects you pass to a function can never change unless you do an explicit assignment. a remains equal to 5, even after calling triple(a).

Given that R passes arguments by value and not by reference, the value of count is not changed after the first two calls of increment(). Only in the final expression, where count is re-assigned explicitly, does the value of count change.
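
The code this answer refers to is not reproduced above. A sketch that matches the description, assuming increment() takes a value x and an optional step inc (both names are assumptions) and that count starts at 5:

```r
# Assumed definition: add inc to x and return the new value
increment <- function(x, inc = 1) {
  x <- x + inc
  x
}

count <- 5
a <- increment(count, 2)      # a is 7, count is still 5
b <- increment(count)         # b is 6, count is still 5
count <- increment(count, 2)  # explicit re-assignment: count is now 7
```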

R you functional?

Now that you’ve acquired some skills in defining functions with different types of arguments and return values, you should try to create more advanced functions. As you’ve noticed in the previous exercises, it’s perfectly possible to add control-flow constructs, loops and even other functions to your function body.

Remember our social media example, using the vectors linkedin and facebook? As a first step, you will be writing a function that can interpret a single value of this vector. In the next exercise, you will write another function that can handle an entire vector at once.

Note that the linkedin and facebook vectors will be returned to their original forms (without NAs).

Define linkedin and facebook

linkedin <- c(16, 9, 13, 5, 2, 17, 14)
facebook <- c(17, 7, 5, 16, 8, 13, 14)

Define the interpret function

interpret <- function(num_views) {
  if (num_views > 15) {
    print("You're popular!")
    return(num_views)
  } else {
    print("Try to be more visible!")
    return(0)
  }
}

Call the interpret function twice

interpret(linkedin[1])
## [1] "You're popular!"
## [1] 16
interpret(facebook[2])
## [1] "Try to be more visible!"
## [1] 0

The annoying thing here is that interpret() only takes one argument. Proceed to the next exercise to implement something more useful.

R you functional? (2)

A possible implementation of the interpret() function is already available in the editor. In this exercise you’ll be writing another function that will use the interpret() function to interpret all the data from your daily profile views inside a vector. Furthermore, your function will return the sum of views on popular days, if asked for. A for loop is ideal for iterating over all the vector elements. The ability to return the sum of views on popular days is something you can code through a function argument with a default value.

The interpret() can be used inside interpret_all()

interpret <- function(num_views) {
  if (num_views > 15) {
    print("You're popular!")
    return(num_views)
  } else {
    print("Try to be more visible!")
    return(0)
  }
}

Define the interpret_all() function

views: vector with data to interpret
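
The function definition itself is not reproduced here. A sketch that is consistent with the description above and with the output shown below, assuming a second argument called return_sum (the name is an assumption) that defaults to TRUE:

```r
# Repeated from above so that the sketch runs on its own
linkedin <- c(16, 9, 13, 5, 2, 17, 14)
interpret <- function(num_views) {
  if (num_views > 15) {
    print("You're popular!")
    return(num_views)
  } else {
    print("Try to be more visible!")
    return(0)
  }
}

# views: vector with data to interpret
# return_sum: return the total number of views on popular days? (assumed name)
interpret_all <- function(views, return_sum = TRUE) {
  count <- 0
  for (v in views) {
    count <- count + interpret(v)
  }
  if (return_sum) {
    return(count)
  } else {
    return(NULL)
  }
}

total <- interpret_all(linkedin)  # prints one message per day
```

Because interpret() returns 0 on unpopular days, simply adding up its return values gives the sum of views on popular days.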

Call the interpret_all() function on both linkedin and facebook

interpret_all(linkedin)
## [1] "You're popular!"
## [1] "Try to be more visible!"
## [1] "Try to be more visible!"
## [1] "Try to be more visible!"
## [1] "Try to be more visible!"
## [1] "You're popular!"
## [1] "Try to be more visible!"
## [1] 33
interpret_all(facebook)
## [1] "You're popular!"
## [1] "Try to be more visible!"
## [1] "Try to be more visible!"
## [1] "You're popular!"
## [1] "Try to be more visible!"
## [1] "Try to be more visible!"
## [1] "Try to be more visible!"
## [1] 33

Have a look at the results: the sum of views on popular days is the same for Facebook and LinkedIn. What a coincidence! Your different social profiles must be fairly balanced ;-) Head over to the next video!

Video: R packages

Load an R Package

There are basically two extremely important functions when it comes down to R packages:

  • install.packages(), which as you can expect, installs a given package.
  • library(), which loads packages, i.e. attaches them to the search list of your R workspace.

To install packages, you need administrator privileges. This means that install.packages() will not work in the DataCamp interface. However, almost all CRAN packages are installed on our servers. You can load them with library().

In this exercise, you’ll be learning how to load the ggplot2 package, a powerful package for data visualization. You’ll use it to create a plot of two variables of the mtcars data frame. The data has already been prepared for you in the workspace.

Before starting, execute the following commands in the console:

  • search(), to look at the currently attached packages and
  • qplot(mtcars$wt, mtcars$hp), to build a plot of two variables of the mtcars data frame.

An error should occur, because you haven’t loaded the ggplot2 package yet!

Load the ggplot2 package

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.5.3

Retry the qplot() function

qplot(mtcars$wt, mtcars$hp)

Check out the currently attached packages again

search()
##  [1] ".GlobalEnv"        "package:ggplot2"   "package:stats"    
##  [4] "package:graphics"  "package:grDevices" "package:utils"    
##  [7] "package:datasets"  "package:methods"   "Autoloads"        
## [10] "package:base"

Notice how search() and library() are closely interconnected functions. Head over to the next exercise.

Question: Different ways to load a package

The library() and require() functions are not very picky when it comes down to argument types: both library(rjson) and library("rjson") work perfectly fine for loading a package.

Only chunk 1 and chunk 2 are correct. Can you figure out why the last two aren’t valid? The warning you receive with chunk 4 makes it quite clear what’s wrong there. As for chunk 3, it seems that the original author of the require() function wanted to allow people to be lazy and not have to enclose the package name in quotation marks. To do this, they included a default setting within require(), which you can view using the args() function. Can you see why changing this default setting in chunk 3, combined with the lack of quotation marks, throws an error?
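
As a rough illustration of the character.only argument, the sketch below uses the tools package, which ships with R, in place of rjson (an assumption made so that the code runs without installing anything):

```r
# character.only is FALSE by default, so an unquoted name is taken literally
args(require)

ok_bare   <- require(tools)      # unquoted name: works
ok_quoted <- require("tools")    # quoted name: works as well

# With character.only = TRUE the argument is evaluated as an expression.
# A string, or a variable holding a string, is fine ...
pkg <- "tools"
ok_var <- require(pkg, character.only = TRUE)

# ... but an unquoted tools is now looked up as a variable, which fails
# (assuming no object called tools exists in your workspace)
failed <- tryCatch(
  require(tools, character.only = TRUE),
  error = function(e) "error"
)
```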

This exercise concludes the chapter on functions. Well done!


4: The apply family

Whenever you’re using a for loop, you may want to revise your code to see whether you can use the lapply function instead. Learn all about this intuitive way of applying a function over a list or a vector, and how to use its variants, sapply and vapply.

Use lapply with a built-in R function

Before you go about solving the exercises below, have a look at the documentation of the lapply() function. The Usage section shows the following expression:

lapply(X, FUN, ...)

To put it generally, lapply takes a vector or list X, and applies the function FUN to each of its members. If FUN requires additional arguments, you pass them after you’ve specified X and FUN (in the ... part). The output of lapply() is a list, the same length as X, where each element is the result of applying FUN on the corresponding element of X.

Now that you are truly brushing up on your data science skills, let’s revisit some of the most relevant figures in data science history. We’ve compiled a vector of famous mathematicians/statisticians and the year they were born. Up to you to extract some information!

The vector pioneers has already been created for you

pioneers <- c("GAUSS:1777", "BAYES:1702", "PASCAL:1623", "PEARSON:1857")

Split names from birth year

split_math <- strsplit(pioneers, split = ":")

Convert to lowercase strings: split_low

split_low <- lapply(split_math, tolower)

Take a look at the structure of split_low

str(split_low)
## List of 4
##  $ : chr [1:2] "gauss" "1777"
##  $ : chr [1:2] "bayes" "1702"
##  $ : chr [1:2] "pascal" "1623"
##  $ : chr [1:2] "pearson" "1857"

Use lapply with your own function

As Filip explained in the instructional video, you can use lapply() on your own functions as well. You just need to code a new function and make sure it is available in the workspace. After that, you can use the function inside lapply() just as you did with base R functions.

In the previous exercise you already used lapply() once to convert the information about your favorite pioneering statisticians to a list of vectors composed of two character strings. Let’s write some code to select the names and the birth years separately.

The sample code already includes the definition of select_first(), which takes a vector as input and returns the first element of this vector.

Write function select_first()

select_first <- function(x) {
  x[1]
}

Apply select_first() over split_low: names

names <- lapply(split_low, select_first)

Write function select_second()

select_second <- function(x) {
  x[2]
}

Apply select_second() over split_low: years

years <- lapply(split_low, select_second)

Head over to the next exercise to learn about anonymous functions.

lapply and anonymous functions

Writing your own functions and then using them inside lapply() is quite an accomplishment! But defining functions to use them only once is kind of overkill, isn’t it? That’s why you can use so-called anonymous functions in R.

Previously, you learned that functions in R are objects in their own right. This means that they aren’t automatically bound to a name. When you create a function, you can use the assignment operator to give the function a name. It’s perfectly possible, however, to not give the function a name. This is called an anonymous function:

# Named function
triple <- function(x) { 3 * x }

# Anonymous function with same implementation
function(x) { 3 * x }
## function(x) { 3 * x }
# Use anonymous function inside lapply()
lapply(list(1,2,3), function(x) { 3 * x })
## [[1]]
## [1] 3
## 
## [[2]]
## [1] 6
## 
## [[3]]
## [1] 9

Transform: use anonymous function inside lapply

names <- lapply(split_low, function(x) { x[1] } )

Transform: use anonymous function inside lapply

years <- lapply(split_low, function(x) { x[2] })

Now, there’s another way to solve the issue of using the select_*() functions only once: you can make a more generic function that can be used in more places. Find out more about this in the next exercise.

Use lapply with additional arguments

In the video, the triple() function was transformed to the multiply() function to allow for a more generic approach. lapply() provides a way to handle functions that require more than one argument, such as the multiply() function:

multiply <- function(x, factor) {
  x * factor
}
lapply(list(1,2,3), multiply, factor = 3)
## [[1]]
## [1] 3
## 
## [[2]]
## [1] 6
## 
## [[3]]
## [1] 9

On the right we’ve included a generic version of the select functions that you’ve coded earlier: select_el(). It takes a vector as its first argument, and an index as its second argument. It returns the vector’s element at the specified index.

Generic select function

select_el <- function(x, index) {
  x[index]
}

Use lapply() twice on split_low: names and years

names <- lapply(split_low, select_el, index = 1)
years <- lapply(split_low, select_el, index = 2)

Your lapply skills are growing by the minute!

Apply functions that return NULL

In all of the previous exercises, it was assumed that the functions that were applied over vectors and lists actually returned a meaningful result. For example, the tolower() function simply returns the strings with the characters in lowercase. This won’t always be the case. Suppose you want to display the structure of every element of a list. You could use the str() function for this, which returns NULL:

lapply(list(1, "a", TRUE), str)
##  num 1
##  chr "a"
##  logi TRUE
## [[1]]
## NULL
## 
## [[2]]
## NULL
## 
## [[3]]
## NULL

This call actually returns a list, the same size as the input list, containing all NULL values. On the other hand, calling

str(TRUE)
##  logi TRUE

on its own prints only the structure of the logical to the console, not NULL. That’s because str() uses invisible() behind the scenes, which returns an invisible copy of the return value, NULL in this case. This prevents it from being printed when the result of str() is not assigned.
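
To see this mechanism in isolation, here is a small sketch with a hypothetical helper, show_type(), that prints something and then returns NULL invisibly:

```r
# Hypothetical helper that prints something and returns NULL invisibly
show_type <- function(x) {
  cat("type:", typeof(x), "\n")
  invisible(NULL)
}

show_type(TRUE)           # prints the type, but no NULL afterwards
res <- show_type(TRUE)    # the value is still there once you capture it
is.null(res)              # TRUE

# withVisible() reports whether a result would be auto-printed
withVisible(str(TRUE))$visible   # FALSE
```

lapply() collects the return values into a list and returns that list visibly, which is why the NULLs show up there.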

Feel free to experiment some more with your code in the console. Did you notice that lapply() always returns a list, no matter the input? This can be kind of annoying. In the next video tutorial you’ll learn about sapply() to solve this.

Video: sapply()

How to use sapply

You can use sapply() similarly to how you used lapply(). The first argument of sapply() is the list or vector X over which you want to apply a function, FUN. Potential additional arguments to this function are specified afterwards (...):

sapply(X, FUN, ...)

In the next couple of exercises, you’ll be working with the variable temp, which contains temperature measurements for 7 days. temp is a list of length 7, where each element is a vector of length 5, representing 5 measurements on a given day.

Define temp

temp <- list(c(3, 7, 9, 6, -1), c(6, 9, 12, 13, 5), c(4, 8, 3, -1, -3), c(1, 4, 7, 2, -2), c(5, 7, 9, 4, 2), c(-3, 5, 8, 9, 4), c(3, 6, 9, 4, 1))

View structure of temp

str(temp)
## List of 7
##  $ : num [1:5] 3 7 9 6 -1
##  $ : num [1:5] 6 9 12 13 5
##  $ : num [1:5] 4 8 3 -1 -3
##  $ : num [1:5] 1 4 7 2 -2
##  $ : num [1:5] 5 7 9 4 2
##  $ : num [1:5] -3 5 8 9 4
##  $ : num [1:5] 3 6 9 4 1

Use lapply() to find each day’s minimum temperature

lapply(temp, min)
## [[1]]
## [1] -1
## 
## [[2]]
## [1] 5
## 
## [[3]]
## [1] -3
## 
## [[4]]
## [1] -2
## 
## [[5]]
## [1] 2
## 
## [[6]]
## [1] -3
## 
## [[7]]
## [1] 1

Use sapply() to find each day’s minimum temperature

sapply(temp, min)
## [1] -1  5 -3 -2  2 -3  1

Use lapply() to find each day’s maximum temperature

lapply(temp, max)
## [[1]]
## [1] 9
## 
## [[2]]
## [1] 13
## 
## [[3]]
## [1] 8
## 
## [[4]]
## [1] 7
## 
## [[5]]
## [1] 9
## 
## [[6]]
## [1] 9
## 
## [[7]]
## [1] 9

Use sapply() to find each day’s maximum temperature

sapply(temp, max)
## [1]  9 13  8  7  9  9  9

Can you tell the difference between the output of lapply() and sapply()? The former returns a list, while the latter returns a vector that is a simplified version of this list. Notice that this time, unlike in the cities example of the instructional video, the vector is not named.

sapply with your own function

Like lapply(), sapply() allows you to use self-defined functions and apply them over a vector or a list:

sapply(X, FUN, ...)

Here, FUN can be one of R’s built-in functions, but it can also be a function you wrote. This self-written function can be defined beforehand, or can be inserted directly as an anonymous function.

Finish function definition of extremes_avg

extremes_avg <- function(x) {
  ( min(x) + max(x) ) / 2
}

Apply extremes_avg() over temp using sapply()

sapply(temp, extremes_avg)
## [1] 4.0 9.0 2.5 2.5 5.5 3.0 5.0

Apply extremes_avg() over temp using lapply()

lapply(temp, extremes_avg)
## [[1]]
## [1] 4
## 
## [[2]]
## [1] 9
## 
## [[3]]
## [1] 2.5
## 
## [[4]]
## [1] 2.5
## 
## [[5]]
## [1] 5.5
## 
## [[6]]
## [1] 3
## 
## [[7]]
## [1] 5

Of course, you could have solved this exercise using an anonymous function, but this would require you to use the code inside the definition of extremes_avg() twice. Duplicating code should be avoided as much as possible!

sapply with function returning vector

In the previous exercises, you’ve seen how sapply() simplifies the list that lapply() would return by turning it into a vector. But what if the function you’re applying over a list or a vector returns a vector of length greater than 1? If you don’t remember from the video, don’t waste more time in the valley of ignorance and head over to the instructions!

Create a function that returns min and max of a vector: extremes

extremes <- function(x) {
  c(min = min(x), max = max(x))
}

Apply extremes() over temp with sapply()

sapply(temp, extremes)
##     [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## min   -1    5   -3   -2    2   -3    1
## max    9   13    8    7    9    9    9

Apply extremes() over temp with lapply()

lapply(temp, extremes)
## [[1]]
## min max 
##  -1   9 
## 
## [[2]]
## min max 
##   5  13 
## 
## [[3]]
## min max 
##  -3   8 
## 
## [[4]]
## min max 
##  -2   7 
## 
## [[5]]
## min max 
##   2   9 
## 
## [[6]]
## min max 
##  -3   9 
## 
## [[7]]
## min max 
##   1   9

Have a final look at the console and see how sapply() did a great job at simplifying the rather uninformative ‘list of vectors’ that lapply() returns. It actually returned a nicely formatted matrix!

sapply can’t simplify, now what?

It seems like we’ve hit the jackpot with sapply(). On all of the examples so far, sapply() was able to nicely simplify the rather bulky output of lapply(). But, as with life, there are things you can’t simplify. How does sapply() react?

We already created a function, below_zero(), that takes a vector of numerical values and returns a vector that only contains the values that are strictly below zero.

Definition of below_zero()

below_zero <- function(x) {
  return(x[x < 0])
}

Apply below_zero over temp using sapply(): freezing_s

freezing_s <- sapply(temp, below_zero)

Apply below_zero over temp using lapply(): freezing_l

freezing_l <- lapply(temp, below_zero)

Are freezing_s and freezing_l identical?

identical(freezing_s, freezing_l)
## [1] TRUE

Given that the length of the output of below_zero() changes for different input vectors, sapply() is not able to convert the output of lapply() to a nicely formatted matrix. Instead, the output values of sapply() and lapply() are exactly the same, as shown by the TRUE output of identical().
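
One way to see why no simplification is possible is to look at the length of each result. The sketch below repeats temp and below_zero() so that it runs on its own:

```r
temp <- list(c(3, 7, 9, 6, -1), c(6, 9, 12, 13, 5), c(4, 8, 3, -1, -3),
             c(1, 4, 7, 2, -2), c(5, 7, 9, 4, 2), c(-3, 5, 8, 9, 4),
             c(3, 6, 9, 4, 1))
below_zero <- function(x) x[x < 0]

freezing_s <- sapply(temp, below_zero)

# The results have different lengths, so there is no common shape
lengths(freezing_s)   # 1 0 2 1 0 1 0
is.list(freezing_s)   # TRUE: sapply() fell back to a list
```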

sapply with functions that return NULL

You already have some apply tricks up your sleeve, but you’re surely hungry for some more, aren’t you? In this exercise, you’ll see how sapply() reacts when it is used to apply a function that returns NULL over a vector or a list.

A function print_info(), that takes a vector and prints the average of this vector, has already been created for you. It uses the cat() function.

Definition of print_info()

print_info <- function(x) {
  cat("The average temperature is", mean(x), "\n")
}

Apply print_info() over temp using sapply()

sapply(temp, print_info)
## The average temperature is 4.8 
## The average temperature is 9 
## The average temperature is 2.2 
## The average temperature is 2.4 
## The average temperature is 5.4 
## The average temperature is 4.6 
## The average temperature is 4.6
## [[1]]
## NULL
## 
## [[2]]
## NULL
## 
## [[3]]
## NULL
## 
## [[4]]
## NULL
## 
## [[5]]
## NULL
## 
## [[6]]
## NULL
## 
## [[7]]
## NULL

Apply print_info() over temp using lapply()

lapply(temp, print_info)
## The average temperature is 4.8 
## The average temperature is 9 
## The average temperature is 2.2 
## The average temperature is 2.4 
## The average temperature is 5.4 
## The average temperature is 4.6 
## The average temperature is 4.6
## [[1]]
## NULL
## 
## [[2]]
## NULL
## 
## [[3]]
## NULL
## 
## [[4]]
## NULL
## 
## [[5]]
## NULL
## 
## [[6]]
## NULL
## 
## [[7]]
## NULL

Notice here that, quite surprisingly, sapply() does not simplify the list of NULLs. That’s because the ‘vector version’ of a list of NULLs would simply be NULL, which is no longer a vector with the same length as the input. Proceed to the next exercise.

Reverse engineering sapply

This concludes the exercise set on sapply(). Head over to another video to learn all about vapply()!

Video: vapply

Use vapply

Before you get your hands dirty with the third and last apply function that you’ll learn about in this intermediate R course, let’s take a look at its syntax. The function is called vapply(), and it has the following syntax:

vapply(X, FUN, FUN.VALUE, ..., USE.NAMES = TRUE)

The function FUN is applied over the elements inside X. The FUN.VALUE argument expects a template for the return value of this function FUN. USE.NAMES is TRUE by default; in this case vapply() tries to generate a named array, if possible.
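
A small sketch of FUN.VALUE and USE.NAMES, using a hypothetical character vector words (not part of the exercises):

```r
words <- c("gauss", "bayes", "pascal")

# nchar() returns one integer per string, so the template is integer(1)
named   <- vapply(words, nchar, integer(1))
unnamed <- vapply(words, nchar, integer(1), USE.NAMES = FALSE)

names(named)     # "gauss" "bayes" "pascal"
names(unnamed)   # NULL
```

When X is a character vector, its elements are used as the names of the result unless USE.NAMES is switched off.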

For the next set of exercises, you’ll be working on the temp list again, which contains 7 numerical vectors of length 5. We also coded a function basics() that takes a vector and returns a named vector of length 3, containing the minimum, mean and maximum value of the vector, respectively.

Definition of basics()

basics <- function(x) {
  c(min = min(x), mean = mean(x), max = max(x))
}

Apply basics() over temp using vapply()

vapply(temp, basics, numeric(3))
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## min  -1.0    5 -3.0 -2.0  2.0 -3.0  1.0
## mean  4.8    9  2.2  2.4  5.4  4.6  4.6
## max   9.0   13  8.0  7.0  9.0  9.0  9.0

Notice how, just as with sapply(), vapply() neatly transfers the names that you specify in the basics() function to the row names of the matrix that it returns.

Use vapply (2)

So far you’ve seen that vapply() mimics the behavior of sapply() if everything goes according to plan. But what if it doesn’t?

In the video, Filip showed you that there are cases where the structure of the output of the function you want to apply, FUN, does not correspond to the template you specify in FUN.VALUE. In that case, vapply() will throw an error that informs you about the misalignment between expected and actual output.
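
A sketch of such a mismatch, using a shortened version of temp (an assumption for brevity) and tryCatch() to capture the error instead of stopping the script:

```r
temp <- list(c(3, 7, 9, 6, -1), c(6, 9, 12, 13, 5))  # shortened for brevity

basics <- function(x) {
  c(min = min(x), mean = mean(x), median = median(x), max = max(x))
}

# basics() returns 4 values, but the template promises 3
result <- tryCatch(
  vapply(temp, basics, numeric(3)),
  error = function(e) conditionMessage(e)
)
is.character(result)   # TRUE: vapply() stopped with an error message
```

Printing result shows the message vapply() produced, which states the expected and the actual length.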

Definition of the basics() function

basics <- function(x) {
  c(min = min(x), mean = mean(x), median = median(x), max = max(x))
}

Fix the error:

vapply(temp, basics, numeric(4))
##        [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## min    -1.0    5 -3.0 -2.0  2.0 -3.0  1.0
## mean    4.8    9  2.2  2.4  5.4  4.6  4.6
## median  6.0    9  3.0  2.0  5.0  5.0  4.0
## max     9.0   13  8.0  7.0  9.0  9.0  9.0

From sapply to vapply

As highlighted before, vapply() can be considered a more robust version of sapply(), because you explicitly restrict the output of the function you want to apply. Converting your sapply() expressions in your own R scripts to vapply() expressions is therefore a good practice (and also a breeze!).

Convert to vapply() expression

vapply(temp, max, numeric(1))
## [1]  9 13  8  7  9  9  9

Convert to vapply() expression

vapply(temp, function(x, y) { mean(x) > y }, y = 5, logical(1))
## [1] FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE

You’ve got no more excuses to use sapply() in the future!


5: Utilities

Mastering R programming is not only about understanding its programming concepts. Having a solid understanding of a wide range of R functions is also important. This chapter introduces you to many useful functions for data structure manipulation, regular expressions, and working with times and dates.

Video: Useful functions

Mathematical utilities

Have another look at some useful math functions that R features:

  • abs(): Calculate the absolute value.
  • sum(): Calculate the sum of all the values in a data structure.
  • mean(): Calculate the arithmetic mean.
  • round(): Round the values to 0 decimal places by default. Try out ?round in the console for variations of round() and ways to change the number of digits to round to.

As a data scientist in training, you’ve estimated a regression model on the sales data for the past six months. After evaluating your model, you see that the training error of your model is quite regular, showing both positive and negative values. The error values are already defined in the workspace on the right (errors).

The errors vector has already been defined for you

errors <- c(1.9, -2.6, 4.0, -9.5, -3.4, 7.3)

Sum of absolute rounded values of errors

sum(round(abs(errors)))
## [1] 29

Find the error

We went ahead and included some code on the right, but there’s still an error. Can you trace it and fix it?

In times of despair, help with functions such as sum() and rev() is a single command away; simply use ?sum and ?rev in the console.

Don’t edit these two lines

vec1 <- c(1.5, 2.5, 8.4, 3.7, 6.3)
vec2 <- rev(vec1)

Fix the error

mean(c(abs(vec1), abs(vec2)))
## [1] 4.48

If you check out the documentation of mean(), you’ll see that only the first argument, x, should be a vector. If you also specify a second argument, R will match the arguments by position and expect a specification of the trim argument. Therefore, merging the two vectors is a must!
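
A sketch of what goes wrong when both vectors are passed separately; tryCatch() is used here only to capture the error so that the script keeps running:

```r
vec1 <- c(1.5, 2.5, 8.4, 3.7, 6.3)
vec2 <- rev(vec1)

# The second vector is matched to trim, which must be a single number
attempt <- tryCatch(
  mean(abs(vec1), abs(vec2)),
  error = function(e) "error"
)
attempt   # "error"

# Combining the vectors first gives mean() a single x argument
mean(c(abs(vec1), abs(vec2)))   # 4.48
```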

Data Utilities

R features a bunch of functions to juggle around with data structures:

  • seq(): Generate sequences, by specifying the from, to, and by arguments.
  • rep(): Replicate elements of vectors and lists.
  • sort(): Sort a vector in ascending order. Works on numerics, but also on character strings and logicals.
  • rev(): Reverse the elements in a data structure for which reversal is defined.
  • str(): Display the structure of any R object.
  • append(): Merge vectors or lists.
  • is.*(): Check for the class of an R object.
  • as.*(): Convert an R object from one class to another.
  • unlist(): Flatten (possibly embedded) lists to produce a vector.
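
A quick tour of these functions on small, made-up inputs (is.numeric() and as.numeric() stand in for the is.*() and as.*() families):

```r
seq(1, 10, by = 3)            # 1 4 7 10
rep(c("a", "b"), times = 2)   # "a" "b" "a" "b"
sort(c(3, 1, 2))              # 1 2 3
rev(c(3, 1, 2))               # 2 1 3
append(c(1, 2), c(3, 4))      # 1 2 3 4
is.numeric("5")               # FALSE
as.numeric("5")               # 5
unlist(list(1, list(2, 3)))   # 1 2 3
```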

Remember the social media profile views data? You’ll use it again now, although in list form.

The linkedin and facebook lists have already been created for you

linkedin <- list(16, 9, 13, 5, 2, 17, 14)
facebook <- list(17, 7, 5, 16, 8, 13, 14)

Convert linkedin and facebook to a vector: li_vec and fb_vec

li_vec <- unlist(linkedin)
fb_vec <- unlist(facebook)

Append fb_vec to li_vec: social_vec

social_vec <- append(li_vec, fb_vec)

Sort social_vec

sort(social_vec, decreasing = TRUE)
##  [1] 17 17 16 16 14 14 13 13  9  8  7  5  5  2

These instructions required you to solve this challenge in a step-by-step approach. If you’re comfortable with the functions, you can combine some of these steps into powerful one-liners.

Find the error (2)

Just as before, let’s switch roles. It’s up to you to see what unforgivable mistakes we’ve made. Go fix them!

Fix me

rep(seq(1, 7, by = 2), times = 7)
##  [1] 1 3 5 7 1 3 5 7 1 3 5 7 1 3 5 7 1 3 5 7 1 3 5 7 1 3 5 7

Debugging code is also a big part of the daily routine of a data scientist, and you seem to be great at it!

Beat Gauss using R

There is a popular story about young Gauss. As a pupil, he had a lazy teacher who wanted to keep the classroom busy by having them add up the numbers 1 to 100. Gauss came up with an answer almost instantaneously, 5050. On the spot, he had developed a formula for calculating the sum of an arithmetic series. There are more general formulas for calculating the sum of an arithmetic series with different starting values and increments. Instead of deriving such a formula, why not use R to calculate the sum of a sequence?

Create first sequence: seq1

seq1 <- seq(1, 500, 3)

Create second sequence: seq2

seq2 <- seq(1200, 900, -7)

Calculate total sum of the sequences

sum(c(seq1, seq2))
## [1] 87029
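
As a cross-check, the formula Gauss is said to have used, n * (first + last) / 2, can be written as a small helper. series_sum() is a hypothetical name, and the sketch assumes its input really is an arithmetic sequence:

```r
# Sum of an arithmetic series: n * (first + last) / 2
series_sum <- function(s) {
  length(s) * (s[1] + s[length(s)]) / 2
}

sum(1:100)          # 5050
series_sum(1:100)   # 5050

seq1 <- seq(1, 500, 3)
seq2 <- seq(1200, 900, -7)
series_sum(seq1) + series_sum(seq2)   # 87029
```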

Video: Regular expressions

grepl & grep

In their most basic form, regular expressions can be used to see whether a pattern exists inside a character string or a vector of character strings. For this purpose, you can use:

  • grepl(), which returns TRUE when a pattern is found in the corresponding character string.
  • grep(), which returns a vector of indices of the character strings that contain the pattern.

Both functions need a pattern and an x argument, where pattern is the regular expression you want to match for, and the x argument is the character vector from which matches should be sought.

In this and the following exercises, you’ll be querying and manipulating a character vector of email addresses! The vector emails has already been defined in the editor on the right so you can begin with the instructions straight away!

The emails vector has already been defined for you

emails <- c("john.doe@ivyleague.edu", "education@world.gov", "dalai.lama@peace.org",
            "invalid.edu", "quant@bigdatacollege.edu", "cookie.monster@sesame.tv")

Use grepl() to match for “edu”

grepl("edu", emails)
## [1]  TRUE  TRUE FALSE  TRUE  TRUE FALSE

Use grep() to match for “edu”, save result to hits

hits <- grep("edu", emails)

Subset emails using hits

emails[hits]
## [1] "john.doe@ivyleague.edu"   "education@world.gov"     
## [3] "invalid.edu"              "quant@bigdatacollege.edu"

You can probably guess what we’re trying to achieve here: select all the emails that end with “.edu”. However, the strings education@world.gov and invalid.edu were also matched. Let’s see in the next exercise what you can do to improve our pattern and remove these false positives.

grepl & grep (2)

You can use the caret, ^, and the dollar sign, $, to match the content located at the start and end of a string, respectively. This could take us one step closer to a correct pattern for matching only the “.edu” email addresses from our list of emails. But there’s more that can be added to make the pattern more robust:

  • @, because a valid email must contain an at-sign.
  • .*, which matches any character (the .) zero or more times (*). Both the dot and the asterisk are metacharacters. You can use them to match any character between the at-sign and the “.edu” portion of an email address.
  • \\.edu$, to match the “.edu” part of the email at the end of the string. The \\ part escapes the dot: it tells R that you want to use the . as an actual character.

Use grepl() to match for .edu addresses more robustly

grepl("@.*\\.edu$", emails)
## [1]  TRUE FALSE FALSE FALSE  TRUE FALSE

Use grep() to match for .edu addresses more robustly, save result to hits

hits <- grep("@.*\\.edu$", emails)

Subset emails using hits

emails[hits]
## [1] "john.doe@ivyleague.edu"   "quant@bigdatacollege.edu"

A careful construction of our regular expression leads to more meaningful matches. However, even our robust email selector will often match some incorrect email addresses (for instance kiara@@fakemail.edu). Let’s not worry about this too much and continue with sub() and gsub() to actually edit the email addresses!

sub & gsub

While grep() and grepl() were used to simply check whether a regular expression could be matched with a character vector, sub() and gsub() take it one step further: you can specify a replacement argument. If inside the character vector x, the regular expression pattern is found, the matching element(s) will be replaced with replacement. sub() only replaces the first match, whereas gsub() replaces all matches.
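
The difference between the two is easiest to see on a string that contains the pattern more than once. A minimal sketch with a made-up date string:

```r
x <- "2020-11-14"

first_only <- sub("-", "/", x)    # "2020/11-14": only the first match is replaced
all_hits   <- gsub("-", "/", x)   # "2020/11/14": every match is replaced
```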

Suppose that the emails vector you’ve been working with is an excerpt of DataCamp’s email database. Why not offer the owners of the .edu email addresses a new email address on the datacamp.edu domain? This could be quite a powerful marketing stunt: Online education is taking over traditional learning institutions! Convert your email and be a part of the new generation!

Use sub() to convert the email domains to datacamp.edu

sub("@.*\\.edu$", "@datacamp.edu", emails)
## [1] "john.doe@datacamp.edu"    "education@world.gov"     
## [3] "dalai.lama@peace.org"     "invalid.edu"             
## [5] "quant@datacamp.edu"       "cookie.monster@sesame.tv"

Notice how only the valid .edu addresses are changed while the other emails remain unchanged. To get a taste of other things you can accomplish with regex, head over to the next exercise.

sub & gsub (2)

Regular expressions are a typical concept that you’ll learn by doing and by seeing other examples. Before you rack your brains over the regular expression in this exercise, have a look at the new things that will be used:

  • .*: A usual suspect! It can be read as “any character that is matched zero or more times”.
  • \\s: Match a space. The “s” is normally a character; escaping it (\\) makes it a metacharacter.
  • [0-9]+: Match the numbers 0 to 9, at least once (+).
  • ([0-9]+): The parentheses are used to make parts of the matching string available to define the replacement. Refer to () references using \\1, \\2, etc. in the replacement argument of sub().
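The pieces listed above can be combined as in the following sketch. The awards vector here is a made-up stand-in for the exercise data, so treat the strings as assumptions.

```r
# Hypothetical award summaries (made up for illustration)
awards <- c("Won 1 Oscar.",
            "Another 9 wins & 24 nominations.",
            "4 wins & 1 nomination.")

# Replace the whole match with the number captured by ([0-9]+);
# strings without a match are returned unchanged
nominations <- sub(".*\\s([0-9]+)\\snomination.*$", "\\1", awards)
nominations
## [1] "Won 1 Oscar." "24"           "1"
```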

The ([0-9]+) selects the entire number that comes before the word “nomination” in the string, and the entire match gets replaced by this number because of the \\1 that refers to the content inside the parentheses. The next video will get you up to speed with times and dates in R!

Video: Times & Dates

Right here, right now

In R, dates are represented by Date objects, while times are represented by POSIXct objects. Under the hood, however, these dates and times are simple numerical values. Date objects store the number of days since the 1st of January in 1970. POSIXct objects on the other hand, store the number of seconds since the 1st of January in 1970.

The 1st of January in 1970 is the common origin for representing times and dates in a wide range of programming languages. There is no particular reason for this; it is a simple convention. Of course, it’s also possible to create dates and times before 1970; the corresponding numerical values are simply negative in this case.
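As a quick check of this convention, the sketch below looks at the numbers stored for dates around the origin. The time zone is set to UTC explicitly so the result does not depend on your machine’s settings.

```r
# The origin itself is stored as 0
origin <- as.Date("1970-01-01")
unclass(origin)
## [1] 0

# One day before the origin is stored as -1
day_before <- as.Date("1969-12-31")
unclass(day_before)
## [1] -1

# POSIXct counts seconds: one hour after the origin (in UTC) is 3600
one_hour <- as.POSIXct("1970-01-01 01:00:00", tz = "UTC")
as.numeric(one_hour)
## [1] 3600
```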

Get the current date: today

today <- Sys.Date()

See what today looks like under the hood

unclass(today)
## [1] 18580

Get the current time: now

now <- Sys.time()

See what now looks like under the hood

unclass(now)
## [1] 1605345735

Using R to get the current date and time is nice, but you should also know how to create dates and times from character strings. Find out how in the next exercises!

Create and format dates

To create a Date object from a simple character string in R, you can use the as.Date() function. The character string has to obey a format that can be defined using a set of symbols (the examples correspond to 13 January, 1982):

  • %Y: 4-digit year (1982)
  • %y: 2-digit year (82)
  • %m: 2-digit month (01)
  • %d: 2-digit day of the month (13)
  • %A: weekday (Wednesday)
  • %a: abbreviated weekday (Wed)
  • %B: month (January)
  • %b: abbreviated month (Jan)

The following R commands will all create the same Date object for the 13th day in January of 1982:

as.Date("1982-01-13")
## [1] "1982-01-13"
as.Date("Jan-13-82", format = "%b-%d-%y")
## [1] "1982-01-13"
as.Date("13 January, 1982", format = "%d %B, %Y")
## [1] "1982-01-13"

Notice that the first line here did not need a format argument, because by default R matches your character string to the formats "%Y-%m-%d" or "%Y/%m/%d".

In addition to creating dates, you can also convert dates to character strings that use a different date notation. For this, you use the format() function. Try the following lines of code:

today <- Sys.Date()
format(Sys.Date(), format = "%d %B, %Y")
## [1] "14 November, 2020"
format(Sys.Date(), format = "Today is a %A!")
## [1] "Today is a Saturday!"

Definition of character strings representing dates

str1 <- "May 23, '96"
str2 <- "2012-03-15"
str3 <- "30/January/2006"

Convert the strings to dates: date1, date2, date3

date1 <- as.Date(str1, format = "%b %d, '%y")
date2 <- as.Date(str2, format = "%Y-%m-%d")
date3 <- as.Date(str3, format = "%d/%B/%Y")

Convert dates to formatted strings

format(date1, "%A")
## [1] "Thursday"
format(date2, "%d")
## [1] "15"
format(date3, "%b %Y")
## [1] "Jan 2006"

You can use POSIXct objects, i.e. Time objects in R, in a similar fashion. Give it a try in the next exercise.

Create and format times

Similar to working with dates, you can use as.POSIXct() to convert from a character string to a POSIXct object, and format() to convert from a POSIXct object to a character string. Again, you have a wide variety of symbols:

  • %H: hours as a decimal number (00-23)
  • %I: hours as a decimal number (01-12)
  • %M: minutes as a decimal number
  • %S: seconds as a decimal number
  • %T: shorthand notation for the typical format %H:%M:%S
  • %p: AM/PM indicator

For a full list of conversion symbols, consult the strptime documentation in the console.

Again, as.POSIXct() uses a default format to match character strings. In this case, it’s %Y-%m-%d %H:%M:%S. In this exercise, time zones are ignored.
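Here is a small sketch of the default format at work. The time is an arbitrary example, and tz = "UTC" is an added assumption that only serves to make the result reproducible.

```r
# No format argument needed: the string already follows %Y-%m-%d %H:%M:%S
meeting <- as.POSIXct("2012-03-12 14:23:08", tz = "UTC")

# 24-hour clock, using the %T shorthand
format(meeting, "%T")
## [1] "14:23:08"

# 12-hour clock; the exact AM/PM text depends on your locale
format(meeting, "%I:%M %p")
```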

Definition of character strings representing times

str1 <- "May 23, '96 hours:23 minutes:01 seconds:45"
str2 <- "2012-3-12 14:23:08"

Convert the strings to POSIXct objects: time1, time2

time1 <- as.POSIXct(str1, format = "%B %d, '%y hours:%H minutes:%M seconds:%S")
time2 <- as.POSIXct(str2, format = "%Y-%m-%d %T")

Convert times to formatted strings

format(time1, format = "%M")
## [1] "01"
format(time2, format = "%I:%M %p")
## [1] "02:23 PM"

Calculations with Dates

Both Date and POSIXct R objects are represented by simple numerical values under the hood. This makes calculation with time and date objects very straightforward: R performs the calculations using the underlying numerical values, and then converts the result back to human-readable time information again.

You can increment and decrement Date objects, or do actual calculations with them (try it out in the console!):

today <- Sys.Date()
today + 1
## [1] "2020-11-15"
today - 1
## [1] "2020-11-13"
as.Date("2015-03-12") - as.Date("2015-02-27")
## Time difference of 13 days

To control your eating habits, you decided to write down the dates of the last five days that you ate pizza. In the workspace, these dates are defined as five Date objects, day1 to day5. The code on the right also contains a vector pizza with these 5 Date objects.

day1, day2, day3, day4 and day5 are already available in the workspace

day1 <- as.Date("2020-09-13")
day2 <- as.Date("2020-09-15")
day3 <- as.Date("2020-09-20")
day4 <- as.Date("2020-09-26")
day5 <- as.Date("2020-10-01")

Difference between last and first pizza day

day5 - day1
## Time difference of 18 days

Create vector pizza

pizza <- c(day1, day2, day3, day4, day5)

Create differences between consecutive pizza days: day_diff

day_diff <- diff(pizza)

Average period between two consecutive pizza days

mean(day_diff)
## Time difference of 4.5 days

Calculations with Times

Calculations using POSIXct objects are completely analogous to those using Date objects. Try to experiment with this code to increase or decrease POSIXct objects:

now <- Sys.time()
now + 3600          # add an hour
## [1] "2020-11-14 10:22:16 GMT"
now - 3600 * 24     # subtract a day
## [1] "2020-11-13 09:22:16 GMT"

Adding or subtracting time objects is also straightforward:

birth <- as.POSIXct("1879-03-14 14:37:23")
death <- as.POSIXct("1955-04-18 03:47:12")
einstein <- death - birth
einstein
## Time difference of 27792.51 days

You’re developing a website that requires users to log in and out. You want to know the total and average amount of time a particular user spends on your website. This user has logged in 5 times and logged out 5 times as well. These times are gathered in the vectors login and logout, which are already defined in the workspace.

login and logout are already defined in the workspace

login <- as.POSIXct(c("2020-09-17 10:18:04", "2020-09-22 09:14:18", "2020-09-22 12:21:51", "2020-09-22 12:37:24", "2020-09-24 21:37:55"), tz = "UTC")
logout <- as.POSIXct(c("2020-09-17 10:56:29", "2020-09-22 09:14:52", "2020-09-22 12:35:48", "2020-09-22 13:17:22", "2020-09-24 22:08:47"), tz = "UTC")

Calculate the difference between login and logout: time_online

time_online <- logout - login

Inspect the variable time_online

time_online
## Time differences in secs
## [1] 2305   34  837 2398 1852

Calculate the total time online

sum(time_online)
## Time difference of 7426 secs

Calculate the average time online

mean(time_online)
## Time difference of 1485.2 secs
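R picks the unit of a time difference automatically, which is why the results above are in seconds. If you would rather choose the unit yourself, difftime() and as.numeric() both accept a units argument. The sketch below uses two sessions with the same clock times as the exercise data; the variable names are made up.

```r
# Two sessions, parsed explicitly as UTC (hypothetical variable names)
session_start <- as.POSIXct(c("2020-09-17 10:18:04", "2020-09-22 09:14:18"), tz = "UTC")
session_end   <- as.POSIXct(c("2020-09-17 10:56:29", "2020-09-22 09:14:52"), tz = "UTC")

# Ask for minutes instead of letting R pick the unit
online_mins <- difftime(session_end, session_start, units = "mins")

# Convert to plain numbers in a unit of your choice
as.numeric(online_mins, units = "secs")
## [1] 2305   34
```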

Time is of the essence

The dates when a season begins and ends can vary depending on who you ask. People in Australia will tell you that spring starts on September 1st. People in Ireland, in the Northern Hemisphere, will swear that spring starts on February 1st, with the celebration of St. Brigid’s Day. Then there’s also the difference between astronomical and meteorological seasons: while astronomers are used to equinoxes and solstices, meteorologists divide the year into 4 fixed seasons that are each three months long. (source: www.timeanddate.com)

A vector astro, which contains character strings representing the dates on which the 4 astronomical seasons start, has been defined in your workspace. Similarly, a vector meteo has already been created for you, with the meteorological beginnings of a season.

Define astro and meteo

astro <- c("20-Mar-2015", "25-Jun-2015", "23-Sep-2015", "22-Dec-2015")
names(astro) <- c("spring", "summer", "fall", "winter")
meteo <- c("March 1, 15", "June 1, 15", "September 1, 15", "December 1, 15")
names(meteo) <- c("spring", "summer", "fall", "winter")

Convert astro to vector of Date objects: astro_dates

astro_dates <- as.Date(astro, format = "%d-%b-%Y")

Convert meteo to vector of Date objects: meteo_dates

meteo_dates <- as.Date(meteo, format = "%B %d, %y")

Calculate the maximum absolute difference between astro_dates and meteo_dates

max(abs(astro_dates - meteo_dates))
## Time difference of 24 days

Impressive! Great job on finishing this course!


Module 3: R Markdown

R Markdown is an easy-to-use formatting language you can use to reveal insights from data and author your findings as a PDF, HTML file, or Shiny app. In this course, you’ll learn how to create and modify each element of a Markdown file, including the code, text, and metadata. You’ll analyze data with dplyr, create visualizations with ggplot2, and author your analyses and plots as reports. You’ll gain hands-on experience building reports as you work with real-world data from the International Finance Corporation (IFC), learning how to efficiently organize reports using code chunk options, create lists and tables, and include a table of contents. By the end of the course, you’ll have the skills you need to add your brand’s fonts and colors using parameters and Cascading Style Sheets (CSS), to make your reports stand out.

1: Getting started with R Markdown

In this chapter, you’ll learn about the three components of a Markdown file: the code, the text, and the metadata. You’ll also learn to add and modify each of these elements to your own reports, as you create your first Markdown files.

Video: Introduction to R Markdown

Creating your first R Markdown file

Throughout the course, you’ll be working on creating an investment report using two datasets from the World Bank IFC. The first dataset, investment_annual_summary, provides the summary of the dollars in millions provided to each region for each fiscal year, from 2012 to 2018. To get started on your report, you first want to print out the dataset.

To create your report, you’ll need to edit the Markdown file shown on the right of the DataCamp console, as described in the instructions, then press the green “Knit HTML” button to knit the file and see the resulting HTML file. We’ll discuss other output types later in the course.

Because this textbook is written using R Markdown, I’ll be including the R Markdown output for this chapter here.

In each exercise, the first code chunk in the Markdown file will load the readr package and the datasets you’ll be using in the exercise. You’ll learn more about the details of this code chunk later in the course, but you won’t need to modify it for any of the exercises in this chapter.

As you can see from the output, adding the dataset name to the code chunk and knitting the document resulted in the dataset being printed in your report. You can also see from the report that the dataset has 42 rows, and that each row contains information about the fiscal_year, region, and dollars_in_millions for the investments made that year.

Adding code chunks to your file

When creating your own reports, one of the first things you’ll want to do is add some code! In the video, we discussed how you can add your own code by adding code chunks. Previously, we looked at the investment_annual_summary dataset we’ll be using throughout the course. In this exercise, let’s take a look at the annual summary dataset as well as the other dataset we’ll be using, investment_services_projects.

From the report, you can see that the investment_services_projects dataset contains information about each individual project and includes a lot more variables than the investment_annual_summary dataset. You’ll be exploring both of these datasets in much more depth in the coming exercises as you create your investment report!

Video: Adding and formatting text

Question: Formatting text

Adding sections to your report

Previously, you added the names of the datasets you’ll be using to build your report to the Markdown file that will be used to create the report. Now, you’ll add some headers and text to the document to provide some additional detail about the datasets to the audience who will read the report.

As you can see from the report, the more hashes you place in front of the text, the smaller the header will be when you knit the file.

Video: The YAML header

Editing the YAML header

The YAML header contains the metadata for your report and includes information like the title, author, and output format. In this exercise, you’ll update your report by adding some more detail about who created the report and when it was created.

Remember, you can add the date to your report manually by entering the date as a string. However, adding the date with Sys.Date() is much more efficient and scalable, since it will ensure that the date is updated automatically each time you edit and knit your file.

Formatting the date

Now that your report includes some more high-level detail, you’d like to include the date using a different format. Be sure to refer to the tables of date formatting options from the video below.

[Image: tables of date formatting options from the video. Source: DataCamp]

Keep in mind that there are many text and numeric options for formatting the date of your report. Now that you understand each of the elements of the Markdown file, and how to modify them, you’re ready to add more detail to your Investment Report!
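To get a feel for the options, here is a rough sketch of what a few format strings produce. It uses a fixed example date instead of Sys.Date() so the results are reproducible; the date and the choice of formats are assumptions for illustration.

```r
# A fixed date keeps the output reproducible; in a report you would use Sys.Date()
report_date <- as.Date("2020-11-14")

# Text month: gives "14 November 2020" in an English locale
format(report_date, "%d %B %Y")

# Numeric formats do not depend on the locale
format(report_date, "%m/%d/%y")
## [1] "11/14/20"
format(report_date, "%Y-%m-%d")
## [1] "2020-11-14"

# In the YAML header, the same call is wrapped in inline R code, for example:
# date: "`r format(Sys.Date(), '%d %B %Y')`"
```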


2: Adding Analyses and Visualizations

In this chapter, you’ll use dplyr to begin to analyze the World Bank IFC datasets and include the analyses in your report. You’ll then create visualizations of the data using ggplot2 and learn to modify how the plots display in your knit report.

Video: Analyzing the data

Filtering for a specific country

Previously, you learned how to filter the data to find out more information about projects that occurred in Indonesia. Now, you’ll build a report that provides this information for another country that’s included in the investment_services_projects data. In this exercise, you’ll begin filtering the data to gather information about projects that occurred in Brazil.

From the report, you can see that there were a total of 88 projects in Brazil during the 2012 to 2018 fiscal years.

Filtering for a specific year

Now that you’ve filtered the data for the projects in a specific country, you can filter the results further to look at all projects that occurred in the 2018 fiscal year. Recall, the fiscal year starts on July 1st of the previous year and ends on June 30th of the year of interest.

Now your report includes information about all projects that occurred in Brazil during the 2012 to 2018 fiscal years and the projects that occurred in the 2018 fiscal year. Next, you’ll learn how to create and add plots to these reports, so that your audience is able to visualize this information when they read the final report.

Referencing code results in the report

In this exercise, you’ll use summarize() and brazil_investment_projects_2018 to find the total investment amount for all projects in Brazil in the 2018 fiscal year. Then, you’ll add text to the report to include the information and reference the code results in the text, so that the calculated amount is printed in the text of the report when you knit the file.

Now, if you update the analysis and the brazil_investment_projects_2018_total amount changes, you won’t need to modify the amount manually in the sentence that describes the information. Instead, since the object name is referenced in the text, the report will update automatically the next time the report is knit.

Video: Adding plots

Visualizing the Investment Annual Summary data

Now that you have all of the data ready for the report you’re creating, you can start making plots that will be included in the report to help your audience visualize the data when they’re reading the report. You’ll start by creating a line plot of the investment_annual_summary data.

Notice that for each region, the highest investments were made during the 2013 or 2014 fiscal years. This is an example of an insight you might want to include as text in the final report.

Visualizing all projects for one country

Previously, you created a line plot using the investment_annual_summary. Now, you’ll use the data you filtered to create scatterplots that summarize the information about the projects that occurred in Brazil.

Notice the warning message in the report that tells us that there were 7 rows removed that contain missing values. You’ll learn how to handle warning messages later in the course, but this may be something you’d want to include in the text of your report to specify which data points were excluded from the plot and why.

Visualizing all projects for one country and year

Now, you’ll create a line plot using the data that was filtered for all projects that occurred in Brazil in the 2018 fiscal year. In the previous exercises, the labels were added for you. While creating this plot, you’ll gain some experience adding your own labels that will appear when you knit the report.

Now your report has plots for the Investment Annual Summary data, as well as for all projects in Brazil during the 2012 to 2018 fiscal years and all projects in the 2018 fiscal year. Now that you’ve created these plots for your report, you’ll learn some of the options you have for formatting how the plots appear in the final report.

Video: Plot options

Setting chunk options globally

Now that your plots are ready to include in your report, you can modify how they appear once the file is knit. Previously, you learned the difference between setting options globally and setting them locally. In this exercise, you’ll set options for the figures globally, which means the options will apply to all figures throughout the code chunks in the report.

Recall, the options for fig.align are 'left', 'right', and 'center'. These options can be set globally or locally, depending on whether or not you want all figures to appear uniformly throughout the report, or if you want the options to vary by figure.
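As a rough sketch, global options are usually set with knitr::opts_chunk$set() in the setup chunk at the top of the file. The values below are illustrative assumptions, not the ones used in the exercise, and the requireNamespace() guard is only there so the snippet also runs where knitr is not installed.

```r
# Typically placed in the setup chunk at the top of the .Rmd file
if (requireNamespace("knitr", quietly = TRUE)) {
  knitr::opts_chunk$set(
    fig.align = "center",  # center every figure
    out.width = "80%",     # assumed width, adjust to taste
    echo = TRUE            # show code unless a chunk overrides it
  )
}
```

Any chunk can still override one of these values by setting the same option in its own chunk header.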

Setting chunk options locally

When creating a report, you may want to set the chunk options locally so that the figure display in the final report varies. The investment_annual_summary data provides helpful background information, but the focus of the report is on projects in Brazil. In this exercise, you’ll modify the chunk options locally so that the plots that display information about projects in Brazil appear slightly larger in the final report than the plot that provides the overview of the Investment Annual Summary data.

You can see in the final report that the plots that display information about projects in Brazil are slightly larger than the plot that provides the Investment Annual Summary overview. Notice that the fig.align = 'center' option remained in the setup code chunk at the top, so this option has been set globally and determines the alignment for all figures in the report.

Also note that you can override globally set options with locally set ones. For example, if you wanted all figures to display at 50% width except for two figures you wanted to be larger, you could include out.width = '50%' in the global options but out.width = '95%' in the local options of the two figures.

Adding figure captions

Now that the figures have been modified, you’ll add some captions to label the figures and provide some information about what is displayed in each plot.

Now you can see that each figure is labeled in the final report and includes a caption that describes what each plot displays.


3: Improving the Report

Now that you’ve learned how to add, label, and modify code chunks, you’ll learn about code chunk options. You can use these to determine whether the code and results appear in the knit report. You’ll also discover how to create lists and tables to include in your report.

Video: Organizing the report

Creating a bulleted list

Previously, you learned how to add text to your Markdown file to include additional information for your audience. Now, you’ll create a bulleted list to specify which regions are included in the investment_annual_summary data. Refer to the image below from the video to recall the list of regions that should be included in your list.

[Image: list of regions from the video. Source: DataCamp]

Remember, you can structure the list formatting by adding indentation before an item on the list.

Creating a numbered list

When adding a list to your report, you can use either bulleted or numbered lists. In this exercise, you’ll modify the bulleted list of regions from the previous exercise to create a numbered list.

Adding a numbered list to the report is a helpful way to quickly display the total number of regions that are included in the data.

Adding a table

Previously, you printed the datasets used in your report to your report so that the audience was able to look through the data themselves. Now, you’ll create a table of the investment_region_summary to display this information more clearly to your audience. The investment_region_summary provides the total of all investments for each region from the 2012 to 2018 fiscal years.

Now your report begins with a list of all regions in the investment_annual_summary data, as well as a table that summarizes the total investments that were made to each region across the 2012 to 2018 fiscal years.
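One common way to turn a data frame into a report table is knitr::kable(). The sketch below assumes that approach. The data frame is a made-up stand-in for investment_region_summary, so the column names and values are hypothetical.

```r
# Made-up totals, standing in for investment_region_summary
region_summary <- data.frame(
  region = c("Region A", "Region B"),
  dollars_in_millions = c(1200.5, 830.25)
)

# The guard only keeps the sketch runnable where knitr is not installed
if (requireNamespace("knitr", quietly = TRUE)) {
  region_table <- knitr::kable(
    region_summary,
    col.names = c("Region", "Dollars in millions"),
    align = c("c", "c"),  # center both columns
    caption = "Total investments by region (illustrative values)"
  )
  print(region_table)
}
```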

Video: Code chunk options

Question: Comparing code chunk options

  • include=FALSE: Code runs but does not appear in the report, results do not appear.
  • echo=FALSE: Code runs but does not appear in the report, results appear.
  • eval=FALSE: Code appears in report but does not run, results do not appear.

The other option you recently learned is collapse, which is the only option you’ve learned so far that has a default of FALSE.
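If you want to check these defaults yourself, knitr can report them. The following is a minimal sketch, guarded so that it also runs where knitr is not installed.

```r
if (requireNamespace("knitr", quietly = TRUE)) {
  chunk_defaults <- knitr::opts_chunk$get(
    c("include", "echo", "eval", "collapse"),
    default = TRUE  # report knitr's built-in defaults, not the current settings
  )
  # include, echo and eval default to TRUE; collapse defaults to FALSE
  str(chunk_defaults)
}
```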

Collapsing blocks in the knit report

By default, the collapse option is set to FALSE, and the code and any output appear in the knit file in separate blocks. You encountered this earlier when creating plots of the data. In this exercise, you’ll modify the Markdown file so that the code and resulting warning messages appear in the same block.

Notice that the warning messages for the plots now appear as comments in the same block as the code that creates the plot, instead of being separated into an individual block. You’ll learn more about warning messages and how to specify whether or not they are included in the report soon!

Modifying the report using include and echo

The exercises in the course have used include = FALSE to prevent the code and results of the setup and data chunks from appearing in the knit report. Although you won’t modify those options, since the code and results from those chunks should be excluded from the report, it’s important to note how they impact the final report.

In this exercise, you’ll use the echo option to modify whether or not the code appears in the report.

Even though the code no longer displays in the knit report, the results of the code still appear. This is an option you’ll want to use if your report is being written for a non-technical audience that is interested in the results but may not be interested in the code itself. Notice that the warning messages that mention that data was excluded from the plots still appear in the knit report. Next, you’ll learn how to modify whether or not these warning messages are included in the report!

Video: Warnings, messages, and errors

Excluding messages

In the past, you haven’t encountered messages in the report because the include option has been set to FALSE in the data chunk to prevent the code or the results from the code from appearing in the report. In this exercise, you’ll use the message option to prevent messages from appearing in the report, while still including the code in the report.

Notice that the code for the data chunk is now included without any of the messages that you would otherwise see when loading the data or packages used in the report.

Excluding warnings

Previously, you used the collapse option so that the code and resulting warning messages appear in the same block in the knit report. In this exercise, you’ll use the warning option to prevent warnings from appearing in the final report.

Notice that DataCamp added a sentence before each of the code chunks that were impacted to let the audience know that: ‘Projects that do not have an associated investment amount are excluded from the plot’. If you are excluding any warning messages from the report, it’s important to include information about any data that is not included and why.


4: Customizing the Report

In this final chapter, you’ll learn how to customize your report by adding a table of contents and adding a CSS file to the YAML header, to personalize reports with your brand’s fonts and colors. You’ll also learn how to efficiently create new reports from your data using parameters, which will save you time from manually updating existing reports to create new ones.

Video: Adding a table of contents

Adding the table of contents

Adding a table of contents is a useful way to give your audience an overview of what your report contains and to help them navigate through its sections. In this exercise, you’ll add a table of contents to your report to provide an overview of the topics that the report includes.

Now your audience can reference the table of contents at the beginning of your report to understand the information that will be covered in the report.

Specifying headers and number sectioning

Now that you’ve added a table of contents, you’ll modify how it appears in the report and which information it includes. You’ll use toc_depth to specify the depth of headers that will be included in the table of contents and number_sections to add section numbering for the headers in the report.

The headers were modified to start with a single hash before adding section numbering because, if the largest headers in the report start with two hashes, the section numbering will start with zeros. Remember that, for toc_depth, the default depth is 3 for HTML documents and 2 for PDF documents.

Adding table of contents options

When toc_float is included, the table of contents appears on the left side of the document and remains visible while the reader scrolls through the document. By default, it displays only the largest headers, expands as the reader moves through the report or uses the table of contents to navigate to another section, and animates page scrolls when navigating the report.

In this exercise, you’ll add toc_float and modify these settings using the collapsed and smooth_scroll fields so that the full table of contents remains visible and page scrolls are not animated.

If you want to add toc_float to the report and keep the default collapsed and smooth_scroll options (both turned on, i.e. TRUE, by default), you can set the toc_float field to true in the YAML header.

Video: Creating a report with a parameter

Adding a parameter to the report

In this exercise, you’ll add a parameter for country to the report and modify the existing code so that you can create new reports about the investment projects for any country included in the investment_services_projects data.

Now that you’ve reviewed the code for your report, you’ll review the text and the YAML header of the document before creating a new report using the country parameter.

Creating a new report using a parameter

Now that you’ve added a parameter to the document, you’ll create a new report for Bangladesh from the investment_services_projects data using the country parameter.

Before knitting the report, you’ll review and modify the text of the document to ensure that the knit report will reflect the country that is specified in the parameter.

Notice that by only modifying the parameter, you were able to create a new report for Bangladesh that includes all of the information that was previously provided for Brazil. When creating new documents using parameters, there may be some information you want to add that is specific to the new country and report, but parameters are an efficient and quick way to get started!
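To illustrate the idea outside of a report, here is a rough sketch in plain R. Inside a parameterized document, knitr makes the values from the YAML header available as a list called params; the sketch fakes that list by hand. The data frame, its total_investment column, and the file name in the final comment are all hypothetical.

```r
# Stand-in for the params list that knitr creates from the YAML header
params <- list(country = "Bangladesh")

# Made-up project data; the real investment_services_projects has many more columns
projects <- data.frame(
  country = c("Brazil", "Bangladesh", "Bangladesh"),
  total_investment = c(50, 20, 35)
)

# The filtering code refers to params$country instead of a hard-coded name
country_projects <- projects[projects$country == params$country, ]
nrow(country_projects)
## [1] 2

# To knit with a different value without editing the file (file name is hypothetical):
# rmarkdown::render("investment_report.Rmd", params = list(country = "Bangladesh"))
```

Because the country name appears only in the parameter, changing that one value is enough to point the same code at a different country.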

Video: Multiple parameters

Adding multiple parameters to the report

Previously, you added a parameter for country to create new reports to summarize information about the investment projects for any country included in the investment_services_projects data. Now, you’ll add parameters for the fiscal year and modify the existing code so that you can create new reports about the investment projects for any country and fiscal year from the investment_services_projects data.

Now, you’ll be able to create a report for any country and fiscal year by modifying only the parameters in the YAML header!

Creating a new report using multiple parameters

Now that you’ve added parameters to account for the fiscal year, you’ll create a new report for another country and fiscal year from the investment_services_projects data.

Now you’re able to create new reports using multiple parameters by modifying only the information in the YAML header. As before, there may be some information you want to add that is specific to the country and fiscal year, but these parameters will help you get started on a new report!

Video: Customizing the report

Customizing the report style

Now that you’ve learned how to customize the style of your report, you’ll begin to add specific fonts and colors to your existing report.

Notice that the document background and code chunks in the knit file are modified to reflect the various fonts and colors you listed.

Customizing the header and table of contents

In this exercise, you’ll continue to add styles by modifying the table of contents and header sections of the Markdown file.

Notice the difference in appearance of the table of contents section. If you want to customize any of these sections further, you can add the properties you’ve used so far to any of the sections listed. For example, you can add opacity to the pre section to customize the code chunks, the same way that you did with the #header section.

Customizing the title, author, and date

Previously, you modified the header of the document using the #header section. Now, you’ll practice customizing the title, author, and date sections individually.

In this version of the report, you used the same settings for the h4.author and h4.date sections, but you can use any colors or fonts you’d like with this syntax.

Referencing the CSS file

Rather than adding styles within each Markdown file, you can create a Cascading Style Sheets (CSS) file that contains particular styles and fonts, and reference it each time you create a new file.

In this exercise, the styles you’ve specified have been added to a CSS file called styles.css. You’ll reference this file in the YAML header instead of specifying the styles within the Markdown file.

Notice that, although the styles are no longer listed within the Markdown file, the knit report still reflects all of the styles you’ve been adding over the past few exercises.

Video: Congratulations!


Module 4: Data Manipulation with dplyr

Say you’ve found a great dataset and would like to learn more about it. How can you start to answer the questions you have about the data? You can use dplyr to answer those questions, and it can also help with basic transformations of your data. You’ll also learn to aggregate your data and add, remove, or change the variables. Along the way, you’ll explore a dataset containing information about counties in the United States. You’ll finish the course by applying these tools to the babynames dataset to explore trends of baby names in the United States.

1: Transforming Data with dplyr

Learn verbs you can use to transform your data, including select, filter, arrange, and mutate. You’ll use these functions to modify the counties dataset to view particular observations and answer questions about the data.

Video: The counties dataset

Question: Understanding your data

Load the counties data set and the dplyr package

counties <- read.csv("data_acs2015_county_data.csv")
library("dplyr")
## Warning: package 'dplyr' was built under R version 3.5.3
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Take a look at the counties dataset using the glimpse() function.

glimpse(counties)
## Observations: 3,141
## Variables: 40
## $ census_id          <int> 1001, 1003, 1005, 1007, 1009, 1011, 1013, 1015, ...
## $ state              <fct> Alabama, Alabama, Alabama, Alabama, Alabama, Ala...
## $ county             <fct> Autauga, Baldwin, Barbour, Bibb, Blount, Bullock...
## $ region             <fct> South, South, South, South, South, South, South,...
## $ metro              <fct> , , , , , , , , , , , , , , , , , , , , , , , , , 
## $ population         <int> 55221, 195121, 26932, 22604, 57710, 10678, 20354...
## $ men                <int> 26745, 95314, 14497, 12073, 28512, 5660, 9502, 5...
## $ women              <int> 28476, 99807, 12435, 10531, 29198, 5018, 10852, ...
## $ hispanic           <dbl> 2.6, 4.5, 4.6, 2.2, 8.6, 4.4, 1.2, 3.5, 0.4, 1.5...
## $ white              <dbl> 75.8, 83.1, 46.2, 74.5, 87.9, 22.2, 53.3, 73.0, ...
## $ black              <dbl> 18.5, 9.5, 46.7, 21.4, 1.5, 70.7, 43.8, 20.3, 40...
## $ native             <dbl> 0.4, 0.6, 0.2, 0.4, 0.3, 1.2, 0.1, 0.2, 0.2, 0.6...
## $ asian              <dbl> 1.0, 0.7, 0.4, 0.1, 0.1, 0.2, 0.4, 0.9, 0.8, 0.3...
## $ pacific            <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0...
## $ citizens           <int> 40725, 147695, 20714, 17495, 42345, 8057, 15581,...
## $ income             <int> 51281, 50254, 32964, 38678, 45813, 31938, 32229,...
## $ income_err         <int> 2391, 1263, 2973, 3995, 3141, 5884, 1793, 925, 2...
## $ income_per_cap     <int> 24974, 27317, 16824, 18431, 20532, 17580, 18390,...
## $ income_per_cap_err <int> 1080, 711, 798, 1618, 708, 2055, 714, 489, 1366,...
## $ poverty            <dbl> 12.9, 13.4, 26.7, 16.8, 16.7, 24.6, 25.4, 20.5, ...
## $ child_poverty      <dbl> 18.6, 19.2, 45.3, 27.9, 27.2, 38.4, 39.2, 31.6, ...
## $ professional       <dbl> 33.2, 33.1, 26.8, 21.5, 28.5, 18.8, 27.5, 27.3, ...
## $ service            <dbl> 17.0, 17.7, 16.1, 17.9, 14.1, 15.0, 16.6, 17.7, ...
## $ office             <dbl> 24.2, 27.1, 23.1, 17.8, 23.9, 19.7, 21.9, 24.2, ...
## $ construction       <dbl> 8.6, 10.8, 10.8, 19.0, 13.5, 20.1, 10.3, 10.5, 1...
## $ production         <dbl> 17.1, 11.2, 23.1, 23.7, 19.9, 26.4, 23.7, 20.4, ...
## $ drive              <dbl> 87.5, 84.7, 83.8, 83.2, 84.9, 74.9, 84.5, 85.3, ...
## $ carpool            <dbl> 8.8, 8.8, 10.9, 13.5, 11.2, 14.9, 12.4, 9.4, 11....
## $ transit            <dbl> 0.1, 0.1, 0.4, 0.5, 0.4, 0.7, 0.0, 0.2, 0.2, 0.2...
## $ walk               <dbl> 0.5, 1.0, 1.8, 0.6, 0.9, 5.0, 0.8, 1.2, 0.3, 0.6...
## $ other_transp       <dbl> 1.3, 1.4, 1.5, 1.5, 0.4, 1.7, 0.6, 1.2, 0.4, 0.7...
## $ work_at_home       <dbl> 1.8, 3.9, 1.6, 0.7, 2.3, 2.8, 1.7, 2.7, 2.1, 2.5...
## $ mean_commute       <dbl> 26.5, 26.4, 24.1, 28.8, 34.9, 27.5, 24.6, 24.1, ...
## $ employed           <int> 23986, 85953, 8597, 8294, 22189, 3865, 7813, 474...
## $ private_work       <dbl> 73.6, 81.5, 71.8, 76.8, 82.0, 79.5, 77.4, 74.1, ...
## $ public_work        <dbl> 20.9, 12.3, 20.8, 16.1, 13.5, 15.1, 16.2, 20.8, ...
## $ self_employed      <dbl> 5.5, 5.8, 7.3, 6.7, 4.2, 5.4, 6.2, 5.0, 2.8, 7.9...
## $ family_work        <dbl> 0.0, 0.4, 0.1, 0.4, 0.4, 0.0, 0.2, 0.1, 0.0, 0.5...
## $ unemployment       <dbl> 7.6, 7.5, 17.6, 8.3, 7.7, 18.0, 10.9, 12.3, 8.9,...
## $ land_area          <dbl> 594.44, 1589.78, 884.88, 622.58, 644.78, 622.81,...

Selecting columns

Select the following four columns from the counties variable:

  • state
  • county
  • population
  • poverty

You don’t need to save the result to a variable.

Select the columns

counties %>%
  select(state, county, population, poverty)

Recall that if you want to keep the data you’ve selected, you can use assignment to create a new table.
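As a minimal sketch of that assignment step, the example below builds a tiny stand-in table (the name counties_toy is hypothetical; its three rows are copied from the glimpse() output above) and saves the selected columns under a new name.

```r
library(dplyr)

# Hypothetical stand-in for counties: three Alabama rows from the glimpse() output
counties_toy <- data.frame(
  state = c("Alabama", "Alabama", "Alabama"),
  county = c("Autauga", "Baldwin", "Barbour"),
  population = c(55221, 195121, 26932),
  poverty = c(12.9, 13.4, 26.7),
  unemployment = c(7.6, 7.5, 17.6)
)

# Assigning the result of select() keeps it as a new table
counties_selected_toy <- counties_toy %>%
  select(state, county, population, poverty)

# The new table has only the four selected columns; counties_toy still has five
names(counties_selected_toy)
ncol(counties_toy)
```

Without the assignment, the selected columns are printed to the console and then discarded.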

Video: The filter and arrange verbs

Arranging observations

Here you see the counties_selected dataset with a few interesting variables selected. The variables private_work, public_work, and self_employed describe whether people work for private companies, for the government, or for themselves.

In these exercises, you’ll sort these observations to find the most interesting cases.

counties_selected <- counties %>%
  select(state, county, population, private_work, public_work, self_employed)

Add a verb to sort in descending order of public_work

counties_selected %>%
  arrange(desc(public_work))

We sorted the counties in descending order according to public_work. What if we were interested in looking at observations in counties that have a large population or within a specific state? Let’s take a look at that next!

Filtering for conditions

You use the filter() verb to get only observations that match a particular condition, or match multiple conditions.

counties_selected <- counties %>%
  select(state, county, population)

Filter for counties with a population above 1000000

counties_selected %>%
  filter(population > 1000000)

Filter for counties in the state of California that have a population above 1000000

counties_selected %>%
  filter(state == "California",
         population > 1000000)
##        state         county population
## 1 California        Alameda    1584983
## 2 California   Contra Costa    1096068
## 3 California    Los Angeles   10038388
## 4 California         Orange    3116069
## 5 California      Riverside    2298032
## 6 California     Sacramento    1465832
## 7 California San Bernardino    2094769
## 8 California      San Diego    3223096
## 9 California    Santa Clara    1868149

Now you know that there are 9 counties in the state of California with a population greater than one million. In the next exercise, you’ll practice filtering and then sorting a dataset to focus on specific observations!
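Conditions separated by commas inside filter() must all hold for a row to be kept, so they behave like conditions joined with &. The sketch below checks this on a small hand-typed table (the name toy is hypothetical; the four rows are taken from the outputs in this chapter).

```r
library(dplyr)

# Hypothetical four-row table; populations copied from the outputs in this chapter
toy <- data.frame(
  state = c("California", "California", "Texas", "Texas"),
  county = c("Los Angeles", "Alameda", "Harris", "Gregg"),
  population = c(10038388, 1584983, 4356362, 123178)
)

# Two conditions separated by a comma
with_comma <- toy %>%
  filter(state == "California", population > 1000000)

# The same two conditions combined explicitly with &
with_and <- toy %>%
  filter(state == "California" & population > 1000000)

# Both approaches keep the same rows
all.equal(with_comma, with_and)
```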

Filtering and arranging

You’ll often want to both filter and sort a dataset, to focus on the observations of particular interest to you. Here, you’ll find counties that are extreme examples of what fraction of the population works in the private sector.

counties_selected <- counties %>%
  select(state, county, population, private_work, public_work, self_employed)

Filter for Texas and more than 10000 people; sort in descending order of private_work

counties_selected %>%
  filter(state == "Texas", population > 10000) %>%
  arrange(desc(private_work))
##     state       county population private_work public_work self_employed
## 1   Texas        Gregg     123178         84.7         9.8           5.4
## 2   Texas       Collin     862215         84.1        10.0           5.8
## 3   Texas       Dallas    2485003         83.9         9.5           6.4
## 4   Texas       Harris    4356362         83.4        10.1           6.3
## 5   Texas      Andrews      16775         83.1         9.6           6.8
## 6   Texas      Tarrant    1914526         83.1        11.4           5.4
## 7   Texas        Titus      32553         82.5        10.0           7.4
## 8   Texas       Denton     731851         82.2        11.9           5.7
## 9   Texas        Ector     149557         82.0        11.2           6.7
## 10  Texas        Moore      22281         82.0        11.7           5.9
## 11  Texas    Jefferson     252872         81.9        13.4           4.5
## 12  Texas    Fort Bend     658331         81.7        12.5           5.7
## 13  Texas       Panola      23900         81.5        11.7           6.9
## 14  Texas      Midland     151290         81.2        11.4           7.1
## 15  Texas       Potter     122352         81.2        12.4           6.2
## 16  Texas         Frio      18168         81.1        15.4           3.4
## 17  Texas      Johnson     155450         80.9        13.3           5.7
## 18  Texas        Smith     217552         80.9        12.7           6.3
## 19  Texas       Orange      83217         80.8        13.7           5.4
## 20  Texas     Harrison      66417         80.6        13.9           5.4
## 21  Texas     Brazoria     331741         80.5        14.4           5.0
## 22  Texas      Calhoun      21666         80.5        12.8           6.7
## 23  Texas       Austin      28886         80.4        11.8           7.7
## 24  Texas   Montgomery     502586         80.4        11.7           7.8
## 25  Texas    Jim Wells      41461         80.3        12.6           7.1
## 26  Texas   Hutchinson      21858         80.2        13.9           5.8
## 27  Texas    Ochiltree      10642         80.1        10.4           9.2
## 28  Texas     Victoria      90099         80.1        12.5           7.1
## 29  Texas       Hardin      55375         80.0        15.7           4.2
## 30  Texas        Bexar    1825502         79.6        14.8           5.6
## 31  Texas         Gray      22983         79.6        13.3           7.0
## 32  Texas     McLennan     241505         79.5        14.7           5.7
## 33  Texas       Newton      14231         79.4        15.8           4.8
## 34  Texas         Ward      11225         79.3        15.0           5.3
## 35  Texas       Grimes      26961         79.2        15.0           5.7
## 36  Texas        Lamar      49566         79.2        13.9           6.7
## 37  Texas         Rusk      53457         79.2        12.8           7.9
## 38  Texas      Wharton      41264         78.9        12.3           8.6
## 39  Texas   Williamson     473592         78.9        14.9           6.0
## 40  Texas       Marion      10248         78.7        15.2           5.9
## 41  Texas         Hood      53171         78.6        11.0           9.9
## 42  Texas         Hunt      88052         78.6        14.4           7.0
## 43  Texas        Ellis     157058         78.5        14.2           7.3
## 44  Texas       Upshur      40096         78.5        13.0           8.3
## 45  Texas      Grayson     122780         78.4        13.8           7.5
## 46  Texas      Liberty      77486         78.4        13.6           7.8
## 47  Texas     Atascosa      47050         78.1        14.3           7.5
## 48  Texas       Waller      45847         78.1        16.3           5.6
## 49  Texas   Deaf Smith      19245         78.0        13.6           7.9
## 50  Texas     Chambers      37251         77.9        15.9           6.1
## 51  Texas       Jasper      35768         77.7        15.5           6.7
## 52  Texas       Scurry      17238         77.7        16.2           6.0
## 53  Texas       Parmer      10004         77.6        12.5           9.3
## 54  Texas San Patricio      66070         77.6        17.9           4.4
## 55  Texas       Taylor     134435         77.5        16.5           5.7
## 56  Texas      Fayette      24849         77.4        13.4           9.0
## 57  Texas      Kaufman     109289         77.4        16.7           5.8
## 58  Texas    Matagorda      36598         77.4        15.6           7.0
## 59  Texas        Comal     119632         77.2        13.2           9.1
## 60  Texas     Gonzales      20172         77.2        13.2           9.1
## 61  Texas         Hill      34923         77.1        15.0           7.6
## 62  Texas      Hockley      23322         77.1        15.6           7.1
## 63  Texas    Guadalupe     143460         77.0        17.9           4.8
## 64  Texas        Cooke      38761         76.9        14.8           8.1
## 65  Texas     Rockwall      85536         76.9        16.2           6.7
## 66  Texas         Wise      61243         76.9        14.9           8.1
## 67  Texas       Gaines      18916         76.8        12.0          11.2
## 68  Texas    Tom Green     115056         76.8        16.5           6.6
## 69  Texas        Erath      40039         76.7        15.3           7.9
## 70  Texas      Hopkins      35645         76.7        13.2           9.8
## 71  Texas   Palo Pinto      27921         76.7        14.1           9.0
## 72  Texas       Nueces     352060         76.6        16.3           6.9
## 73  Texas       Parker     121418         76.6        15.2           8.0
## 74  Texas      Lubbock     290782         76.5        16.8           6.5
## 75  Texas       Travis    1121645         76.5        16.0           7.4
## 76  Texas     Eastland      18328         76.4        15.1           7.8
## 77  Texas     Montague      19478         76.4        13.3           9.9
## 78  Texas     Angelina      87748         76.2        17.8           5.9
## 79  Texas      Jackson      14486         76.1        13.6           9.3
## 80  Texas        Nolan      15061         76.1        17.1           6.5
## 81  Texas         Cass      30328         76.0        15.9           8.1
## 82  Texas        Duval      11577         76.0        17.2           6.4
## 83  Texas         Hale      35504         75.9        16.4           7.4
## 84  Texas    Galveston     308163         75.8        18.2           5.8
## 85  Texas       Bosque      17971         75.7        12.8          11.2
## 86  Texas    Henderson      79016         75.6        13.9          10.4
## 87  Texas        Bowie      93155         75.5        19.4           5.1
## 88  Texas      Aransas      24292         75.4        12.4          12.2
## 89  Texas      Randall     126782         75.4        17.5           7.1
## 90  Texas       Morris      12700         75.3        17.5           7.1
## 91  Texas        Rains      11037         75.3        15.6           9.0
## 92  Texas       Shelby      25725         75.3        16.8           7.8
## 93  Texas        Brown      37833         75.2        16.5           8.2
## 94  Texas       Sabine      10440         75.1        18.1           6.8
## 95  Texas         Wood      42712         75.0        15.6           9.0
## 96  Texas    Gillespie      25398         74.8        10.5          14.3
## 97  Texas      Navarro      48118         74.8        17.3           7.8
## 98  Texas     Caldwell      39347         74.7        18.6           6.7
## 99  Texas        Tyler      21462         74.6        19.8           5.3
## 100 Texas         Webb     263251         74.6        19.4           5.8
## 101 Texas      Wichita     131957         74.6        19.2           6.0
## 102 Texas          Lee      16664         74.5        16.7           8.4
## 103 Texas       Burnet      44144         74.1        14.2          11.7
## 104 Texas      El Paso     831095         73.9        20.0           6.0
## 105 Texas      Kendall      37361         73.9        15.7          10.4
## 106 Texas         Lamb      13742         73.8        16.1           9.9
## 107 Texas         Camp      12516         73.7        16.0           9.9
## 108 Texas    Freestone      19586         73.6        19.0           7.3
## 109 Texas       Medina      47392         73.6        17.3           9.0
## 110 Texas   Washington      34236         73.6        18.1           8.3
## 111 Texas         Hays     177562         73.5        19.4           7.0
## 112 Texas     Live Oak      11873         73.5        14.2          12.1
## 113 Texas       Lavaca      19549         73.4        14.4          12.1
## 114 Texas       Zapata      14308         73.3        18.1           8.7
## 115 Texas       Fannin      33748         73.1        19.8           7.1
## 116 Texas        Young      18329         73.1        15.7          10.8
## 117 Texas     Cherokee      51167         72.9        20.4           6.5
## 118 Texas         Kerr      50149         72.8        16.7          10.4
## 119 Texas      Cameron     417947         72.7        18.2           8.9
## 120 Texas     Colorado      20757         72.6        17.3           9.0
## 121 Texas         Clay      10479         72.4        17.0          10.4
## 122 Texas       DeWitt      20540         72.4        17.7           9.6
## 123 Texas     Franklin      10599         72.3        13.7          13.3
## 124 Texas     Comanche      13623         72.2        15.0          12.3
## 125 Texas       Howard      36105         72.0        23.0           4.9
## 126 Texas        Llano      19323         72.0        11.4          16.4
## 127 Texas  Nacogdoches      65531         72.0        20.0           7.9
## 128 Texas       Wilson      45509         71.8        20.6           7.6
## 129 Texas    Van Zandt      52736         71.6        16.8          11.1
## 130 Texas      Bastrop      76948         71.3        20.1           8.3
## 131 Texas     Callahan      13532         71.2        17.0           9.4
## 132 Texas        Falls      17410         70.8        20.5           8.4
## 133 Texas      Hidalgo     819217         70.7        17.2          11.9
## 134 Texas          Bee      32659         70.5        22.7           6.7
## 135 Texas       Dimmit      10682         70.5        21.9           7.6
## 136 Texas       Blanco      10723         70.4        17.5          11.8
## 137 Texas       Zavala      12060         70.4        23.8           5.6
## 138 Texas        Terry      12687         70.2        19.7           9.8
## 139 Texas     Anderson      57915         69.8        23.6           6.4
## 140 Texas    Robertson      16532         69.8        22.4           7.6
## 141 Texas       Karnes      14879         69.6        22.0           7.8
## 142 Texas      Kleberg      32029         69.5        25.9           4.4
## 143 Texas     Burleson      17293         69.4        21.7           8.4
## 144 Texas         Leon      16819         69.4        14.5          15.7
## 145 Texas      Bandera      20796         69.2        19.0          11.6
## 146 Texas         Bell     326041         69.2        25.7           5.0
## 147 Texas  San Jacinto      27023         69.1        23.0           7.6
## 148 Texas     Maverick      56548         68.8        25.0           6.1
## 149 Texas      Trinity      14405         68.8        17.8          12.1
## 150 Texas        Jones      19978         68.7        20.5          10.3
## 151 Texas         Polk      46113         68.7        22.2           8.7
## 152 Texas        Pecos      15807         68.6        24.5           6.0
## 153 Texas       Dawson      13542         68.5        21.9           9.5
## 154 Texas        Milam      24344         68.5        21.3           9.8
## 155 Texas    Red River      12567         68.5        17.4          13.8
## 156 Texas       Brazos     205271         68.3        25.6           5.8
## 157 Texas        Starr      62648         68.2        22.5           8.6
## 158 Texas      Houston      22949         67.8        22.1           9.8
## 159 Texas       Reeves      14179         67.8        28.3           3.9
## 160 Texas      Runnels      10445         67.7        19.9          12.1
## 161 Texas      Willacy      22002         66.9        23.9           9.1
## 162 Texas       Uvalde      26952         66.8        23.8           9.4
## 163 Texas    Wilbarger      13158         66.0        29.6           4.4
## 164 Texas    Val Verde      48980         65.9        28.9           5.1
## 165 Texas     Lampasas      20219         65.5        26.5           8.0
## 166 Texas      Madison      13838         65.1        22.4          11.8
## 167 Texas    Limestone      23454         61.3        28.6          10.1
## 168 Texas      Coryell      76128         60.3        34.2           5.3
## 169 Texas       Walker      69330         59.5        36.2           4.2

You’ve learned how to filter and sort a dataset to answer questions about the data. Notice that you only need to slightly modify your code if you are interested in sorting the observations by a different column.
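To illustrate how small that change is, the sketch below sorts the same filtered rows by two different columns. The table texas_toy is a hypothetical stand-in holding the first three rows of the output above; only the column name inside desc() differs between the two pipelines.

```r
library(dplyr)

# Hypothetical stand-in: first three rows of the Texas output above
texas_toy <- data.frame(
  state = c("Texas", "Texas", "Texas"),
  county = c("Gregg", "Collin", "Dallas"),
  population = c(123178, 862215, 2485003),
  private_work = c(84.7, 84.1, 83.9),
  public_work = c(9.8, 10.0, 9.5),
  self_employed = c(5.4, 5.8, 6.4)
)

# Sort by private_work, as in the exercise
by_private <- texas_toy %>%
  filter(population > 10000) %>%
  arrange(desc(private_work))

# Sort by self_employed instead: only the argument to desc() changes
by_self <- texas_toy %>%
  filter(population > 10000) %>%
  arrange(desc(self_employed))

by_private$county
by_self$county
```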

Video: Mutate

Calculating the number of government employees

In the video, you used the unemployment variable, which is a percentage, to calculate the number of unemployed people in each county. In this exercise, you’ll do the same with another percentage variable: public_work.

The code provided already selects the state, county, population, and public_work columns.

counties_selected <- counties %>%
  select(state, county, population, public_work)

Add a public_workers column with the number of government employees, then sort in descending order of that column

counties_selected %>%
  mutate(public_workers = public_work * population / 100) %>%
  arrange(desc(public_workers))

It looks like Los Angeles is the county with the most government employees.
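The conversion from a percentage to a count is just percentage times population divided by 100. As a rough worked example, the sketch below applies it to three Texas rows copied from the earlier output (the name texas_toy is hypothetical).

```r
library(dplyr)

# Hypothetical stand-in: three Texas rows from the earlier output
texas_toy <- data.frame(
  county = c("Gregg", "Collin", "Dallas"),
  population = c(123178, 862215, 2485003),
  public_work = c(9.8, 10.0, 9.5)
)

# public_work is a percentage, so divide by 100 to get a count of people
texas_workers <- texas_toy %>%
  mutate(public_workers = public_work * population / 100) %>%
  arrange(desc(public_workers))

# For Collin: 10.0 * 862215 / 100 = 86221.5
texas_workers
```

Note that the county with the highest percentage (Collin) is not the one with the most government employees, because the count also depends on population.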

Calculating the percentage of women in a county

The dataset includes columns for the total number (not percentage) of men and women in each county. You could use this, along with the population variable, to compute the fraction of men (or women) within each county.

In this exercise, you’ll select the relevant columns yourself.

Select the columns state, county, population, men, and women

counties_selected <- counties %>%
  select(state, county, population, men, women)

Calculate proportion_women as the fraction of the population made up of women

counties_selected %>%
  mutate(proportion_women = women / population)

Notice that the proportion_women variable appears as a new column in the result, which now has 6 columns instead of 5.
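Since the exercise does not assign the output of mutate(), the new column exists only in the printed result. The sketch below shows the difference on a hypothetical stand-in table (alabama_toy, with three rows copied from the glimpse() output above).

```r
library(dplyr)

# Hypothetical stand-in: three Alabama rows from the glimpse() output
alabama_toy <- data.frame(
  state = c("Alabama", "Alabama", "Alabama"),
  county = c("Autauga", "Baldwin", "Barbour"),
  population = c(55221, 195121, 26932),
  men = c(26745, 95314, 14497),
  women = c(28476, 99807, 12435)
)

# Save the mutated table under a new name to keep the extra column
with_proportion <- alabama_toy %>%
  mutate(proportion_women = women / population)

# The saved result has one more column than the table it was built from
ncol(alabama_toy)
ncol(with_proportion)
```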

Select, mutate, filter, and arrange

In this exercise, you’ll put together everything you’ve learned in this chapter (select(), mutate(), filter() and arrange()), to find the counties with the highest proportion of men.

counties %>%
  # Select the five columns 
  select(state, county, population, men, women) %>%
  # Add the proportion_men variable
  mutate(proportion_men = men / population) %>%
  # Filter for population of at least 10,000
  filter(population >= 10000) %>%
  # Arrange proportion of men in descending order 
  arrange(desc(proportion_men)) 
##               state                       county population     men   women
## 1          Virginia                       Sussex      11864    8130    3734
## 2        California                       Lassen      32645   21818   10827
## 3           Georgia                Chattahoochee      11914    7940    3974
## 4         Louisiana               West Feliciana      15415   10228    5187
## 5           Florida                        Union      15191    9830    5361
## 6             Texas                        Jones      19978   12652    7326
## 7          Missouri                       DeKalb      12782    8080    4702
## 8             Texas                      Madison      13838    8648    5190
## 9          Virginia                  Greensville      11760    7303    4457
## 10            Texas                     Anderson      57915   35469   22446
## 11         Arkansas                      Lincoln      14062    8596    5466
## 12            Texas                          Bee      32659   19722   12937
## 13          Florida                     Hamilton      14395    8671    5724
## 14         Illinois                     Lawrence      16665    9997    6668
## 15      Mississippi                 Tallahatchie      14959    8907    6052
## 16      Mississippi                       Greene      14129    8409    5720
## 17            Texas                       Karnes      14879    8799    6080
## 18            Texas                         Frio      18168   10723    7445
## 19          Florida                         Gulf      15785    9235    6550
## 20          Florida                     Franklin      11628    6800    4828
## 21            Texas                       Walker      69330   40484   28846
## 22          Georgia                      Telfair      16416    9502    6914
## 23         Colorado                      Fremont      46809   27003   19806
## 24        Louisiana                        Allen      25653   14784   10869
## 25             Ohio                        Noble      14508    8359    6149
## 26          Georgia                     Tattnall      25302   14529   10773
## 27        Louisiana                    Claiborne      16639    9524    7115
## 28            Texas                        Pecos      15807    9003    6804
## 29         Missouri                      Pulaski      53443   30390   23053
## 30            Texas                       Howard      36105   20496   15609
## 31         Illinois                      Johnson      12829    7267    5562
## 32            Texas                       Dawson      13542    7670    5872
## 33          Florida                       DeSoto      34957   19756   15201
## 34            Texas                       Reeves      14179    8004    6175
## 35          Florida                       Taylor      22685   12781    9904
## 36        Tennessee                      Bledsoe      13686    7677    6009
## 37        Louisiana                        Grant      22362   12523    9839
## 38          Florida                      Wakulla      31128   17366   13762
## 39        Tennessee                       Morgan      21794   12150    9644
## 40         Kentucky                       Morgan      13428    7477    5951
## 41          Florida                     Bradford      27223   15150   12073
## 42       California                        Kings     150998   83958   67040
## 43         Kentucky                       Martin      12631    7020    5611
## 44         Virginia                   Buckingham      17068    9476    7592
## 45       California                    Del Norte      27788   15418   12370
## 46         Missouri                         Pike      18517   10273    8244
## 47          Florida                      Jackson      48900   27124   21776
## 48      Mississippi                        Yazoo      27911   15460   12451
## 49          Florida                       Glades      13272    7319    5953
## 50         New York                     Franklin      51280   28253   23027
## 51     Pennsylvania                        Union      44958   24755   20203
## 52        Tennessee                        Wayne      16897    9295    7602
## 53         Colorado                       Summit      28940   15912   13028
## 54            Texas                       Grimes      26961   14823   12138
## 55          Georgia                        Macon      14045    7720    6325
## 56             Ohio                      Madison      43456   23851   19605
## 57         Arkansas                  St. Francis      27345   14996   12349
## 58         Michigan                     Chippewa      38586   21156   17430
## 59         Kentucky                     McCreary      18001    9863    8138
## 60          Florida                   Washington      24629   13478   11151
## 61         Virginia                Prince George      37380   20440   16940
## 62          Florida                      Calhoun      14615    7991    6624
## 63   North Carolina                        Avery      17695    9673    8022
## 64          Indiana                     Sullivan      21111   11540    9571
## 65           Oregon                      Malheur      30551   16697   13854
## 66          Wyoming                       Carbon      15739    8592    7147
## 67         Maryland                     Somerset      25980   14150   11830
## 68        Tennessee                     Hardeman      26253   14297   11956
## 69          Florida                        Dixie      16091    8746    7345
## 70          Florida                       Hardee      27468   14920   12548
## 71            Texas                      Willacy      22002   11947   10055
## 72     North Dakota                     Williams      29619   16066   13553
## 73         Colorado                        Logan      21928   11889   10039
## 74         Michigan                     Houghton      36660   19876   16784
## 75            Texas                        Tyler      21462   11629    9833
## 76   North Carolina                       Onslow     183753   99526   84227
## 77         New York                      Wyoming      41446   22442   19004
## 78         Illinois                     Randolph      33069   17902   15167
## 79       California                       Amador      36995   20012   16983
## 80            Texas                     Live Oak      11873    6421    5452
## 81         Michigan                      Gogebic      15824    8556    7268
## 82         Oklahoma                       Hughes      13785    7450    6335
## 83            Texas                       Scurry      17238    9312    7926
## 84          Georgia                        Dooly      14293    7717    6576
## 85        Wisconsin                        Adams      20451   11033    9418
## 86          Florida                   Okeechobee      39255   21176   18079
## 87   North Carolina                       Greene      21328   11499    9829
## 88         Missouri                 St. Francois      66010   35586   30424
## 89            Texas                        Terry      12687    6835    5852
## 90         Colorado                     Gunnison      15651    8427    7224
## 91         Missouri                  Mississippi      14208    7650    6558
## 92          Alabama                      Barbour      26932   14497   12435
## 93          Indiana                        Miami      36211   19482   16729
## 94   South Carolina                    Edgefield      26466   14234   12232
## 95         Oklahoma                     Okfuskee      12248    6585    5663
## 96         Virginia                     Powhatan      28207   15159   13048
## 97        Louisiana               East Feliciana      19855   10666    9189
## 98         Oklahoma                      Beckham      23300   12515   10785
## 99            Texas                         Polk      46113   24757   21356
## 100       Wisconsin                      Jackson      20543   11022    9521
## 101          Alaska Fairbanks North Star Borough      99705   53477   46228
## 102         Indiana                        Perry      19414   10407    9007
## 103         Arizona                       Graham      37407   20049   17358
## 104        Kentucky                         Clay      21300   11415    9885
## 105       Tennessee                      Johnson      18017    9643    8374
## 106        Colorado                      Chaffee      18309    9798    8511
## 107       Louisiana                    Catahoula      10247    5483    4764
## 108         Georgia                     Charlton      13130    7024    6106
## 109         Florida                       Holmes      19635   10501    9134
## 110          Alaska        Kodiak Island Borough      13973    7468    6505
## 111          Kansas                  Leavenworth      78227   41806   36421
## 112         Alabama                         Bibb      22604   12073   10531
## 113       Minnesota                         Pine      29218   15605   13613
## 114        Colorado                        Grand      14411    7689    6722
## 115  South Carolina                     Marlboro      27993   14930   13063
## 116          Kansas                        Riley      75022   40002   35020
## 117           Texas                         Gray      22983   12253   10730
## 118           Texas                      Houston      22949   12227   10722
## 119        Oklahoma                     Woodward      20986   11172    9814
## 120            Ohio                       Marion      65943   35022   30921
## 121         Georgia                        Butts      23445   12451   10994
## 122    Pennsylvania                        Wayne      51642   27420   24222
## 123        Illinois                        Perry      21810   11572   10238
## 124        Michigan                      Gratiot      41878   22213   19665
## 125        Colorado                        Eagle      52576   27887   24689
## 126         Indiana                       Putnam      37650   19967   17683
## 127        Missouri                       Cooper      17593    9328    8265
## 128        Colorado                       Pitkin      17420    9235    8185
## 129         Wyoming                       Goshen      13544    7180    6364
## 130         Alabama                      Bullock      10678    5660    5018
## 131         Florida                    Jefferson      14198    7522    6676
## 132         Florida                       Hendry      38363   20318   18045
## 133        Virginia                     Nottoway      15711    8320    7391
## 134         Georgia                        Dodge      21180   11215    9965
## 135   Massachusetts                    Nantucket      10556    5589    4967
## 136        Michigan                        Ionia      64064   33917   30147
## 137         Wyoming                     Sublette      10117    5353    4764
## 138       Wisconsin                       Juneau      26494   14018   12476
## 139         Florida                       Monroe      75901   40159   35742
## 140    Pennsylvania                   Huntingdon      45906   24284   21622
## 141        Illinois                          Lee      35027   18524   16503
## 142       Louisiana                         Winn      14855    7851    7004
## 143            Ohio                     Pickaway      56515   29866   26649
## 144        Kentucky                       Oldham      63037   33307   29730
## 145           Texas                       Fannin      33748   17810   15938
## 146        New York                       Seneca      35144   18527   16617
## 147           Texas                         Rusk      53457   28172   25285
## 148         Florida                      Madison      18729    9869    8860
## 149        Colorado                         Park      16189    8525    7664
## 150         Florida                    Gilchrist      16992    8946    8046
## 151         Florida                        Baker      27135   14277   12858
## 152          Alaska           Bethel Census Area      17776    9351    8425
## 153        Virginia           Manassas Park city      15625    8219    7406
## 154          Nevada                     Humboldt      17067    8971    8096
## 155       Wisconsin                     Waushara      24321   12783   11538
## 156           Texas                       DeWitt      20540   10795    9745
## 157        Oklahoma                        Atoka      13906    7307    6599
## 158           Idaho                        Idaho      16312    8571    7741
## 159            Ohio                         Ross      77334   40627   36707
## 160        Oklahoma                        Texas      21588   11340   10248
## 161            Utah                      Sanpete      28261   14845   13416
## 162        Illinois                      Fayette      22136   11622   10514
## 163       Tennessee                      Hickman      24283   12745   11538
## 164        Virginia                  Southampton      18410    9661    8749
## 165        Virginia                    Brunswick      16930    8882    8048
## 166        Virginia                    Lunenburg      12558    6588    5970
## 167        Virginia                          Lee      25206   13223   11983
## 168        Kentucky                    Christian      74159   38891   35268
## 169         Wyoming                       Albany      37565   19692   17873
## 170    South Dakota                        Meade      26381   13823   12558
## 171  North Carolina                        Anson      26135   13693   12442
## 172          Alaska      Kenai Peninsula Borough      57221   29974   27247
## 173        Missouri                     Moniteau      15801    8277    7524
## 174        Missouri                     Randolph      25135   13159   11976
## 175    Pennsylvania                       Centre     157823   82583   75240
## 176        Missouri                       Phelps      45029   23557   21472
## 177           Idaho                        Teton      10285    5379    4906
## 178       Wisconsin                        Dodge      88547   46309   42238
## 179     Mississippi                    Sunflower      27911   14594   13317
## 180       Louisiana                       Vernon      52476   27435   25041
## 181          Nevada                         Elko      51562   26951   24611
## 182        Missouri                        Osage      13758    7190    6568
## 183           Texas                    Freestone      19586   10230    9356
## 184        Colorado                        Routt      23606   12328   11278
## 185  South Carolina                          Lee      18461    9641    8820
## 186          Alaska    Matanuska-Susitna Borough      96178   50205   45973
## 187        Virginia                 Norfolk city     245452  128079  117373
## 188      California                     Tuolumne      54079   28218   25861
## 189         Wyoming                   Sweetwater      44772   23359   21413
## 190         Georgia                     Mitchell      22982   11990   10992
## 191           Texas                        Duval      11577    6038    5539
## 192          Oregon                     Umatilla      76738   40004   36734
## 193        New York                        Essex      38912   20283   18629
## 194       Tennessee                   Lauderdale      27427   14296   13131
## 195      Washington                     Franklin      86443   45052   41391
## 196    South Dakota                      Yankton      22636   11796   10840
## 197      New Mexico                      Socorro      17494    9115    8379
## 198          Kansas                        Geary      36787   19166   17621
## 199      New Mexico                     Torrance      15853    8258    7595
## 200        Maryland                     Allegany      73549   38300   35249
## 201        Illinois                   Montgomery      29348   15282   14066
## 202        Colorado                       Moffat      13117    6830    6287
## 203       Wisconsin                     Chippewa      63209   32911   30298
## 204    Pennsylvania                   Clearfield      81343   42338   39005
## 205        New York                    Jefferson     118947   61875   57072
## 206      California                       Colusa      21396   11129   10267
## 207        Oklahoma                        Caddo      29495   15339   14156
## 208            Utah                         Juab      10400    5404    4996
## 209         Arizona                        Pinal     389772  202502  187270
## 210       Wisconsin                        Grant      51489   26741   24748
## 211            Iowa                        Jones      20560   10674    9886
## 212        Arkansas                        Izard      13480    6997    6483
## 213    North Dakota                        Stark      28628   14859   13769
## 214         Wyoming                     Campbell      48013   24914   23099
## 215        Illinois                       Fulton      36323   18844   17479
## 216            Iowa                        Story      93586   48551   45035
## 217         Montana                    Jefferson      11502    5965    5537
## 218            Iowa                    Jefferson      17318    8981    8337
## 219        Oklahoma                     Comanche     125531   65098   60433
## 220         Indiana                  Switzerland      10500    5443    5057
## 221           Texas                      Kleberg      32029   16601   15428
## 222           Idaho                      Gooding      15233    7895    7338
## 223        Michigan                     Manistee      24536   12714   11822
## 224        Virginia                   Montgomery      96467   49981   46486
## 225    Pennsylvania                     Somerset      76617   39695   36922
## 226    Pennsylvania                       Greene      37938   19652   18286
## 227        Nebraska                       Saline      14360    7438    6922
## 228        New York                   Washington      62700   32471   30229
## 229    North Dakota                         Ward      67736   35076   32660
## 230        Virginia                         Wise      40530   20986   19544
## 231      Washington                        Mason      60791   31476   29315
## 232   West Virginia                   Monongalia     101668   52641   49027
## 233          Oregon                       Morrow      11204    5800    5404
## 234        Nebraska                       Colfax      10522    5445    5077
## 235         Indiana                      LaPorte     111280   57575   53705
## 236        New York                       Greene      48312   24996   23316
## 237         Florida                     Columbia      67806   35080   32726
## 238           Texas                      Wichita     131957   68245   63712
## 239         Wyoming                        Teton      22311   11537   10774
## 240       Minnesota                      Chisago      53834   27832   26002
## 241       Louisiana                      LaSalle      14899    7702    7197
## 242      California                         Mono      14146    7311    6835
## 243           Texas                        Moore      22281   11512   10769
## 244        Missouri                        Texas      25735   13296   12439
## 245      New Mexico                        Curry      50497   26088   24409
## 246           Texas                       Medina      47392   24482   22910
## 247           Texas                         Hale      35504   18337   17167
## 248           Texas                       Newton      14231    7347    6884
## 249         Indiana                        Henry      49146   25371   23775
## 250        Kentucky                      Carroll      10830    5590    5240
## 251           Idaho                       Elmore      26175   13509   12666
## 252       Minnesota                       Nobles      21687   11190   10497
## 253           Idaho                       Owyhee      11364    5863    5501
## 254  South Carolina                      Hampton      20473   10561    9912
## 255          Kansas                         Ford      34714   17906   16808
## 256        Missouri                   Washington      25048   12920   12128
## 257           Idaho                      Fremont      12945    6677    6268
## 258        Illinois                     Crawford      19541   10078    9463
## 259        Michigan                       Branch      43706   22539   21167
## 260    North Dakota                     Richland      16317    8414    7903
## 261        Michigan                     Montcalm      63004   32482   30522
## 262           Texas                    Limestone      23454   12091   11363
## 263        New York                      Clinton      81685   42107   39578
## 264            Iowa                         Page      15660    8072    7588
## 265      New Jersey                   Cumberland     157035   80938   76097
## 266         Georgia                       Camden      51445   26515   24930
## 267         Georgia                    Chattooga      25241   13007   12234
## 268        Kentucky                        Union      15138    7800    7338
## 269         Montana                      Sanders      11346    5846    5500
## 270   West Virginia                      Preston      33809   17419   16390
## 271         Montana                     Gallatin      95323   49111   46212
## 272       Minnesota                      Carlton      35443   18256   17187
## 273  North Carolina                      Pamlico      12982    6685    6297
## 274        Colorado                   Las Animas      14503    7467    7036
## 275           Texas                       Parmer      10004    5150    4854
## 276         Alabama                     Escambia      37935   19524   18411
## 277       Wisconsin                     Crawford      16483    8483    8000
## 278            Iowa                      Webster      37295   19193   18102
## 279   West Virginia                     Randolph      29365   15111   14254
## 280      Washington                 Grays Harbor      71419   36748   34671
## 281       Louisiana                      Jackson      16109    8288    7821
## 282      California                      Trinity      13373    6878    6495
## 283      Washington                 Pend Oreille      12968    6665    6303
## 284    North Dakota                  Grand Forks      68979   35439   33540
## 285          Alaska    Ketchikan Gateway Borough      13699    7038    6661
## 286    Pennsylvania                       McKean      42884   22029   20855
## 287        Illinois                    Jefferson      38578   19817   18761
## 288      New Mexico                          Lea      68149   35007   33142
## 289        Missouri                      Johnson      54155   27818   26337
## 290           Texas                       Potter     122352   62842   59510
## 291        Virginia                     New Kent      19560   10045    9515
## 292      California                         Kern     865736  444547  421189
## 293        Illinois                      Clinton      37929   19473   18456
## 294   West Virginia                    Hampshire      23542   12085   11457
## 295         Georgia                        Wayne      30046   15421   14625
## 296           Idaho                        Latah      38339   19676   18663
## 297          Nevada                  Carson City      54482   27959   26523
## 298        Arkansas                   Hot Spring      33316   17089   16227
## 299       Wisconsin                     Bayfield      15050    7719    7331
## 300          Oregon                    Jefferson      22061   11313   10748
## 301            Iowa                  Buena Vista      20507   10516    9991
## 302        New York                       Cayuga      79173   40576   38597
## 303  North Carolina                    Alexander      37158   19042   18116
## 304       Minnesota              Yellow Medicine      10092    5170    4922
## 305    North Dakota                        Walsh      11005    5635    5370
## 306         Montana                     Richland      11132    5699    5433
## 307          Alaska       Anchorage Municipality     299107  153122  145985
## 308      California                     Monterey     428441  219299  209142
## 309        Michigan                      Jackson     159759   81765   77994
## 310       Minnesota                       Roseau      15615    7991    7624
## 311        Oklahoma                        Craig      14744    7545    7199
## 312      New Mexico                   Los Alamos      17939    9179    8760
## 313          Alaska      Juneau City and Borough      32531   16645   15886
## 314      Washington                  Walla Walla      59726   30558   29168
## 315       Wisconsin                      Burnett      15334    7845    7489
## 316      California                     Imperial     178206   91167   87039
## 317        Colorado                     Garfield      57076   29186   27890
## 318           Texas                    Jefferson     252872  129292  123580
## 319         Arizona                         Yuma     202987  103779   99208
## 320            Utah                       Uintah      35721   18262   17459
## 321        Nebraska                       Seward      16998    8690    8308
## 322        Arkansas                        Scott      10870    5557    5313
## 323        Oklahoma                        Payne      79423   40596   38827
## 324         Georgia                       Coffee      43003   21980   21023
## 325           Texas                    Ochiltree      10642    5439    5203
## 326        Illinois                    Christian      34200   17478   16722
## 327      New Mexico                        Otero      65318   33379   31939
## 328       Wisconsin                       Oconto      37476   19149   18327
## 329        Michigan                    Missaukee      14988    7658    7330
## 330        Illinois                         Cass      13254    6772    6482
## 331       Wisconsin                       Taylor      20569   10509   10060
## 332       Minnesota                         Todd      24466   12500   11966
## 333         Georgia                     Crawford      12539    6406    6133
## 334        New York                     Sullivan      76330   38992   37338
## 335         Wyoming                      Lincoln      18316    9356    8960
## 336            Iowa                        Henry      20080   10254    9826
## 337       Louisiana                   Beauregard      36259   18513   17746
## 338           Idaho                     Boundary      10961    5596    5365
## 339         Indiana                   Tippecanoe     180952   92366   88586
## 340         Florida                   Santa Rosa     161021   82189   78832
## 341         Arizona                       La Paz      20335   10378    9957
## 342        Michigan                         Lake      11426    5831    5595
## 343  South Carolina                       Saluda      20000   10206    9794
## 344            Iowa                       Louisa      11271    5751    5520
## 345           Texas                      Andrews      16775    8557    8218
## 346           Texas                        Rains      11037    5630    5407
## 347       Louisiana                    Iberville      33229   16949   16280
## 348           Texas                     Cherokee      51167   26098   25069
## 349            Ohio                      Belmont      69560   35479   34081
## 350          Nevada                    Churchill      24252   12368   11884
## 351            Iowa                    Allamakee      14060    7169    6891
## 352         Arizona                      Cochise     129647   66100   63547
## 353            Utah                      Millard      12582    6414    6168
## 354      California              San Luis Obispo     276517  140953  135564
## 355            Utah                       Morgan      10276    5238    5038
## 356           Texas                      Bastrop      76948   39211   37737
## 357        Maryland                   Washington     149270   76058   73212
## 358       Minnesota                    Sherburne      90401   46062   44339
## 359            Iowa                       Jasper      36726   18711   18015
## 360        Colorado                    Archuleta      12174    6202    5972
## 361       Wisconsin                    Marquette      15140    7713    7427
## 362        Missouri                      Webster      36690   18691   17999
## 363    North Dakota                       Ramsey      11566    5892    5674
## 364      Washington                       Kitsap     255441  130126  125315
## 365    South Dakota                    Brookings      33046   16834   16212
## 366  North Carolina                    Granville      58109   29595   28514
## 367       Louisiana                   Evangeline      33768   17197   16571
## 368            Utah                       Summit      38521   19616   18905
## 369       Minnesota                         Rice      64886   33038   31848
## 370    Pennsylvania                   Schuylkill     146360   74521   71839
## 371   New Hampshire                         Coos      31870   16226   15644
## 372       Minnesota                       Aitkin      15839    8064    7775
## 373      Washington                      Whitman      46737   23794   22943
## 374        Michigan                     Kalkaska      17230    8770    8460
## 375      California                San Francisco     840763  427909  412854
## 376        New York                 St. Lawrence     112011   57007   55004
## 377       Minnesota                         Cass      28519   14512   14007
## 378    South Dakota                         Lake      12086    6149    5937
## 379          Kansas                       Seward      23274   11839   11435
## 380         Georgia                        Banks      18336    9327    9009
## 381         Florida                       Sumter     108501   55190   53311
## 382            Utah                       Sevier      20871   10615   10256
## 383          Kansas                       Finney      37133   18884   18249
## 384       Minnesota                     Renville      15171    7715    7456
## 385          Oregon                        Baker      16052    8163    7889
## 386          Kansas                    Jefferson      18898    9610    9288
## 387           Idaho                       Blaine      21309   10836   10473
## 388    North Dakota                       Morton      28985   14737   14248
## 389        Oklahoma                    Pittsburg      44961   22859   22102
## 390     Mississippi                        Stone      17978    9140    8838
## 391        Missouri                     Callaway      44566   22656   21910
## 392  North Carolina                       Bertie      20518   10429   10089
## 393      Washington                        Adams      19081    9698    9383
## 394         Wyoming                        Uinta      20930   10636   10294
## 395        Virginia                       Amelia      12777    6492    6285
## 396        Missouri                     McDonald      22763   11565   11198
## 397       Wisconsin                     Kewaunee      20483   10406   10077
## 398        Nebraska                       Platte      32642   16580   16062
## 399      California                         Yuba      73437   37300   36137
## 400        Michigan                      Lenawee      98902   50231   48671
## 401        Colorado                     La Plata      53182   27010   26172
## 402           Texas                     Franklin      10599    5383    5216
## 403         Georgia                       Gilmer      28673   14560   14113
## 404           Idaho                       Cassia      23369   11866   11503
## 405      New Mexico                         Eddy      55641   28251   27390
## 406       Wisconsin                     Columbia      56607   28741   27866
## 407         Montana                   Silver Bow      34549   17539   17010
## 408         Indiana                   Montgomery      38172   19378   18794
## 409            Iowa                         Lyon      11723    5951    5772
## 410         Indiana                      Pulaski      13047    6623    6424
## 411         Indiana                         Vigo     108268   54952   53316
## 412    North Dakota                     Stutsman      21076   10697   10379
## 413        Virginia                    Dickenson      15463    7848    7615
## 414       Minnesota                      Kanabec      16003    8121    7882
## 415        Illinois                         Bond      17313    8784    8529
## 416       Louisiana                    Concordia      20449   10375   10074
## 417         Florida                       Walton      59487   30176   29311
## 418        Kentucky                   Muhlenberg      31309   15881   15428
## 419        Illinois                        Logan      29956   15194   14762
## 420            Utah                     Duchesne      19817   10051    9766
## 421         Indiana                      Steuben      34267   17379   16888
## 422        Nebraska                    Box Butte      11310    5736    5574
## 423           Idaho                     Minidoka      20279   10284    9995
## 424       Minnesota                         Lake      10750    5451    5299
## 425     Mississippi                        Adams      31979   16214   15765
## 426       Wisconsin                      Buffalo      13319    6753    6566
## 427           Texas                     Brazoria     331741  168196  163545
## 428        Oklahoma                      Latimer      10774    5462    5312
## 429       Minnesota                         Pope      10948    5550    5398
## 430           Texas                       Brazos     205271  104060  101211
## 431           Idaho                   Washington      10025    5082    4943
## 432           Texas                         Lamb      13742    6966    6776
## 433           Texas                    Val Verde      48980   24828   24152
## 434        Kentucky                       Marion      19717    9994    9723
## 435         Georgia                      Liberty      64427   32654   31773
## 436         Montana                         Hill      16523    8374    8149
## 437        Arkansas                        Stone      12512    6340    6172
## 438       Minnesota                       Sibley      15021    7611    7410
## 439        Colorado                       Teller      23340   11826   11514
## 440      California                       Plumas      18966    9608    9358
## 441        Michigan                       Arenac      15424    7813    7611
## 442            Ohio                     Richland     122312   61944   60368
## 443           Idaho                     Franklin      12914    6540    6374
## 444         Montana                       Fergus      11468    5807    5661
## 445          Oregon                    Tillamook      25430   12875   12555
## 446       Wisconsin                        Clark      34518   17475   17043
## 447        Virginia                     Buchanan      23486   11889   11597
## 448        Michigan                       Oceana      26229   13277   12952
## 449      Washington                     Okanogan      41332   20922   20410
## 450         Indiana                      Spencer      20856   10557   10299
## 451       Wisconsin                    Lafayette      16835    8521    8314
## 452          Hawaii                     Honolulu     984178  498129  486049
## 453      New Mexico                    Roosevelt      19908   10076    9832
## 454         Indiana                         Knox      38062   19264   18798
## 455         Wyoming                      Natrona      80011   40495   39516
## 456       Wisconsin                       Sawyer      16483    8342    8141
## 457   West Virginia                       Taylor      16977    8592    8385
## 458            Iowa                     Marshall      40962   20730   20232
## 459        Michigan                     Mackinac      11044    5589    5455
## 460       Wisconsin                      Waupaca      52125   26378   25747
## 461       Minnesota                       Itasca      45354   22949   22405
## 462           Idaho                       Jerome      22653   11462   11191
## 463         Alabama                        Coosa      11027    5579    5448
## 464           Texas                     Caldwell      39347   19907   19440
## 465       Tennessee                      Jackson      11496    5816    5680
## 466           Texas                        Nolan      15061    7618    7443
## 467        New York                     Allegany      48070   24313   23757
## 468        Missouri                         Cole      76533   38701   37832
## 469        New York                        Lewis      27124   13715   13409
## 470        Illinois                       Morgan      35129   17761   17368
## 471        Virginia                      Augusta      74053   37440   36613
## 472       Minnesota                     Le Sueur      27707   14008   13699
## 473    Pennsylvania               Northumberland      94006   47525   46481
## 474       Minnesota                     Morrison      32962   16663   16299
## 475        Michigan                     Crawford      13895    7024    6871
## 476      California                        Glenn      28029   14168   13861
## 477       Wisconsin                       Monroe      45274   22883   22391
## 478           Idaho                    Jefferson      26792   13541   13251
## 479  North Carolina                      Caswell      23174   11711   11463
## 480        Nebraska                     Saunders      20913   10568   10345
## 481        Michigan                    Marquette      67582   34150   33432
## 482       Minnesota                         Polk      31547   15940   15607
## 483           Texas                       Gaines      18916    9556    9360
## 484         Indiana                         Owen      21192   10705   10487
## 485      Washington                        Grant      92070   46507   45563
## 486            Iowa                        Mills      14862    7507    7355
## 487         Indiana                       Newton      14057    7100    6957
## 488  North Carolina                       Craven     104450   52755   51695
## 489         Florida                     Okaloosa     192237   97092   95145
## 490         Indiana                     LaGrange      38084   19233   18851
## 491         Indiana                     Crawford      10591    5347    5244
## 492       Wisconsin                   Green Lake      18966    9574    9392
## 493  North Carolina                     McDowell      44961   22696   22265
## 494         Wyoming                     Converse      14101    7118    6983
## 495        Michigan                    Menominee      23717   11972   11745
## 496       Minnesota                       Meeker      23129   11675   11454
## 497        Michigan                      Osceola      23234   11728   11506
## 498            Utah                       Tooele      60893   30737   30156
## 499       Wisconsin                         Dunn      44159   22290   21869
## 500           Idaho                      Payette      22700   11458   11242
## 501       Minnesota                      Jackson      10211    5154    5057
## 502           Texas                        Bowie      93155   47014   46141
## 503            Utah                        Emery      10728    5414    5314
## 504      New Mexico                       Cibola      27382   13817   13565
## 505          Kansas                         Rice      10014    5053    4961
## 506      California                       Merced     263885  133152  130733
## 507         Wyoming                     Big Horn      11895    6002    5893
## 508        Nebraska                       Dawson      24069   12142   11927
## 509         Indiana                       Martin      10262    5176    5086
## 510          Oregon                       Benton      86495   43624   42871
## 511            Iowa                        Sioux      34509   17404   17105
## 512        Missouri               Ste. Genevieve      17990    9072    8918
## 513       Wisconsin                        Price      13800    6959    6841
## 514        Kentucky                        Meade      29098   14672   14426
## 515            Utah                    Box Elder      50991   25709   25282
## 516        Illinois                   Livingston      37689   19002   18687
## 517     Mississippi                       George      23104   11648   11456
## 518       Wisconsin                     Richland      17746    8946    8800
## 519        Michigan                       Lapeer      88235   44477   43758
## 520       Minnesota                      Wabasha      21381   10777   10604
## 521    North Dakota                         Cass     162500   81905   80595
## 522  North Carolina                       Warren      20468   10316   10152
## 523          Kansas                       Cowley      36079   18184   17895
## 524       Minnesota                       Isanti      38296   19300   18996
## 525       Wisconsin                       Oneida      35653   17968   17685
## 526    Pennsylvania                       Fulton      14694    7405    7289
## 527   West Virginia                        Hardy      13936    7023    6913
## 528      Washington                     Skamania      11243    5665    5578
## 529        Virginia                     Culpeper      48424   24398   24026
## 530        Virginia                        Scott      22570   11369   11201
## 531   West Virginia                      Fayette      45534   22936   22598
## 532    South Dakota                        Union      14842    7476    7366
## 533       Wisconsin                       Barron      45686   23012   22674
## 534       Tennessee                        Lewis      11944    6016    5928
## 535           Texas                       Travis    1121645  564941  556704
## 536           Texas                          Lee      16664    8393    8271
## 537       Wisconsin                  Trempealeau      29412   14813   14599
## 538        Missouri                        Lewis      10172    5123    5049
## 539          Kansas                         Reno      64058   32262   31796
## 540            Ohio                        Allen     105196   52977   52219
## 541            Iowa                     Crawford      17252    8688    8564
## 542         Georgia                   Washington      20785   10467   10318
## 543        Michigan                         Cass      51952   26159   25793
## 544        Virginia                     Stafford     137145   69055   68090
## 545       Minnesota                    Faribault      14230    7165    7065
## 546        Illinois                      De Witt      16388    8251    8137
## 547       Wisconsin                         Polk      43572   21937   21635
## 548    Pennsylvania                      Indiana      87895   44252   43643
## 549        Nebraska                        Adams      31442   15829   15613
## 550        Illinois                   Cumberland      10943    5509    5434
## 551         Georgia                       Lanier      10403    5237    5166
## 552    South Dakota                      Davison      19787    9961    9826
## 553        Illinois                       Greene      13502    6797    6705
## 554           Texas                       Blanco      10723    5398    5325
## 555            Utah                      Wasatch      26661   13421   13240
## 556         Arizona                       Mohave     203362  102371  100991
## 557    South Dakota                    Codington      27750   13969   13781
## 558   West Virginia                     McDowell      20802   10471   10331
## 559          Nevada                         Lyon      51657   26001   25656
## 560       Minnesota                      Stearns     152595   76804   75791
## 561        Michigan                       Alcona      10550    5310    5240
## 562  North Carolina                       Camden      10161    5114    5047
## 563        Missouri                       Morgan      20225   10179   10046
## 564            Iowa                       Keokuk      10291    5179    5112
## 565       Louisiana                 West Carroll      11454    5764    5690
## 566        Colorado                        Adams     471206  237107  234099
## 567            Utah                        Davis     323374  162703  160671
## 568           Idaho                      Madison      37916   19077   18839
## 569    Pennsylvania                        Perry      45677   22981   22696
## 570        Colorado                      El Paso     655024  329550  325474
## 571    Pennsylvania                  Susquehanna      42369   21316   21053
## 572     Mississippi                     Pontotoc      30517   15353   15164
## 573        New York                     Schuyler      18410    9262    9148
## 574            Utah                         Utah     551957  277687  274270
## 575        Michigan                      Mecosta      43301   21784   21517
## 576         Montana                     Missoula     111966   56328   55638
## 577           Idaho                     Shoshone      12571    6324    6247
## 578       Wisconsin                    Winnebago     169004   85019   83985
## 579       Minnesota                   Blue Earth      65125   32760   32365
## 580           Texas                       Waller      45847   23062   22785
## 581       Wisconsin                    Sheboygan     115226   57960   57266
## 582          Nevada                       Washoe     435019  218795  216224
## 583        Michigan                      Newaygo      48029   24156   23873
## 584      Washington                      Lincoln      10363    5212    5151
## 585       Minnesota                   Mille Lacs      25809   12980   12829
## 586         Georgia                       Monroe      26915   13536   13379
## 587        Colorado                         Weld     270948  136230  134718
## 588       Minnesota                       Wright     128691   64704   63987
## 589        Missouri                        Ralls      10243    5150    5093
## 590      Washington                      Stevens      43548   21895   21653
## 591         Indiana                     Jennings      28113   14134   13979
## 592          Kansas                       Nemaha      10159    5107    5052
## 593           Texas                        Ector     149557   75182   74375
## 594        Michigan                        Barry      59147   29733   29414
## 595       Minnesota                    Kandiyohi      42444   21336   21108
## 596       Minnesota                    St. Louis     200506  100791   99715
## 597         Montana                         Park      15708    7896    7812
## 598          Nevada                      Douglas      47259   23755   23504
## 599        Michigan                      Gladwin      25501   12818   12683
## 600      California                  Santa Clara    1868149  939004  929145
## 601  South Carolina                       Jasper      26549   13344   13205
## 602            Iowa                      Hancock      11092    5575    5517
## 603       Wisconsin                         Rusk      14357    7216    7141
## 604     Mississippi                      Carroll      10338    5196    5142
## 605        Kentucky                       Butler      12835    6450    6385
## 606         Alabama                    Limestone      88805   44626   44179
## 607        Colorado                      Boulder     310032  155795  154237
## 608        Illinois                   Jo Daviess      22397   11254   11143
## 609        Kentucky                    Breathitt      13591    6829    6762
## 610       Wisconsin                     Marathon     135177   67921   67256
## 611         Alabama                    St. Clair      85864   43141   42723
## 612           Texas                         Clay      10479    5265    5214
## 613         Georgia                        Bacon      11222    5638    5584
## 614        New York                     Columbia      62195   31244   30951
## 615         Montana                      Lincoln      19337    9714    9623
## 616      Washington                      Pacific      20645   10371   10274
## 617            Iowa                          Lee      35369   17767   17602
## 618         Indiana                         Pike      12687    6373    6314
## 619       Minnesota                      Redwood      15723    7898    7825
## 620      California                    San Diego    3223096 1618945 1604151
## 621        Missouri                      Clinton      20498   10296   10202
## 622       Minnesota                   Pennington      14110    7087    7023
## 623            Ohio                       Mercer      40863   20523   20340
## 624          Hawaii                         Maui     160863   80790   80073
## 625            Ohio                      Hocking      28914   14521   14393
## 626          Kansas                       Butler      66092   33192   32900
## 627            Iowa                    Chickasaw      12244    6149    6095
## 628            Utah                    Salt Lake    1078958  541831  537127
## 629    Pennsylvania                      Wyoming      28147   14134   14013
## 630      New Mexico                     Valencia      76297   38310   37987
## 631            Iowa                      Fayette      20589   10338   10251
## 632          Kansas                     Crawford      39304   19733   19571
## 633      California                     Mariposa      17789    8931    8858
## 634         Wyoming                      Laramie      95431   47911   47520
## 635      Washington                    Snohomish     746653  374847  371806
## 636        Kentucky                     Edmonson      12105    6077    6028
## 637        Illinois                         Knox      52112   26159   25953
## 638       Wisconsin                      Calumet      49678   24937   24741
## 639        Kentucky                     Magoffin      12979    6515    6464
## 640            Ohio                       Vinton      13234    6643    6591
## 641        Illinois                      Jackson      59534   29883   29651
## 642        Oklahoma                     Le Flore      49899   25046   24853
## 643        Nebraska                         Cass      25360   12729   12631
## 644      Washington                      Douglas      39599   19875   19724
## 645            Ohio                       Warren     219916  110375  109541
## 646        Missouri                       Warren      33043   16584   16459
## 647        Arkansas                 Little River      12720    6384    6336
## 648       Tennessee                      Stewart      13286    6668    6618
## 649           Texas                   Hutchinson      21858   10970   10888
## 650          Hawaii                        Kauai      69691   34971   34720
## 651           Texas                     Gonzales      20172   10122   10050
## 652           Texas                     Chambers      37251   18691   18560
## 653        Colorado                       Elbert      23855   11969   11886
## 654            Ohio                    Ashtabula      99777   50062   49715
## 655         Alabama                      Fayette      16896    8477    8419
## 656        Virginia                        Floyd      15523    7788    7735
## 657            Utah                        Weber     238682  119748  118934
## 658            Iowa                        Boone      26401   13244   13157
## 659            Ohio                   Columbiana     105987   53166   52821
## 660        Colorado                      Prowers      12235    6137    6098
## 661   West Virginia                      Braxton      14466    7256    7210
## 662       Tennessee                         Polk      16687    8370    8317
## 663        Virginia                  King George      24933   12506   12427
## 664           Texas                       Wilson      45509   22823   22686
## 665      California                         Inyo      18373    9214    9159
## 666       Wisconsin                        Vilas      21355   10709   10646
## 667       Minnesota                   Otter Tail      57511   28840   28671
## 668           Idaho                          Gem      16731    8390    8341
## 669         Arizona                       Navajo     107656   53984   53672
## 670        Illinois                      LaSalle     112579   56448   56131
## 671         Vermont                      Orleans      27146   13611   13535
## 672        Michigan                      Wexford      32751   16421   16330
## 673        Nebraska                    Lancaster     298080  149454  148626
## 674     Mississippi                    Oktibbeha      49048   24592   24456
## 675        New York                     Delaware      46901   23515   23386
## 676        Colorado                        Delta      30214   15148   15066
## 677         Georgia                      Pickens      29740   14910   14830
## 678       Wisconsin                         Sauk      62992   31578   31414
## 679      California                Santa Barbara     435850  218483  217367
## 680      New Mexico                       Sierra      11615    5822    5793
## 681   West Virginia                      Raleigh      78493   39343   39150
## 682       Wisconsin                    Marinette      41287   20693   20594
## 683        Missouri                        Barry      35726   17905   17821
## 684            Ohio                       Shelby      49067   24590   24477
## 685       Minnesota                     Fillmore      20843   10445   10398
## 686         Indiana                        Noble      47546   23826   23720
## 687          Oregon                      Yamhill     101119   50669   50450
## 688        New York                   Livingston      64801   32469   32332
## 689     Connecticut                      Tolland     151948   76134   75814
## 690          Nevada                        Clark    2035572 1019927 1015645
## 691    South Dakota                    Minnehaha     178942   89658   89284
## 692      California                         Lake      64158   32146   32012
## 693          Hawaii                       Hawaii     191482   95939   95543
## 694         Indiana                        Brown      15011    7521    7490
## 695         Indiana                     Fountain      16888    8461    8427
## 696        Arkansas                        Logan      22001   11022   10979
## 697        Virginia                       Warren      38481   19278   19203
## 698       Minnesota                       Becker      33138   16601   16537
## 699       Wisconsin                     Langlade      19551    9794    9757
## 700        Illinois                   Washington      14457    7242    7215
## 701        Virginia                      Carroll      29856   14955   14901
## 702      California                       Tulare     454033  227426  226607
## 703         Georgia                          Lee      28946   14499   14447
## 704        Oklahoma                       Custer      28978   14515   14463
## 705        Arkansas                      Madison      15702    7865    7837
## 706      New Mexico                       Colfax      12997    6510    6487
## 707        Oklahoma                       Pawnee      16499    8264    8235
## 708        Oklahoma                        Osage      48054   24068   23986
## 709        New York                       Oswego     121183   60691   60492
## 710      Washington                       Yakima     247408  123907  123501
## 711         Indiana                      Daviess      32411   16232   16179
## 712    South Dakota                      Roberts      10318    5167    5151
## 713           Texas                         Wise      61243   30667   30576
## 714           Idaho                       Bonner      41066   20563   20503
## 715           Texas                   Washington      34236   17143   17093
## 716         Montana                    Roosevelt      11072    5544    5528
## 717        Kentucky                      Spencer      17577    8801    8776
## 718         Georgia                       Barrow      72012   36057   35955
## 719       Wisconsin                      Portage      70432   35265   35167
## 720         Montana                      Cascade      82090   41098   40992
## 721            Ohio                       Holmes      43436   21746   21690
## 722        Oklahoma                      Washita      11649    5832    5817
## 723        Nebraska                   Red Willow      10946    5480    5466
## 724       Wisconsin                      Shawano      41563   20808   20755
## 725       Tennessee                       Monroe      45293   22675   22618
## 726      California                    Mendocino      87544   43827   43717
## 727            Ohio                        Perry      36025   18035   17990
## 728         Indiana                       Gibson      33668   16855   16813
## 729       Minnesota                  Koochiching      13054    6535    6519
## 730        Michigan                      Tuscola      54420   27243   27177
## 731       Minnesota                        Anoka     338764  169586  169178
## 732         Montana                       Carbon      10268    5140    5128
## 733       Minnesota                      Douglas      36620   18331   18289
## 734      Washington                     Kittitas      42204   21126   21078
## 735        Nebraska                         Hall      60792   30430   30362
## 736    South Dakota                   Pennington     106085   53101   52984
## 737            Ohio                       Putnam      34184   17110   17074
## 738         Indiana                      Madison     130280   65208   65072
## 739  North Carolina                       Stanly      60586   30324   30262
## 740       Minnesota                     Beltrami      45434   22740   22694
## 741     Mississippi                        Leake      23153   11588   11565
## 742            Ohio                       Morrow      34996   17515   17481
## 743        Missouri                    Bollinger      12356    6184    6172
## 744        New York                       Orange     375384  187873  187511
## 745    Pennsylvania                       Warren      40962   20500   20462
## 746    Pennsylvania                       Potter      17377    8696    8681
## 747            Utah                         Iron      47139   23588   23551
## 748            Iowa                    Dickinson      16967    8490    8477
## 749           Texas                      Calhoun      21666   10840   10826
## 750        Missouri                     Buchanan      89561   44809   44752
## 751            Ohio                         Pike      28396   14207   14189
## 752           Idaho                          Ada     417501  208879  208622
## 753        Kentucky                        Lewis      13790    6899    6891
## 754           Idaho                      Bingham      45407   22715   22692
## 755    South Dakota                         Clay      14011    7009    7002
## 756  North Carolina                       Pender      55166   27596   27570
## 757      Washington                        Lewis      75515   37774   37741
## 758            Iowa                      Mahaska      22396   11202   11194
## 759            Ohio                       Monroe      14547    7276    7271
## 760        Virginia                    Arlington     223945  112006  111939
## 761            Iowa                      O'Brien      14092    7048    7044
## 762         Indiana                         Cass      38476   19243   19233
## 763           Texas                       Parker     121418   60724   60694
## 764        Missouri                       Dallas      16564    8284    8280
## 765      California                     Humboldt     135034   67533   67501
## 766           Texas                 San Patricio      66070   33039   33031
## 767         Indiana                      Whitley      33330   16667   16663
## 768          Kansas                      Douglas     114967   57490   57477
## 769        Missouri                     Lawrence      38244   19124   19120
## 770        Oklahoma                   Kingfisher      15302    7651    7651
## 771          Oregon                     Columbia      49389   24694   24695
## 772       Wisconsin                         Iowa      23769   11884   11885
## 773       Wisconsin                      Ashland      15993    7996    7997
## 774       Minnesota                        Dodge      20290   10144   10146
## 775       Minnesota                      Houston      18812    9405    9407
## 776       Louisiana                    Avoyelles      41389   20692   20697
## 777           Texas                       Zapata      14308    7153    7155
## 778         Indiana                     Harrison      39230   19612   19618
## 779            Ohio                       Athens      64974   32482   32492
## 780     Connecticut                   New London     273185  136570  136615
## 781        Missouri                       Saline      23334   11665   11669
## 782      California                   San Benito      57557   28773   28784
## 783        Nebraska                        Sarpy     169192   84574   84618
## 784            Iowa                     Franklin      10489    5243    5246
## 785        Colorado                       Denver     649654  324730  324924
## 786        Michigan                   Livingston     184591   92258   92333
## 787       Tennessee                       DeKalb      19038    9515    9523
## 788           Texas                    Robertson      16532    8262    8270
## 789       Minnesota                      Hubbard      20574   10282   10292
## 790     Mississippi                       Kemper      10211    5103    5108
## 791            Ohio                      Carroll      28361   14173   14188
## 792        Illinois                      Carroll      14926    7459    7467
## 793        New York                    Schoharie      31913   15946   15967
## 794        Arkansas                   Washington     216432  108144  108288
## 795        Illinois                    Effingham      34332   17154   17178
## 796           Texas                        Young      18329    9158    9171
## 797         Indiana                       Fulton      20527   10256   10271
## 798            Ohio                       Seneca      55929   27944   27985
## 799      Washington                       Benton     184930   92396   92534
## 800            Iowa                      Clayton      17806    8896    8910
## 801      California                    El Dorado     182093   90970   91123
## 802          Kansas                     Marshall      10005    4998    5007
## 803            Iowa                     Cherokee      11853    5921    5932
## 804         Wyoming                      Fremont      40755   20358   20397
## 805        Missouri                       Howard      10182    5086    5096
## 806         Alabama                   Washington      16997    8490    8507
## 807         Indiana                      Jackson      43471   21713   21758
## 808            Iowa                       Benton      25803   12888   12915
## 809        Arkansas                     Cleburne      25711   12842   12869
## 810           Texas                      Bandera      20796   10387   10409
## 811        Kentucky                         Todd      12524    6255    6269
## 812        Michigan                       Benzie      17437    8708    8729
## 813    South Dakota                       Beadle      18168    9072    9096
## 814         Indiana                       Tipton      15573    7776    7797
## 815            Iowa                      Kossuth      15280    7629    7651
## 816           Texas                      Johnson     155450   77610   77840
## 817        Oklahoma                    Cleveland     268614  134102  134512
## 818         Indiana                       Dubois      42291   21113   21178
## 819    Pennsylvania                      Bedford      49086   24504   24582
## 820        Illinois                        Boone      53851   26882   26969
## 821       Wisconsin                      Douglas      43799   21864   21935
## 822         Alabama                       DeKalb      71068   35474   35594
## 823        Michigan                       Ogemaw      21222   10593   10629
## 824       Minnesota                        Mower      39227   19579   19648
## 825        Michigan                        Iosco      25401   12678   12723
## 826         Montana                     Flathead      93333   46583   46750
## 827        Oklahoma                        Adair      22236   11098   11138
## 828        Virginia                      Pulaski      34528   17232   17296
## 829   New Hampshire                      Carroll      47513   23712   23801
## 830         Indiana                     Franklin      22935   11446   11489
## 831       Wisconsin                     Walworth     103039   51420   51619
## 832      Washington                         King    2045756 1020901 1024855
## 833           Texas                         Bell     326041  162705  163336
## 834       Minnesota                    Crow Wing      63048   31459   31589
## 835            Iowa                    Muscatine      42913   21412   21501
## 836        Kentucky                       Hardin     107529   53651   53878
## 837        Oklahoma                      McClain      36512   18217   18295
## 838        Kentucky                      Webster      13357    6664    6693
## 839        Illinois                       Grundy      50277   25083   25194
## 840            Ohio                    Champaign      39393   19653   19740
## 841      California                       Fresno     956749  477316  479433
## 842        Missouri                      Douglas      13516    6743    6773
## 843        Illinois                    Champaign     205766  102654  103112
## 844         Alabama                     Cherokee      26008   12975   13033
## 845        Illinois                         Lake     702898  350658  352240
## 846         Georgia                         Hall     187916   93746   94170
## 847        Illinois                         Pike      16144    8053    8091
## 848    Pennsylvania                         Pike      56632   28249   28383
## 849       Minnesota                         Lyon      25699   12818   12881
## 850    Pennsylvania                      Juniata      24829   12384   12445
## 851        Michigan                 Presque Isle      13037    6502    6535
## 852            Iowa                     Delaware      17507    8731    8776
## 853            Utah                        Cache     117449   58573   58876
## 854       Louisiana                      Madison      11873    5921    5952
## 855        Maryland                   St. Mary's     109614   54662   54952
## 856          Kansas                        Osage      16080    8018    8062
## 857      California                       Tehama      63152   31489   31663
## 858        Kentucky                    Pendleton      14514    7237    7277
## 859       Wisconsin                      Lincoln      28286   14104   14182
## 860        Virginia                Manassas city      40743   20314   20429
## 861      Washington                       Chelan      74267   37027   37240
## 862         Indiana                    Hendricks     153435   76496   76939
## 863       Wisconsin                    Jefferson      84345   42050   42295
## 864        New York                       Putnam      99488   49597   49891
## 865         Indiana                     Dearborn      49679   24766   24913
## 866  South Carolina                     Berkeley     193613   96518   97095
## 867           Maine                  Piscataquis      17156    8552    8604
## 868          Kansas                        Ellis      28993   14452   14541
## 869       Wisconsin                    Outagamie     180430   89937   90493
## 870        Missouri                          Ray      23031   11480   11551
## 871       Minnesota                     Nicollet      33086   16492   16594
## 872       Minnesota                        Scott     137322   68449   68873
## 873         Indiana                    Kosciusko      77983   38870   39113
## 874      Washington                       Island      79329   39540   39789
## 875        New York                       Fulton      54606   27217   27389
## 876       Wisconsin                    St. Croix      86118   42922   43196
## 877        Missouri                       Benton      18854    9397    9457
## 878           Idaho                      Bannock      83604   41666   41938
## 879  North Carolina                      Watauga      52240   26035   26205
## 880        Arkansas                        Perry      10300    5133    5167
## 881            Iowa                      Madison      15644    7796    7848
## 882      New Mexico                         Luna      24789   12353   12436
## 883        Missouri                        Wayne      13397    6676    6721
## 884        Virginia               Prince William     437271  217901  219370
## 885        Illinois                        Union      17551    8746    8805
## 886       Tennessee                   Montgomery     185980   92677   93303
## 887            Iowa                       Wright      12936    6446    6490
## 888         Montana                       Custer      11945    5952    5993
## 889        Missouri                      Lincoln      53850   26831   27019
## 890           Texas                    Wilbarger      13158    6556    6602
## 891        Oklahoma                     Marshall      16014    7979    8035
## 892      New Jersey                    Hunterdon     126250   62901   63349
## 893         Montana                     Big Horn      13141    6547    6594
## 894        Kentucky                         Ohio      24065   11989   12076
## 895        Illinois                         Ogle      52397   26103   26294
## 896  North Carolina                    Currituck      24492   12201   12291
## 897        Missouri                      Nodaway      23186   11550   11636
## 898       Wisconsin                    Manitowoc      80521   40110   40411
## 899        Michigan                    Roscommon      24068   11989   12079
## 900        New York                      Genesee      59458   29617   29841
## 901        Michigan                      Allegan     112837   56205   56632
## 902       Wisconsin                       Vernon      30279   15082   15197
## 903        New York                       Oneida     233558  116334  117224
## 904        Kentucky                        Henry      15455    7698    7757
## 905        Illinois                      McHenry     307357  153090  154267
## 906  South Carolina                      Pickens     120124   59831   60293
## 907            Ohio                    Coshocton      36724   18291   18433
## 908        Kentucky                      Grayson      26001   12950   13051
## 909        New York                     Chenango      49549   24678   24871
## 910       Minnesota                      Goodhue      46377   23098   23279
## 911           Texas                       Austin      28886   14386   14500
## 912        Michigan                       Antrim      23267   11587   11680
## 913           Texas                         Ward      11225    5590    5635
## 914        Missouri                        Cedar      13892    6918    6974
## 915         Georgia                      Berrien      19019    9471    9548
## 916           Texas                         Hays     177562   88420   89142
## 917        Nebraska                      Buffalo      47958   23881   24077
## 918           Texas                     Burleson      17293    8611    8682
## 919        Illinois                       Shelby      22115   11012   11103
## 920        Michigan                    Cheboygan      25690   12791   12899
## 921          Oregon                   Hood River      22749   11326   11423
## 922        Kentucky                      Jackson      13357    6650    6707
## 923          Kansas                    Dickinson      19516    9716    9800
## 924         Georgia                    Effingham      54630   27196   27434
## 925         Indiana                       DeKalb      42449   21132   21317
## 926   West Virginia                        Lewis      16434    8181    8253
## 927       Tennessee                     Grainger      22736   11318   11418
## 928    North Dakota                       Barnes      11097    5524    5573
## 929        Virginia                    Charlotte      12313    6129    6184
## 930          Oregon                       Marion     323259  160907  162352
## 931           Texas                      Midland     151290   75306   75984
## 932   West Virginia                        Grant      11815    5881    5934
## 933    South Dakota                      Lincoln      49874   24825   25049
## 934   West Virginia                       Upshur      24560   12223   12337
## 935      California                       Solano     425753  211881  213872
## 936        Colorado                      Larimer     318227  158367  159860
## 937        Oklahoma                      Lincoln      34504   17171   17333
## 938            Ohio                       Morgan      14913    7421    7492
## 939            Ohio                     Williams      37386   18604   18782
## 940        Illinois                         Kane     524886  261189  263697
## 941        New York                      Orleans      42204   21001   21203
## 942      California                    Riverside    2298032 1143477 1154555
## 943        Michigan                        Mason      28711   14286   14425
## 944      New Mexico                       Chaves      65811   32746   33065
## 945       Tennessee                        Union      19096    9501    9595
## 946          Kansas                      Jackson      13400    6667    6733
## 947           Texas                       Upshur      40096   19949   20147
## 948            Ohio                       Hardin      31736   15789   15947
## 949      California                         Napa     140295   69798   70497
## 950  North Carolina                      Madison      21027   10461   10566
## 951            Utah                     San Juan      15152    7538    7614
## 952        Oklahoma                       Murray      13733    6832    6901
## 953        Michigan                    Hillsdale      46178   22972   23206
## 954      California               San Bernardino    2094769 1042053 1052716
## 955         Georgia                       Harris      32776   16304   16472
## 956            Iowa                    Appanoose      12669    6302    6367
## 957          Oregon                      Klamath      65972   32815   33157
## 958         Indiana                       Monroe     142404   70830   71574
## 959           Texas                       Harris    4356362 2166727 2189635
## 960            Iowa                      Johnson     139436   69350   70086
## 961         Indiana                  Bartholomew      79488   39534   39954
## 962      California                  San Joaquin     708554  352400  356154
## 963        New York                       Ulster     181300   90169   91131
## 964         Georgia                      Baldwin      45795   22776   23019
## 965         Georgia                    Whitfield     103456   51451   52005
## 966       Louisiana                       Sabine      24248   12059   12189
## 967            Iowa                   Winneshiek      20884   10386   10498
## 968        Oklahoma                       Rogers      89190   44355   44835
## 969        New York                     Dutchess     296928  147663  149265
## 970            Iowa                      Jackson      19572    9733    9839
## 971           Texas                     Atascosa      47050   23397   23653
## 972        Missouri                        Perry      19100    9498    9602
## 973            Ohio                     Paulding      19165    9530    9635
## 974         Georgia                       Putnam      21247   10565   10682
## 975         Arizona                         Gila      53165   26436   26729
## 976      California                     Siskiyou      43895   21825   22070
## 977       Minnesota                       Benton      39221   19500   19721
## 978        Missouri                    Jefferson     221577  110157  111420
## 979            Ohio                    Fairfield     149112   74128   74984
## 980      California                       Sutter      95247   47349   47898
## 981            Iowa                       Marion      33248   16527   16721
## 982         Indiana                        Posey      25567   12708   12859
## 983        Kentucky                        Boyle      29388   14607   14781
## 984       Minnesota                        Brown      25391   12620   12771
## 985         Vermont                     Lamoille      25027   12439   12588
## 986        Colorado                    Jefferson     552344  274525  277819
## 987        Missouri                     Franklin     101828   50610   51218
## 988    Pennsylvania                    Armstrong      67979   33786   34193
## 989         Indiana                      Carroll      20014    9947   10067
## 990         Montana              Lewis and Clark      65357   32482   32875
## 991        Illinois                     Richland      16127    8015    8112
## 992        Oklahoma                     Okmulgee      39446   19604   19842
## 993       Wisconsin                       Pierce      40799   20275   20524
## 994       Louisiana                  St. Bernard      42858   21298   21560
## 995  North Carolina                        Burke      89548   44499   45049
## 996    Pennsylvania                    Jefferson      44756   22240   22516
## 997            Ohio                     Defiance      38669   19215   19454
## 998         Georgia                         Ware      35723   17751   17972
## 999         Georgia                     Colquitt      46024   22869   23155
## 1000        Indiana                     Marshall      46962   23335   23627
## 1001       Oklahoma                     Garfield      62192   30902   31290
## 1002        Georgia                      Forsyth     196236   97505   98731
## 1003         Nevada                          Nye      42625   21179   21446
## 1004        Vermont                      Addison      36943   18355   18588
## 1005        Florida                     Escambia     306327  152196  154131
## 1006     New Jersey                       Hudson     662619  329204  333415
## 1007       Illinois                      Kendall     120036   59632   60404
## 1008    Mississippi                     Itawamba      23451   11650   11801
## 1009       Michigan                        Huron      32290   16041   16249
## 1010       Colorado                       Morgan      28359   14088   14271
## 1011      Wisconsin                     Washburn      15700    7799    7901
## 1012       Oklahoma                        Mayes      41007   20370   20637
## 1013        Georgia                   Oglethorpe      14688    7296    7392
## 1014       Virginia                    Frederick      81340   40404   40936
## 1015       Maryland                        Cecil     101960   50644   51316
## 1016        Indiana                       Greene      32815   16298   16517
## 1017     Washington                       Pierce     821952  408209  413743
## 1018     New Mexico                      Lincoln      19931    9898   10033
## 1019        Vermont                    Caledonia      31012   15400   15612
## 1020   North Dakota                     Burleigh      88223   43809   44414
## 1021     California                   Santa Cruz     269278  133714  135564
## 1022        Florida                     Suwannee      43595   21647   21948
## 1023  West Virginia                        Boone      24000   11917   12083
## 1024       Arkansas                       Marion      16458    8172    8286
## 1025           Iowa                     Buchanan      20998   10426   10572
## 1026          Texas                        Milam      24344   12087   12257
## 1027       Illinois                         Will     683995  339609  344386
## 1028       Arkansas                      Jackson      17597    8737    8860
## 1029   Pennsylvania                        Tioga      42284   20994   21290
## 1030       Kentucky                         Bath      11978    5947    6031
## 1031        Indiana                   Washington      27930   13867   14063
## 1032       Illinois                    Vermilion      80368   39901   40467
## 1033       Illinois                   Williamson      67121   33323   33798
## 1034       Missouri                   Montgomery      11939    5927    6012
## 1035        Indiana                        White      24388   12107   12281
## 1036       Kentucky                       Monroe      10765    5344    5421
## 1037          Texas                  San Jacinto      27023   13414   13609
## 1038           Iowa                       Butler      14966    7429    7537
## 1039           Ohio                       Ottawa      41162   20432   20730
## 1040 North Carolina                    Alleghany      10911    5416    5495
## 1041       Kentucky                        Knott      16000    7942    8058
## 1042     New Mexico                   San Miguel      28668   14230   14438
## 1043     Washington                      Clallam      72397   35935   36462
## 1044  West Virginia                      Lincoln      21560   10701   10859
## 1045       Virginia                Prince Edward      23022   11426   11596
## 1046       Illinois                       Mercer      16107    7994    8113
## 1047        Georgia                       Gordon      55889   27738   28151
## 1048 North Carolina                     Franklin      62296   30917   31379
## 1049       Colorado                      Douglas     306974  152339  154635
## 1050      Minnesota                       McLeod      36046   17888   18158
## 1051       Michigan                     Muskegon     171483   85094   86389
## 1052       Illinois                        Henry      49883   24752   25131
## 1053       Michigan                   St. Joseph      61022   30279   30743
## 1054       Virginia                    Dinwiddie      28110   13947   14163
## 1055       Virginia                     Tazewell      43870   21766   22104
## 1056           Iowa                       Hardin      17393    8629    8764
## 1057       Kentucky                        Grant      24670   12239   12431
## 1058           Iowa                     Mitchell      10762    5339    5423
## 1059       Colorado                         Mesa     147834   73340   74494
## 1060       Michigan                      Sanilac      42014   20843   21171
## 1061       Michigan                    St. Clair     160429   79585   80844
## 1062          Texas                        Brown      37833   18768   19065
## 1063         Kansas                    McPherson      29252   14511   14741
## 1064     New Jersey                       Sussex     145930   72391   73539
## 1065      Tennessee                    Humphreys      18240    9048    9192
## 1066          Texas                      Runnels      10445    5181    5264
## 1067        Indiana                       Jasper      33448   16591   16857
## 1068        Vermont                       Orange      28929   14349   14580
## 1069    Mississippi                     Harrison     196268   97348   98920
## 1070      Tennessee                     Cheatham      39422   19553   19869
## 1071      Wisconsin                        Brown     254717  126333  128384
## 1072        Montana                      Ravalli      40823   20247   20576
## 1073        Indiana                       Orange      19725    9783    9942
## 1074       Kentucky                        Wayne      20655   10244   10411
## 1075      Wisconsin                        Green      37044   18372   18672
## 1076       New York                      Chemung      88267   43776   44491
## 1077        Georgia                       Dawson      22673   11244   11429
## 1078   Pennsylvania                          Elk      31370   15557   15813
## 1079          Texas                       Burnet      44144   21891   22253
## 1080       Arkansas                       Sevier      17268    8563    8705
## 1081       Kentucky                      Clinton      10188    5052    5136
## 1082       New York                   Chautauqua     132646   65776   66870
## 1083  West Virginia                      Jackson      29256   14506   14750
## 1084       Oklahoma                     Canadian     126193   62570   63623
## 1085           Iowa                        Floyd      16050    7958    8092
## 1086           Ohio                     Auglaize      45873   22745   23128
## 1087       Oklahoma                        Creek      70761   35084   35677
## 1088       Kentucky                 Breckinridge      20061    9945   10116
## 1089        Indiana                        Boone      60511   29997   30514
## 1090          Idaho                       Canyon     198921   98609  100312
## 1091          Texas                       Zavala      12060    5978    6082
## 1092        Vermont                     Franklin      48418   24000   24418
## 1093      Minnesota                       Wadena      13759    6820    6939
## 1094     New Mexico                     San Juan     125133   62024   63109
## 1095      Wisconsin                   Washington     132921   65880   67041
## 1096           Iowa                     Harrison      14467    7170    7297
## 1097         Oregon                        Wasco      25492   12634   12858
## 1098       Oklahoma                        Grady      53612   26570   27042
## 1099      Wisconsin                         Dane     510198  252850  257348
## 1100     Washington                      Whatcom     207100  102637  104463
## 1101         Oregon                      Clatsop      37382   18526   18856
## 1102  New Hampshire                 Hillsborough     403972  200201  203771
## 1103       Nebraska                       Custer      10802    5353    5449
## 1104       Kentucky                         Boyd      48917   24241   24676
## 1105    Connecticut                      Windham     117470   58212   59258
## 1106 North Carolina                      Lincoln      79578   39434   40144
## 1107       Virginia                      Loudoun     351129  173986  177143
## 1108       Missouri                       Andrew      17328    8586    8742
## 1109        Alabama                         Dale      49866   24708   25158
## 1110       New York                      Steuben      98665   48887   49778
## 1111          Maine                         Knox      39723   19682   20041
## 1112       Colorado                   Broomfield      60699   30075   30624
## 1113     Washington                      Spokane     480832  238241  242591
## 1114          Texas                       Orange      83217   41232   41985
## 1115           Ohio                       Scioto      78017   38655   39362
## 1116      Louisiana                    Ascension     114738   56848   57890
## 1117          Idaho                   Bonneville     107788   53404   54384
## 1118         Kansas                 Pottawatomie      22625   11209   11416
## 1119           Iowa                         Iowa      16344    8097    8247
## 1120       Illinois                         Ford      13835    6854    6981
## 1121  West Virginia                      Wyoming      22866   11328   11538
## 1122          Maine                       Oxford      57421   28446   28975
## 1123         Kansas                     Sedgwick     506529  250929  255600
## 1124          Texas                         Wood      42712   21159   21553
## 1125       Michigan                    Dickinson      26012   12886   13126
## 1126        Georgia                      Bulloch      72386   35859   36527
## 1127       Missouri                    Gasconade      14948    7405    7543
## 1128       Nebraska                         Holt      10398    5151    5247
## 1129       Illinois                        Wayne      16555    8201    8354
## 1130       Illinois                        Piatt      16495    8171    8324
## 1131     Washington                       Skagit     119343   59117   60226
## 1132        Georgia                       Pierce      18934    9379    9555
## 1133      Tennessee                        Smith      19136    9479    9657
## 1134        Georgia                        Grady      25243   12504   12739
## 1135       Arkansas                       Fulton      12224    6055    6169
## 1136      Minnesota                     Freeborn      30897   15304   15593
## 1137     California                      Ventura     840833  416484  424349
## 1138       Colorado                    Montezuma      25700   12729   12971
## 1139       Kentucky                     Lawrence      15821    7836    7985
## 1140         Kansas                       Marion      12290    6087    6203
## 1141       Missouri                       Newton      58777   29111   29666
## 1142          Texas                   Montgomery     502586  248919  253667
## 1143          Texas                      Aransas      24292   12031   12261
## 1144       Maryland                      Garrett      29813   14765   15048
## 1145           Ohio                       Preble      41682   20643   21039
## 1146  Massachusetts                        Dukes      17048    8443    8605
## 1147        Georgia                      Jackson      61420   30418   31002
## 1148        Georgia                        Heard      11617    5753    5864
## 1149       Virginia                    Botetourt      33155   16419   16736
## 1150       Illinois                     Woodford      39106   19366   19740
## 1151        Indiana                         Rush      16991    8414    8577
## 1152       Nebraska                      Madison      35111   17387   17724
## 1153          Idaho                    Nez Perce      39779   19697   20082
## 1154           Iowa                      Guthrie      10740    5318    5422
## 1155           Ohio                        Wayne     115371   57126   58245
## 1156   Pennsylvania                       Snyder      40046   19828   20218
## 1157       Illinois                        Mason      14126    6994    7132
## 1158   South Dakota                     Lawrence      24645   12202   12443
## 1159      Minnesota                       Carver      95715   47387   48328
## 1160          Texas                    Matagorda      36598   18119   18479
## 1161       Arkansas                       Greene      43382   21477   21905
## 1162       Virginia                   Rockbridge      22444   11111   11333
## 1163      Tennessee                       Unicoi      18069    8945    9124
## 1164        Alabama                      Cullman      80965   40081   40884
## 1165          Texas                   Palo Pinto      27921   13822   14099
## 1166       New York                  Cattaraugus      78962   39089   39873
## 1167       Oklahoma                      Haskell      12850    6361    6489
## 1168        Georgia                     Brantley      18452    9134    9318
## 1169        Florida                          Bay     175353   86800   88553
## 1170     California                   Stanislaus     527367  261045  266322
## 1171       Missouri                       Ripley      13990    6925    7065
## 1172       Michigan                        Delta      36712   18172   18540
## 1173          Texas                      Coryell      76128   37681   38447
## 1174       Nebraska                         Otoe      15842    7841    8001
## 1175       Arkansas                      Johnson      25930   12834   13096
## 1176    Mississippi                        Amite      12840    6355    6485
## 1177      Wisconsin                       Racine     194895   96458   98437
## 1178       Virginia                      Fairfax    1128722  558606  570116
## 1179       Maryland                 Anne Arundel     555280  274800  280480
## 1180        Indiana                      Decatur      26240   12985   13255
## 1181           Ohio                     Delaware     185433   91762   93671
## 1182       Michigan                    Van Buren      75351   37287   38064
## 1183      Tennessee                       Warren      40015   19801   20214
## 1184       Arkansas                         Yell      21835   10804   11031
## 1185       Oklahoma                          Kay      45587   22556   23031
## 1186       New York                        Wayne      92416   45726   46690
## 1187      Louisiana                  Plaquemines      23599   11676   11923
## 1188        Indiana                       Morgan      69403   34338   35065
## 1189     California                        Butte     222564  110115  112449
## 1190       Virginia                   Washington      54759   27092   27667
## 1191        Georgia                         Hart      25498   12615   12883
## 1192          Texas                        Comal     119632   59187   60445
## 1193        Alabama                       Coffee      50884   25174   25710
## 1194       Virginia                    Middlesex      10717    5302    5415
## 1195       Arkansas                         Pope      62830   31083   31747
## 1196         Kansas                       Sumner      23638   11694   11944
## 1197       Illinois                     Macoupin      46844   23174   23670
## 1198       Kentucky                        Logan      26851   13283   13568
## 1199        Florida                       Martin     151586   74988   76598
## 1200      Louisiana              Jefferson Davis      31434   15550   15884
## 1201       Nebraska                         Gage      21818   10793   11025
## 1202       Colorado                   Rio Grande      11745    5810    5935
## 1203      Minnesota                     Watonwan      11054    5468    5586
## 1204 North Carolina                     Scotland      35932   17774   18158
## 1205          Texas                    Tom Green     115056   56913   58143
## 1206        Indiana                      Warrick      60995   30171   30824
## 1207      Tennessee                       Tipton      61674   30506   31168
## 1208     California                    Calaveras      44767   22143   22624
## 1209      Tennessee                         Rhea      32394   16022   16372
## 1210       Oklahoma                      Jackson      26056   12887   13169
## 1211  West Virginia                      Ritchie      10140    5015    5125
## 1212          Idaho                   Twin Falls      80004   39568   40436
## 1213       Missouri                         Iron      10322    5105    5217
## 1214          Texas                    Galveston     308163  152405  155758
## 1215       Oklahoma                     Sequoyah      41464   20506   20958
## 1216     Washington                        Clark     444506  219826  224680
## 1217       Virginia                     Caroline      29349   14513   14836
## 1218        Arizona                       Apache      72124   35663   36461
## 1219       Arkansas                       Benton     238198  117781  120417
## 1220         Kansas                      Bourbon      14812    7324    7488
## 1221           Ohio                      Clinton      41892   20714   21178
## 1222      Tennessee                       Grundy      13524    6687    6837
## 1223           Ohio                        Brown      44247   21878   22369
## 1224          Idaho                     Kootenai     145046   71718   73328
## 1225       Virginia              Winchester city      27168   13433   13735
## 1226      Wisconsin                         Wood      74012   36594   37418
## 1227  New Hampshire                   Rockingham     299006  147837  151169
## 1228           Ohio                        Adams      28229   13957   14272
## 1229        Georgia                       Walker      68285   33760   34525
## 1230        Georgia                   Jeff Davis      14990    7411    7579
## 1231       Kentucky                      Bullitt      76961   38049   38912
## 1232       Oklahoma                        Logan      44493   21995   22498
## 1233  West Virginia                       Monroe      13525    6686    6839
## 1234   Pennsylvania                        Adams     101767   50307   51460
## 1235 North Carolina                        Union     213422  105500  107922
## 1236       Maryland                 Queen Anne's      48600   24024   24576
## 1237   Pennsylvania                         York     439660  217323  222337
## 1238        Arizona                     Maricopa    4018143 1986158 2031985
## 1239      Minnesota                   Washington     246670  121924  124746
## 1240       Kentucky                       Nelson      44564   22027   22537
## 1241      Louisiana                   Terrebonne     112742   55725   57017
## 1242       Virginia                    Goochland      21721   10736   10985
## 1243         Oregon                         Linn     118971   58803   60168
## 1244           Iowa                      Dubuque      95906   47401   48505
## 1245         Kansas                       Saline      55735   27546   28189
## 1246       Missouri                      Laclede      35514   17552   17962
## 1247       Michigan                         Iron      11507    5687    5820
## 1248           Utah                   Washington     148244   73265   74979
## 1249       Maryland                      Carroll     167444   82754   84690
## 1250       Oklahoma                     Delaware      41409   20465   20944
## 1251       Virginia                     Fauquier      67463   33341   34122
## 1252      Wisconsin                      Kenosha     167738   82897   84841
## 1253          Texas                         Hunt      88052   43515   44537
## 1254        Indiana                      Elkhart     200685   99175  101510
## 1255          Texas                      Lubbock     290782  143698  147084
## 1256       Nebraska                         York      13825    6832    6993
## 1257       Arkansas                    Van Buren      17002    8402    8600
## 1258   South Dakota                        Butte      10292    5086    5206
## 1259        Arizona                     Coconino     136701   67553   69148
## 1260         Oregon                    Multnomah     768418  379725  388693
## 1261        Wyoming                     Sheridan      29738   14695   15043
## 1262          Maine                    Aroostook      70005   34592   35413
## 1263           Iowa                         Tama      17479    8637    8842
## 1264        Indiana                      Fayette      23773   11747   12026
## 1265       Virginia                   Gloucester      37001   18283   18718
## 1266       Kentucky                       Carter      27326   13502   13824
## 1267        Indiana                     Lawrence      45814   22637   23177
## 1268       Oklahoma                     Cherokee      48097   23764   24333
## 1269     California                       Orange    3116069 1539600 1576469
## 1270    Mississippi                  Pearl River      55196   27271   27925
## 1271 North Carolina                        Rowan     138361   68359   70002
## 1272        Alabama                       Blount      57710   28512   29198
## 1273  West Virginia                        Roane      14636    7231    7405
## 1274         Kansas                    Wyandotte     160806   79445   81361
## 1275       Michigan                   Shiawassee      69113   34144   34969
## 1276          Texas                         Hood      53171   26267   26904
## 1277        Florida                       Nassau      75880   37485   38395
## 1278         Oregon                        Crook      20956   10352   10604
## 1279       Missouri                    Lafayette      32916   16260   16656
## 1280          Texas                         Leon      16819    8308    8511
## 1281      Minnesota                         Clay      60879   30072   30807
## 1282       Kentucky                         Hart      18441    9109    9332
## 1283 North Carolina                        Gates      11724    5791    5933
## 1284     New Mexico                        Grant      29119   14383   14736
## 1285        Georgia                       Bartow     101336   50052   51284
## 1286       New York                        Tioga      50199   24794   25405
## 1287 North Carolina                     Randolph     142370   70316   72054
## 1288          Texas                        Ellis     157058   77570   79488
## 1289           Utah                       Carbon      20927   10335   10592
## 1290      Minnesota                       Martin      20350   10050   10300
## 1291       Kentucky                        Perry      28041   13848   14193
## 1292       Arkansas                      Bradley      11206    5534    5672
## 1293       Michigan                    Washtenaw     354092  174855  179237
## 1294 North Carolina                         Ashe      27114   13389   13725
## 1295         Oregon                      Douglas     107194   52932   54262
## 1296        Alabama                       Marion      30387   15004   15383
## 1297       Maryland                      Calvert      90114   44495   45619
## 1298 North Carolina                     Hertford      24368   12032   12336
## 1299       Michigan                        Emmet      33018   16303   16715
## 1300       Kentucky                        Boone     124617   61530   63087
## 1301       Michigan                        Clare      30710   15163   15547
## 1302           Ohio                     Sandusky      60187   29717   30470
## 1303       Kentucky                        Adair      18852    9308    9544
## 1304     California                       Nevada      98570   48668   49902
## 1305          Texas                    Guadalupe     143460   70832   72628
## 1306       Oklahoma                     McIntosh      20280   10013   10267
## 1307       Michigan                   Charlevoix      26134   12903   13231
## 1308       Kentucky                       Mercer      21342   10537   10805
## 1309           Ohio                     Clermont     200285   98883  101402
## 1310       New York                   Rensselaer     159900   78944   80956
## 1311       Oklahoma                       Nowata      10555    5211    5344
## 1312       Illinois                       DeKalb     104345   51515   52830
## 1313        Georgia                      Madison      28232   13938   14294
## 1314        Indiana                        Adams      34642   17102   17540
## 1315       Nebraska                      Lincoln      35896   17721   18175
## 1316      Louisiana                      Bossier     123403   60921   62482
## 1317       Illinois                     Tazewell     135697   66990   68707
## 1318           Iowa                     Plymouth      24853   12269   12584
## 1319          Texas                      Fayette      24849   12267   12582
## 1320     Washington                    Jefferson      30083   14850   15233
## 1321     Washington                    Klickitat      20820   10277   10543
## 1322       Kentucky                      Fleming      14544    7179    7365
## 1323       Illinois                      Hancock      18738    9249    9489
## 1324       Kentucky                       Leslie      10997    5428    5569
## 1325           Ohio                     Highland      43170   21308   21862
## 1326       Kentucky                      Bourbon      20013    9878   10135
## 1327           Ohio                       Fulton      42485   20969   21516
## 1328      Tennessee                   Rutherford     282558  139456  143102
## 1329       Missouri                     Crawford      24660   12170   12490
## 1330          Texas                        Cooke      38761   19129   19632
## 1331           Ohio                        Huron      58937   29086   29851
## 1332        Vermont                   Washington      59132   29182   29950
## 1333       Michigan                      Clinton      76905   37952   38953
## 1334  West Virginia                      Mineral      27755   13696   14059
## 1335       New York                     Herkimer      64034   31598   32436
## 1336          Texas                       Hardin      55375   27325   28050
## 1337          Maine                    Penobscot     153437   75714   77723
## 1338          Texas                      Hopkins      35645   17589   18056
## 1339   Pennsylvania                       Butler     185689   91628   94061
## 1340       Arkansas                       Miller      43652   21540   22112
## 1341        Florida                         Levy      39821   19649   20172
## 1342      Tennessee                      Bedford      45986   22691   23295
## 1343 North Carolina                         Hoke      51075   25202   25873
## 1344     Washington                      Cowlitz     102338   50495   51843
## 1345       Michigan                       Otsego      24141   11911   12230
## 1346  West Virginia                       Morgan      17475    8622    8853
## 1347        Alabama                      Chilton      43819   21619   22200
## 1348         Kansas                     Cherokee      20952   10337   10615
## 1349        Indiana                    Blackford      12476    6155    6321
## 1350 South Carolina                       Oconee      74949   36975   37974
## 1351       Kentucky                       Kenton     163007   80417   82590
## 1352        Alabama                      Jackson      52860   26076   26784
## 1353       Missouri                       Barton      12166    6001    6165
## 1354          Texas                       Dallas    2485003 1225722 1259281
## 1355   Pennsylvania                       Monroe     167881   82806   85075
## 1356           Ohio                      Fayette      28769   14190   14579
## 1357          Texas                    Jim Wells      41461   20450   21011
## 1358       Illinois                     Marshall      12173    6004    6169
## 1359           Iowa                       Bremer      24539   12103   12436
## 1360   Pennsylvania                   Cumberland     241427  119074  122353
## 1361        Indiana                        Wells      27796   13709   14087
## 1362       New York                     Saratoga     223774  110362  113412
## 1363   Pennsylvania                      Cambria     139381   68740   70641
## 1364       Maryland                    Frederick     241373  119038  122335
## 1365          Texas                   Deaf Smith      19245    9491    9754
## 1366           Ohio                   Tuscarawas      92697   45715   46982
## 1367      Louisiana                        Union      22533   11112   11421
## 1368           Ohio                        Henry      28015   13815   14200
## 1369          Maine                   Washington      32191   15874   16317
## 1370           Iowa                         Linn     216640  106827  109813
## 1371           Iowa                      Wapello      35315   17413   17902
## 1372 South Carolina                     Beaufort     171420   84523   86897
## 1373       Oklahoma                      Wagoner      75391   37173   38218
## 1374         Kansas                      Labette      21048   10378   10670
## 1375       Missouri                       Platte      93394   46047   47347
## 1376  Massachusetts                    Worcester     810935  399807  411128
## 1377 North Carolina                       Wilkes      68946   33990   34956
## 1378           Ohio                       Medina     174831   86190   88641
## 1379      Louisiana                     St. Mary      53441   26345   27096
## 1380      Tennessee                    Jefferson      52490   25876   26614
## 1381         Kansas                   Montgomery      34184   16851   17333
## 1382      Tennessee                      Fayette      38814   19133   19681
## 1383 North Carolina                      Harnett     124320   61281   63039
## 1384   Pennsylvania                       Carbon      64634   31860   32774
## 1385        Indiana                       Starke      23117   11395   11722
## 1386           Iowa                Pottawattamie      93213   45947   47266
## 1387          Texas                        Titus      32553   16046   16507
## 1388       Virginia                       Louisa      33986   16752   17234
## 1389         Oregon                    Deschutes     166622   82129   84493
## 1390       Kentucky                        Scott      50178   24733   25445
## 1391          Texas                     Colorado      20757   10231   10526
## 1392          Texas                      Navarro      48118   23717   24401
## 1393      Louisiana             West Baton Rouge      24669   12159   12510
## 1394           Ohio                      Wyandot      22467   11073   11394
## 1395          Texas                     Maverick      56548   27870   28678
## 1396       Kentucky                        Floyd      38649   19048   19601
## 1397       Virginia                      Grayson      15573    7675    7898
## 1398       Virginia                      Patrick      18264    9001    9263
## 1399  West Virginia                     Nicholas      25930   12779   13151
## 1400        Florida                       Putnam      72696   35825   36871
## 1401           Ohio                     Harrison      15633    7704    7929
## 1402   Pennsylvania                     Bradford      62228   30666   31562
## 1403        Alabama                         Clay      13537    6671    6866
## 1404          Texas                      Liberty      77486   38184   39302
## 1405   Pennsylvania                         Erie     279858  137907  141951
## 1406        Indiana                       Shelby      44441   21899   22542
## 1407  New Hampshire                      Grafton      89341   44024   45317
## 1408 North Carolina                     Columbus      57230   28199   29031
## 1409       Arkansas                         Clay      15400    7588    7812
## 1410       Arkansas                       Lonoke      70691   34831   35860
## 1411       Illinois                     Franklin      39694   19558   20136
## 1412          Texas                     Montague      19478    9597    9881
## 1413           Iowa                   Washington      22017   10847   11170
## 1414     California                  Los Angeles   10038388 4945351 5093037
## 1415 North Carolina                       Yadkin      37971   18706   19265
## 1416          Texas                       Bosque      17971    8853    9118
## 1417          Texas                      Hockley      23322   11489   11833
## 1418       Missouri                       Camden      43927   21638   22289
## 1419      Tennessee                       Benton      16261    8010    8251
## 1420        Vermont                      Windham      43858   21603   22255
## 1421          Texas                    Van Zandt      52736   25975   26761
## 1422       Missouri                       Wright      18449    9087    9362
## 1423           Ohio                        Logan      45484   22402   23082
## 1424  West Virginia                        Logan      35760   17612   18148
## 1425           Ohio                       Lorain     303152  149303  153849
## 1426       Illinois                         Clay      13582    6689    6893
## 1427 North Carolina                       Duplin      59453   29280   30173
## 1428       Michigan                       Ottawa     273136  134515  138621
## 1429       Kentucky                        Casey      15954    7857    8097
## 1430       Nebraska                   Washington      20257    9976   10281
## 1431      Tennessee                        Macon      22761   11209   11552
## 1432    Mississippi                     Marshall      36385   17918   18467
## 1433       Arkansas                        White      78660   38736   39924
## 1434          Maine                     Somerset      51577   25399   26178
## 1435         Kansas                        Miami      32688   16097   16591
## 1436    Connecticut                   Litchfield     186304   91740   94564
## 1437       Michigan               Grand Traverse      89907   44272   45635
## 1438       Colorado                       Pueblo     161519   79535   81984
## 1439 North Carolina                      Iredell     165066   81281   83785
## 1440      Minnesota                     Hennepin    1197776  589781  607995
## 1441        Alabama                     Crenshaw      13938    6863    7075
## 1442         Kansas                        Allen      13081    6441    6640
## 1443       Arkansas                     Franklin      17866    8797    9069
## 1444 South Carolina                    Abbeville      24997   12308   12689
## 1445       Kentucky                      Simpson      17704    8717    8987
## 1446       New York                      Suffolk    1501373  739210  762163
## 1447      Minnesota                     Chippewa      12154    5984    6170
## 1448       Michigan                       Monroe     150436   74065   76371
## 1449           Ohio                        Meigs      23473   11556   11917
## 1450        Indiana                      Clinton      32835   16165   16670
## 1451  West Virginia                     Marshall      32480   15990   16490
## 1452       Oklahoma                       Garvin      27455   13516   13939
## 1453      Tennessee                     Marshall      31159   15339   15820
## 1454       New York                     Tompkins     103855   51125   52730
## 1455       New York                   Montgomery      49779   24504   25275
## 1456     New Mexico                     Doña Ana     213963  105321  108642
## 1457       Michigan                     Leelanau      21772   10717   11055
## 1458       Nebraska                      Douglas     537655  264652  273003
## 1459      Tennessee                      Overton      22100   10878   11222
## 1460       Nebraska                     Cheyenne      10077    4960    5117
## 1461       New York                      Madison      72427   35647   36780
## 1462         Kansas                     Franklin      25753   12675   13078
## 1463        Alabama                      Winston      24130   11876   12254
## 1464       Michigan                          Bay     106698   52513   54185
## 1465         Oregon                   Washington     556210  273743  282467
## 1466   Pennsylvania                      Mifflin      46675   22971   23704
## 1467  West Virginia                     Berkeley     108724   53507   55217
## 1468       Illinois                  Rock Island     147161   72422   74739
## 1469  New Hampshire                    Merrimack     147262   72470   74792
## 1470         Oregon                    Clackamas     389438  191639  197799
## 1471        Indiana                      Johnson     145645   71670   73975
## 1472          Texas                      Wharton      41264   20305   20959
## 1473       Kentucky                   Montgomery      27167   13368   13799
## 1474       Illinois                    Whiteside      57525   28306   29219
## 1475          Texas                       Denton     731851  360112  371739
## 1476        Florida                      Collier     341091  167836  173255
## 1477     California                    San Mateo     748731  368416  380315
## 1478        Alabama                     Marshall      94318   46409   47909
## 1479       Missouri                        Bates      16643    8189    8454
## 1480        Indiana                       Porter     166570   81958   84612
## 1481           Iowa                     Woodbury     102530   50448   52082
## 1482        Georgia                     Cherokee     225944  111170  114774
## 1483        Florida                      Osceola     300870  148030  152840
## 1484     New Jersey                   Burlington     450556  221673  228883
## 1485       Kentucky                     Marshall      31181   15341   15840
## 1486   Pennsylvania                      Luzerne     320095  157486  162609
## 1487           Iowa                    Winnebago      10614    5222    5392
## 1488      Tennessee                       Cannon      13787    6783    7004
## 1489   Pennsylvania                       Mercer     115320   56734   58586
## 1490       Kentucky                        Allen      20355   10014   10341
## 1491      Louisiana                   St. Martin      53126   26136   26990
## 1492        Alabama                          Lee     150982   74277   76705
## 1493          Texas                       Nueces     352060  173195  178865
## 1494         Oregon                        Union      25745   12664   13081
## 1495   Pennsylvania                      Fayette     134851   66332   68519
## 1496        Montana                         Lake      29157   14342   14815
## 1497    Mississippi                       Tippah      22054   10848   11206
## 1498      Tennessee                       Putnam      73810   36303   37507
## 1499           Iowa                      Clinton      48365   23788   24577
## 1500     New Jersey                    Middlesex     830300  408374  421926
## 1501      Wisconsin                         Door      27731   13639   14092
## 1502        Arizona                         Pima     998537  491108  507429
## 1503          Texas                        Erath      40039   19692   20347
## 1504 North Carolina                         Dare      34863   17146   17717
## 1505           Iowa                         Clay      16537    8133    8404
## 1506           Iowa                         Polk     452369  222477  229892
## 1507    Mississippi                      Jackson     140676   69185   71491
## 1508          Texas                         Hill      34923   17174   17749
## 1509          Texas                        Bexar    1825502  897690  927812
## 1510       Missouri                       Miller      24956   12272   12684
## 1511           Ohio                       Geauga      93874   46161   47713
## 1512           Iowa                          Sac      10101    4967    5134
## 1513      Minnesota                       Winona      51213   25183   26030
## 1514       Arkansas                         Drew      18740    9215    9525
## 1515           Ohio                        Miami     103517   50902   52615
## 1516       Arkansas                    Jefferson      73548   36165   37383
## 1517       Virginia                       Clarke      14299    7031    7268
## 1518       Kentucky                       Estill      14476    7118    7358
## 1519   Pennsylvania                      Chester     509797  250671  259126
## 1520       Colorado                     Arapahoe     608310  299103  309207
## 1521       Virginia                        Wythe      29190   14352   14838
## 1522       Michigan                      Midland      83624   41115   42509
## 1523      Wisconsin                   Eau Claire     101281   49791   51490
## 1524       Arkansas                        Grant      18054    8875    9179
## 1525 North Carolina                      Sampson      63873   31398   32475
## 1526      Tennessee                     Lawrence      42226   20757   21469
## 1527       Illinois                     Kankakee     112221   55164   57057
## 1528        Indiana                       Ripley      28612   14064   14548
## 1529        Georgia                       Brooks      15637    7686    7951
## 1530      Tennessee                   Sequatchie      14592    7172    7420
## 1531       New York                       Broome     198093   97363  100730
## 1532           Ohio                     Guernsey      39626   19475   20151
## 1533          Maine                        Waldo      38976   19155   19821
## 1534        Indiana                   Huntington      36863   18116   18747
## 1535          Texas                   Williamson     473592  232740  240852
## 1536         Oregon                         Lane     357060  175470  181590
## 1537 North Carolina                     Caldwell      81758   40178   41580
## 1538       Virginia                         York      66471   32665   33806
## 1539       Kentucky                      Letcher      23671   11632   12039
## 1540 North Carolina                      Jackson      40812   20055   20757
## 1541        Florida                       Orange    1229039  603946  625093
## 1542        Georgia                      Lowndes     113203   55627   57576
## 1543 North Carolina                        Wayne     124355   61106   63249
## 1544     New Mexico                     Sandoval     136638   67141   69497
## 1545       Virginia          Virginia Beach city     448290  220275  228015
## 1546           Ohio                         Wood     128885   63329   65556
## 1547       Virginia                 Spotsylvania     127691   62742   64949
## 1548          Texas                    Fort Bend     658331  323468  334863
## 1549        Indiana                          Jay      21255   10443   10812
## 1550        Indiana                        Scott      23783   11685   12098
## 1551          Maine                     Franklin      30402   14937   15465
## 1552 North Carolina                     Johnston     178396   87648   90748
## 1553       Virginia                      Bedford      76463   37567   38896
## 1554   Pennsylvania                      Venango      53906   26484   27422
## 1555       Missouri                  St. Charles     374805  184139  190666
## 1556        Alabama                       Geneva      26815   13174   13641
## 1557      Louisiana                    Lafourche      97474   47888   49586
## 1558      Louisiana                   Washington      46556   22872   23684
## 1559       Kentucky                      Fayette     308306  151459  156847
## 1560           Iowa                       Grundy      12407    6095    6312
## 1561      Louisiana                   Livingston     133949   65803   68146
## 1562     New Mexico                   Rio Arriba      39949   19625   20324
## 1563       Arkansas                       Saline     113833   55920   57913
## 1564       Arkansas                       Chicot      11353    5577    5776
## 1565      Tennessee                        Meigs      11716    5755    5961
## 1566       Michigan                         Kent     622590  305818  316772
## 1567       Kentucky                       Powell      12447    6114    6333
## 1568        Indiana                       Wabash      32358   15894   16464
## 1569       Kentucky                        Larue      14149    6949    7200
## 1570      Wisconsin                     Waukesha     393873  193441  200432
## 1571       Michigan                       Alpena      29068   14276   14792
## 1572   Pennsylvania                     Lycoming     116656   57292   59364
## 1573       Colorado                        Otero      18572    9121    9451
## 1574   North Dakota                      Rolette      14498    7120    7378
## 1575    Mississippi                     Lawrence      12586    6181    6405
## 1576       Kentucky                   Washington      11910    5849    6061
## 1577         Oregon                         Coos      62775   30828   31947
## 1578        Montana                  Yellowstone     153692   75474   78218
## 1579  New Hampshire                     Sullivan      43135   21182   21953
## 1580  West Virginia                    Jefferson      55214   27113   28101
## 1581          Texas                      Randall     126782   62256   64526
## 1582   Pennsylvania                        Berks     413965  203262  210703
## 1583       Virginia                         Page      23843   11707   12136
## 1584        Florida                         Clay     197417   96931  100486
## 1585        Alabama                       Morgan     119786   58814   60972
## 1586      Wisconsin                         Rock     160727   78915   81812
## 1587       Virginia                    Alleghany      16066    7888    8178
## 1588      Minnesota                       Dakota     408456  200534  207922
## 1589     California                       Shasta     178942   87851   91091
## 1590        Wyoming                         Park      28985   14230   14755
## 1591        Georgia                     Muscogee     200285   98321  101964
## 1592          Texas                       Collin     862215  423260  438955
## 1593       Kentucky                         Owen      10711    5258    5453
## 1594       Oklahoma                       Ottawa      32085   15750   16335
## 1595  West Virginia                       Wetzel      16157    7931    8226
## 1596        Indiana                   Vermillion      15860    7785    8075
## 1597        Georgia                      Appling      18417    9040    9377
## 1598       Missouri                       Oregon      10979    5389    5590
## 1599       Missouri                       Pettis      42215   20721   21494
## 1600           Ohio                       Greene     164192   80590   83602
## 1601           Iowa                       Shelby      11992    5886    6106
## 1602           Iowa                        Scott     169994   83437   86557
## 1603           Ohio                   Washington      61351   30112   31239
## 1604       Kentucky                      Garrard      16976    8332    8644
## 1605        Georgia                      Lumpkin      30921   15176   15745
## 1606        Indiana                        Clark     113181   55549   57632
## 1607      Wisconsin                      Ozaukee      87273   42831   44442
## 1608           Ohio                        Darke      52356   25694   26662
## 1609          Texas                     Harrison      66417   32594   33823
## 1610 North Carolina                     Carteret      68228   33481   34747
## 1611      Tennessee                    Robertson      67426   33086   34340
## 1612       Kentucky                   Rockcastle      16942    8313    8629
## 1613        Georgia                        Rabun      16266    7981    8285
## 1614   Rhode Island                      Newport      82663   40559   42104
## 1615       Virginia                 Fairfax city      23402   11482   11920
## 1616       New York                      Ontario     109192   53574   55618
## 1617           Iowa                       Warren      47542   23326   24216
## 1618      Tennessee                      Dickson      50472   24762   25710
## 1619       Oklahoma                    McCurtain      33143   16260   16883
## 1620       New York                       Warren      65180   31977   33203
## 1621       Virginia                 Pittsylvania      62794   30806   31988
## 1622       Illinois                       Monroe      33539   16452   17087
## 1623       Colorado                     Montrose      40815   20021   20794
## 1624        Vermont                   Bennington      36589   17948   18641
## 1625    Mississippi                     Walthall      14978    7347    7631
## 1626     New Mexico                   Bernalillo     673943  330578  343365
## 1627      Minnesota                       Steele      36523   17915   18608
## 1628           Iowa                        Cedar      18375    9013    9362
## 1629       Oklahoma                     Johnston      11022    5406    5616
## 1630         Kansas                       Neosho      16423    8055    8368
## 1631        Vermont                      Rutland      60530   29688   30842
## 1632       Maryland                       Howard     304115  149157  154958
## 1633     California                       Sonoma     495078  242817  252261
## 1634          Texas                     Rockwall      85536   41950   43586
## 1635        Georgia                     Gwinnett     859234  421395  437839
## 1636       Arkansas                     Randolph      17695    8678    9017
## 1637       Illinois                      Douglas      19826    9723   10103
## 1638       Illinois                       DuPage     930412  456274  474138
## 1639       Kentucky                     Campbell      91475   44858   46617
## 1640 North Carolina                   Pasquotank      40018   19623   20395
## 1641       New York                     Rockland     320688  157250  163438
## 1642 South Carolina                    Clarendon      34178   16759   17419
## 1643       Illinois                       Bureau      34115   16728   17387
## 1644       Virginia                   Rockingham      77785   38141   39644
## 1645          Texas                      Kaufman     109289   53588   55701
## 1646       Virginia                      Hanover     101340   49687   51653
## 1647       Virginia                     Franklin      56315   27611   28704
## 1648   Pennsylvania                     Franklin     152285   74663   77622
## 1649     Washington                     Thurston     262723  128796  133927
## 1650    Mississippi                      Hancock      45627   22367   23260
## 1651       Arkansas                     Crawford      61748   30269   31479
## 1652       Oklahoma                   Pushmataha      11226    5503    5723
## 1653 North Carolina                      Catawba     154610   75784   78826
## 1654   Pennsylvania                        Bucks     626583  307126  319457
## 1655       Virginia                        Giles      16907    8287    8620
## 1656           Iowa                         Cass      13590    6661    6929
## 1657        Georgia                        Union      21725   10648   11077
## 1658           Iowa                       Dallas      74892   36706   38186
## 1659        Georgia                         Pike      17812    8730    9082
## 1660        Indiana                         Clay      26686   13079   13607
## 1661       Arkansas                       Conway      21110   10346   10764
## 1662       Arkansas                     Faulkner     119343   58488   60855
## 1663 North Carolina                     Davidson     163867   80308   83559
## 1664       Missouri                         Dent      15612    7651    7961
## 1665       Missouri                        Henry      22034   10798   11236
## 1666  West Virginia                       Marion      56790   27830   28960
## 1667       Kentucky                       Laurel      59751   29281   30470
## 1668          Texas                     Victoria      90099   44152   45947
## 1669     California                      Alameda    1584983  776699  808284
## 1670          Texas                     Lampasas      20219    9908   10311
## 1671 North Carolina                     Mitchell      15330    7512    7818
## 1672       Missouri                         Cass     100781   49384   51397
## 1673       Colorado                      Alamosa      16269    7972    8297
## 1674      Tennessee                    Claiborne      31748   15556   16192
## 1675        Florida                          Lee     663675  325158  338517
## 1676      Tennessee                       Sevier      93617   45866   47751
## 1677       Arkansas                    Sebastian     127273   62355   64918
## 1678     New Jersey                       Morris     498192  244074  254118
## 1679        Florida                         Polk     626676  307017  319659
## 1680          Texas                     Eastland      18328    8979    9349
## 1681      Tennessee                       Wilson     122445   59985   62460
## 1682 North Carolina                        Surry      73170   35843   37327
## 1683          Texas                     Comanche      13623    6673    6950
## 1684       Kentucky                        Trigg      14250    6980    7270
## 1685      Tennessee                     Fentress      17931    8783    9148
## 1686   Pennsylvania                  Northampton     299616  146756  152860
## 1687        Georgia                       Jasper      13593    6658    6935
## 1688 South Carolina                    Lancaster      81263   39803   41460
## 1689           Ohio                       Butler     372538  182470  190068
## 1690       Kentucky                        Clark      35657   17464   18193
## 1691       Arkansas                      Carroll      27635   13535   14100
## 1692       Kentucky                        Mason      17296    8471    8825
## 1693          Texas                        Gregg     123178   60328   62850
## 1694          Maine                 Androscoggin     107393   52597   54796
## 1695      Tennessee                      Lincoln      33550   16431   17119
## 1696           Ohio                      Hancock      75428   36940   38488
## 1697      Wisconsin                  Fond du Lac     101920   49914   52006
## 1698       Virginia                Isle of Wight      35740   17503   18237
## 1699 North Carolina                     Cherokee      27092   13267   13825
## 1700        Georgia                     Franklin      22110   10827   11283
## 1701      Tennessee                       Marion      28306   13861   14445
## 1702  West Virginia                       Putnam      56596   27714   28882
## 1703        Georgia                         Polk      41215   20182   21033
## 1704        Alabama                     Randolph      22648   11090   11558
## 1705        Georgia                      Emanuel      22731   11130   11601
## 1706       Missouri                         Polk      31107   15231   15876
## 1707  New Hampshire                      Belknap      60399   29573   30826
## 1708           Ohio                         Knox      61004   29869   31135
## 1709 North Carolina                          Lee      59418   29090   30328
## 1710       Illinois                       Marion      38665   18929   19736
## 1711         Oregon                    Josephine      83409   40834   42575
## 1712          Texas                      Tarrant    1914526  937266  977260
## 1713       Kentucky                    Jessamine      50328   24638   25690
## 1714        Vermont                      Windsor      56150   27488   28662
## 1715    Connecticut                    Middlesex     165165   80854   84311
## 1716       Kentucky                      Pulaski      63635   31151   32484
## 1717       Virginia                Poquoson city      12077    5912    6165
## 1718           Iowa                  Cerro Gordo      43481   21284   22197
## 1719           Ohio                      Jackson      32854   16082   16772
## 1720       Missouri                        Stone      31375   15358   16017
## 1721     California                        Marin     258349  126460  131889
## 1722      Tennessee                        Roane      53162   26021   27141
## 1723      Tennessee                        Scott      22043   10789   11254
## 1724        Georgia                      Screven      14206    6953    7253
## 1725 North Carolina                       Stokes      46661   22837   23824
## 1726           Ohio                      Licking     168693   82562   86131
## 1727 North Carolina                      Chatham      67431   33001   34430
## 1728       Missouri                         Clay     230361  112738  117623
## 1729           Ohio                     Van Wert      28576   13984   14592
## 1730        Indiana                      Hancock      71328   34905   36423
## 1731          Texas                       Uvalde      26952   13189   13763
## 1732     New Jersey                       Mercer     370212  181158  189054
## 1733        Arizona                      Yavapai     215996  105693  110303
## 1734        Georgia                       Murray      39401   19280   20121
## 1735       Missouri                      Madison      12401    6068    6333
## 1736      Tennessee                        White      26252   12845   13407
## 1737       Kentucky                        Green      11149    5455    5694
## 1738         Kansas                      Johnson     566814  277330  289484
## 1739   Pennsylvania                    Lancaster     530216  259423  270793
## 1740      Minnesota                      Olmsted     148736   72773   75963
## 1741       Oklahoma                     Oklahoma     754480  369112  385368
## 1742     California                   Sacramento    1465832  717117  748715
## 1743  West Virginia                        Mingo      25931   12686   13245
## 1744       Kentucky                      Russell      17669    8644    9025
## 1745       Maryland                      Harford     248966  121798  127168
## 1746      Wisconsin                    La Crosse     117048   57261   59787
## 1747      Tennessee                       Carter      56941   27856   29085
## 1748           Ohio                      Ashland      53189   26020   27169
## 1749      Tennessee                        Giles      28949   14161   14788
## 1750       Virginia                       Orange      34596   16922   17674
## 1751      Tennessee                      Hawkins      56595   27682   28913
## 1752       Illinois                     Moultrie      14927    7301    7626
## 1753    Mississippi                     Prentiss      25380   12413   12967
## 1754        Georgia                     Paulding     147400   72089   75311
## 1755       Illinois                    McDonough      32009   15654   16355
## 1756        Alabama                      Madison     346438  169422  177016
## 1757    Mississippi                       Alcorn      37319   18250   19069
## 1758       Michigan                    Kalamazoo     256752  125558  131194
## 1759       Illinois                      Madison     267356  130741  136615
## 1760       Kentucky                         Pike      63434   31019   32415
## 1761           Ohio                         Erie      76141   37232   38909
## 1762       Oklahoma                        Noble      11506    5626    5880
## 1763       Oklahoma                     Muskogee      70224   34335   35889
## 1764       Nebraska                        Dodge      36725   17956   18769
## 1765  West Virginia                   Greenbrier      35666   17438   18228
## 1766        Alabama                     Lawrence      33586   16421   17165
## 1767   Pennsylvania                      Lebanon     135776   66380   69396
## 1768        Alabama                     Cleburne      15002    7334    7668
## 1769        Georgia                       Oconee      34400   16817   17583
## 1770       Arkansas                         Pike      11087    5420    5667
## 1771       Arkansas                 Independence      36952   18064   18888
## 1772          Texas                     Callahan      13532    6615    6917
## 1773        Georgia                        Bryan      33151   16205   16946
## 1774       Virginia                        Smyth      31734   15512   16222
## 1775       Virginia                 Westmoreland      17557    8582    8975
## 1776         Kansas                       Harvey      34835   17026   17809
## 1777       Illinois                       Jersey      22625   11058   11567
## 1778          Texas                       Taylor     134435   65704   68731
## 1779          Texas                     Angelina      87748   42886   44862
## 1780   Pennsylvania                     Crawford      87343   42688   44655
## 1781      Minnesota                   Cottonwood      11632    5685    5947
## 1782       Kentucky                      Whitley      35794   17493   18301
## 1783       Michigan                      Berrien     155565   76026   79539
## 1784        Florida                      Brevard     553591  270544  283047
## 1785       Virginia                      Russell      28245   13803   14442
## 1786          Texas                      El Paso     831095  406117  424978
## 1787        Georgia                       Coweta     133416   65194   68222
## 1788       Oklahoma                     Pontotoc      38055   18595   19460
## 1789       Kentucky                       Harlan      28400   13877   14523
## 1790           Ohio                      Portage     161897   79107   82790
## 1791   Pennsylvania                   Washington     208226  101743  106483
## 1792       Illinois                        Edgar      17992    8791    9201
## 1793       Oklahoma                     Seminole      25481   12450   13031
## 1794  Massachusetts                     Franklin      71144   34759   36385
## 1795 South Carolina                 Chesterfield      46192   22565   23627
## 1796       Michigan                      Calhoun     134790   65845   68945
## 1797           Ohio                     Lawrence      61827   30202   31625
## 1798 North Carolina                        Jones      10166    4966    5200
## 1799        Alabama                      Baldwin     195121   95314   99807
## 1800       Illinois                     Iroquois      29053   14192   14861
## 1801          Texas                    Henderson      79016   38598   40418
## 1802       Kentucky                      Johnson      23350   11406   11944
## 1803        Indiana                     Hamilton     296635  144898  151737
## 1804       Virginia              Chesapeake city     230601  112641  117960
## 1805   South Dakota                        Brown      38060   18591   19469
## 1806       Missouri                       Jasper     117184   57240   59944
## 1807 South Carolina                      Chester      32556   15902   16654
## 1808 South Carolina                    Lexington     273843  133755  140088
## 1809          Texas                       Shelby      25725   12565   13160
## 1810          Maine                      Hancock      54658   26696   27962
## 1811        Georgia                     Haralson      28565   13951   14614
## 1812           Iowa                   Black Hawk     132496   64707   67789
## 1813           Ohio                         Lake     229437  112045  117392
## 1814       Illinois                       Menard      12611    6158    6453
## 1815       Virginia                 King William      16097    7860    8237
## 1816        Vermont                   Chittenden     159711   77980   81731
## 1817       Kentucky                     Harrison      18648    9105    9543
## 1818      Louisiana                  St. Charles      52639   25699   26940
## 1819      Tennessee                       Sumner     169623   82811   86812
## 1820       Arkansas                     Poinsett      24210   11819   12391
## 1821      Louisiana                    Lafayette     231811  113159  118652
## 1822        Georgia                       Greene      16331    7972    8359
## 1823       Illinois                    Winnebago     290439  141776  148663
## 1824          Maine                      Lincoln      34156   16673   17483
## 1825           Iowa                     Hamilton      15297    7467    7830
## 1826  West Virginia                        Wayne      41499   20257   21242
## 1827  West Virginia                     Harrison      68998   33680   35318
## 1828        Georgia                         Long      16588    8097    8491
## 1829      Tennessee                        Cocke      35321   17241   18080
## 1830        Florida                    Highlands      98328   47994   50334
## 1831    Mississippi                      Simpson      27401   13373   14028
## 1832        Georgia                      Catoosa      65375   31905   33470
## 1833 North Carolina                     Cabarrus     188375   91929   96446
## 1834       Missouri                        Scott      39061   19062   19999
## 1835        Montana                      Glacier      13672    6672    7000
## 1836          Texas                      Grayson     122780   59917   62863
## 1837      Tennessee                       Greene      68576   33463   35113
## 1838     California                       Placer     366280  178729  187551
## 1839           Iowa                      Carroll      20629   10066   10563
## 1840        Georgia                      Decatur      27378   13359   14019
## 1841     New Mexico                         Taos      32943   16073   16870
## 1842       Illinois                       Wabash      11652    5685    5967
## 1843        Georgia                       Walton      86201   42055   44146
## 1844      Louisiana                    Calcasieu     195887   95567  100320
## 1845        Florida                 Hillsborough    1302884  635632  667252
## 1846        Indiana                     Randolph      25596   12487   13109
## 1847    Mississippi                    Yalobusha      12381    6040    6341
## 1848   Pennsylvania                 Westmoreland     361251  176227  185024
## 1849        Indiana                        Allen     363453  177301  186152
## 1850          Texas                         Camp      12516    6105    6411
## 1851  New Hampshire                    Strafford     125273   61104   64169
## 1852     California                 Contra Costa    1096068  534618  561450
## 1853       New York                     Cortland      49043   23921   25122
## 1854 North Carolina                     Richmond      46046   22459   23587
## 1855     New Jersey                     Somerset     330604  161252  169352
## 1856       Virginia                       Greene      18938    9237    9701
## 1857           Iowa                   Des Moines      40208   19611   20597
## 1858       Michigan                        Eaton     108341   52842   55499
## 1859      Tennessee                   Cumberland      57455   28022   29433
## 1860       Missouri                        Macon      15460    7540    7920
## 1861   Pennsylvania                        Blair     126448   61668   64780
## 1862        Florida                    St. Lucie     288006  140448  147558
## 1863          Texas                    Gillespie      25398   12385   13013
## 1864        Georgia                     Columbia     136204   66418   69786
## 1865      Tennessee                   Williamson     199456   97258  102198
## 1866       Kentucky                     Franklin      49778   24271   25507
## 1867 North Carolina                   Cumberland     324603  158270  166333
## 1868       Arkansas                     Lawrence      17029    8303    8726
## 1869       Arkansas                         Polk      20364    9929   10435
## 1870          Texas                       Sabine      10440    5090    5350
## 1871 South Carolina                     Newberry      37690   18375   19315
## 1872  West Virginia                       Brooke      23665   11537   12128
## 1873           Ohio                       Gallia      30565   14900   15665
## 1874        Georgia                     McIntosh      14007    6828    7179
## 1875  Massachusetts                    Middlesex    1556116  758542  797574
## 1876       Kentucky                       Warren     118950   57981   60969
## 1877          Texas                     McLennan     241505  117708  123797
## 1878        Alabama                       Elmore      80763   39362   41401
## 1879        Georgia                        Jones      28738   14006   14732
## 1880          Texas                      Jackson      14486    7060    7426
## 1881       Oklahoma                        Tulsa     623335  303789  319546
## 1882      Tennessee                      Carroll      28353   13818   14535
## 1883        Florida                      Volusia     503719  245481  258238
## 1884          Texas                      Hidalgo     819217  399234  419983
## 1885           Ohio                     Trumbull     206373  100568  105805
## 1886      Tennessee                     Franklin      41138   20046   21092
## 1887        Georgia                      Carroll     112595   54865   57730
## 1888       Arkansas                        Boone      37227   18138   19089
## 1889      Tennessee                       Hardin      25900   12619   13281
## 1890      Tennessee                      Hamblen      62999   30693   32306
## 1891       New York                        Yates      25187   12271   12916
## 1892 North Carolina                        Davie      41447   20192   21255
## 1893           Ohio                     Franklin    1215761  592278  623483
## 1894      Louisiana                       Iberia      73938   36017   37921
## 1895       Illinois                        Adams      67081   32676   34405
## 1896  Massachusetts                     Plymouth     503681  245337  258344
## 1897        Alabama                       Shelby     203530   99134  104396
## 1898       Kentucky                     Anderson      21761   10599   11162
## 1899     New Mexico                     Santa Fe     147108   71649   75459
## 1900   Pennsylvania                      Clarion      39454   19216   20238
## 1901     New Jersey                        Union     548744  267263  281481
## 1902       Missouri                       Greene     283206  137929  145277
## 1903           Ohio                     Crawford      42725   20808   21917
## 1904    Mississippi                    Lafayette      51169   24919   26250
## 1905       Kentucky                       Graves      37502   18263   19239
## 1906         Oregon                        Curry      22338   10878   11460
## 1907          Texas                         Webb     263251  128182  135069
## 1908    Mississippi                         Tate      28415   13835   14580
## 1909        Florida                    Charlotte     165783   80718   85065
## 1910       Kentucky                         Knox      31809   15487   16322
## 1911     California                         Yolo     207320  100937  106383
## 1912    Connecticut                    Fairfield     939983  457634  482349
## 1913    Mississippi                       Marion      26109   12711   13398
## 1914       Illinois                       McLean     173114   84278   88836
## 1915   Pennsylvania                       Lehigh     356756  173680  183076
## 1916       Kentucky                       Barren      42925   20895   22030
## 1917       Arkansas                        Sharp      17055    8302    8753
## 1918       Arkansas                    Craighead     101409   49362   52047
## 1919        Georgia                        White      27791   13527   14264
## 1920       Oklahoma                     Stephens      44806   21808   22998
## 1921       Virginia                     Accomack      33115   16117   16998
## 1922  West Virginia                       Cabell      96824   47123   49701
## 1923     New Jersey                      Passaic     507574  247026  260548
## 1924    Mississippi                        Wayne      20564   10008   10556
## 1925     New Jersey                     Monmouth     629185  306200  322985
## 1926      Tennessee                      Decatur      11686    5687    5999
## 1927  West Virginia                      Barbour      16731    8142    8589
## 1928 North Carolina                         Wake     976019  474967  501052
## 1929         Oregon                      Jackson     208363  101395  106968
## 1930       Arkansas                       Howard      13555    6596    6959
## 1931      Tennessee                      Bradley     102062   49664   52398
## 1932       Oklahoma                        Bryan      44003   21412   22591
## 1933     New Jersey                     Cape May      95805   46617   49188
## 1934         Oregon                      Lincoln      46347   22551   23796
## 1935       New York                       Otsego      61399   29873   31526
## 1936      Tennessee                       Loudon      50229   24438   25791
## 1937      Tennessee                      Weakley      34415   16743   17672
## 1938   Pennsylvania                      Clinton      39614   19271   20343
## 1939      Tennessee                       Coffee      53448   26000   27448
## 1940       New York                      Niagara     214150  104173  109977
## 1941 South Carolina                   Dorchester     145715   70881   74834
## 1942      Tennessee                   Washington     125317   60955   64362
## 1943 South Carolina                 Williamsburg      33238   16166   17072
## 1944       Michigan                       Macomb     854689  415691  438998
## 1945        Alabama                      Pickens      19856    9657   10199
## 1946    Mississippi                        Union      27811   13525   14286
## 1947 North Carolina                       Yancey      17604    8561    9043
## 1948        Georgia                         Dade      16445    7997    8448
## 1949      Louisiana                     Franklin      20550    9993   10557
## 1950   South Dakota                       Hughes      17466    8493    8973
## 1951       Arkansas                  Mississippi      44864   21815   23049
## 1952 South Carolina                        Horry     290730  141358  149372
## 1953        Alabama                      Russell      58302   28347   29955
## 1954           Ohio                     Mahoning     234550  114040  120510
## 1955     New Jersey                       Warren     107226   52134   55092
## 1956        Florida                    St. Johns     210495  102338  108157
## 1957 North Carolina                       Person      39262   19088   20174
## 1958 North Carolina                    Brunswick     115926   56359   59567
## 1959       Delaware                       Sussex     207302  100782  106520
## 1960          Maine                         York     199682   97075  102607
## 1961  New Hampshire                     Cheshire      76430   37155   39275
## 1962      Louisiana         St. John the Baptist      44161   21467   22694
## 1963    Mississippi                        Lamar      58885   28624   30261
## 1964      Tennessee                         Knox     444348  215996  228352
## 1965    Mississippi                        Perry      12187    5924    6263
## 1966     New Jersey                        Salem      65120   31653   33467
## 1967      Louisiana                   Assumption      23057   11206   11851
## 1968          Texas                      Trinity      14405    7001    7404
## 1969 South Carolina                      Laurens      66389   32264   34125
## 1970         Kansas                         Lyon      33462   16262   17200
## 1971        Georgia                       Fulton     983903  478161  505742
## 1972 South Carolina                   Greenville     474903  230780  244123
## 1973       Missouri                     Stoddard      29837   14499   15338
## 1974          Maine                   Cumberland     286119  139036  147083
## 1975      Minnesota                       Ramsey     527411  256288  271123
## 1976      Tennessee                      McNairy      26103   12684   13419
## 1977           Ohio                    Muskingum      86016   41796   44220
## 1978      Tennessee                       McMinn      52506   25513   26993
## 1979       Illinois                       Warren      17701    8601    9100
## 1980       Kentucky                         Bell      27950   13581   14369
## 1981       Kentucky                      Hopkins      46518   22603   23915
## 1982     New Jersey                   Gloucester     290298  141053  149245
## 1983       New York                  Schenectady     154796   75210   79586
## 1984      Louisiana                      Lincoln      47349   23005   24344
## 1985    Mississippi                       Jasper      16588    8058    8530
## 1986   Pennsylvania                   Montgomery     812970  394888  418082
## 1987        Georgia                       Fannin      23742   11531   12211
## 1988         Kansas                       Barton      27399   13307   14092
## 1989      Louisiana                   St. Helena      10818    5254    5564
## 1990       Illinois                       Saline      24783   12036   12747
## 1991       Michigan                      Oakland    1229503  597097  632406
## 1992       Virginia                  Northampton      12184    5917    6267
## 1993       Nebraska                       Dakota      20798   10100   10698
## 1994       Missouri                    Christian      80904   39288   41616
## 1995        Alabama                       Walker      65923   32013   33910
## 1996       Maryland                    Worcester      51519   25018   26501
## 1997    Mississippi                   Lauderdale      79832   38764   41068
## 1998 South Carolina                     Richland     397899  193206  204693
## 1999      Louisiana                      Webster      40617   19722   20895
## 2000      Louisiana                  St. Tammany     242960  117970  124990
## 2001        Florida                        Pasco     479288  232719  246569
## 2002    Mississippi                       DeSoto     168586   81857   86729
## 2003 South Carolina                  Spartanburg     291240  141396  149844
## 2004 North Carolina                   Rutherford      66865   32462   34403
## 2005       Kentucky                      Greenup      36477   17709   18768
## 2006        Florida                      Broward    1843152  894820  948332
## 2007      Louisiana                    Jefferson     435092  211216  223876
## 2008       Missouri                        Taney      53555   25998   27557
## 2009       Michigan                       Ingham     283491  137614  145877
## 2010          Texas                    Red River      12567    6100    6467
## 2011          Texas                       Dimmit      10682    5185    5497
## 2012         Kansas                      Shawnee     178792   86783   92009
## 2013          Texas                        Lamar      49566   24057   25509
## 2014 North Carolina                 Transylvania      32928   15981   16947
## 2015          Texas                         Cass      30328   14717   15611
## 2016       Virginia                       Nelson      14858    7210    7648
## 2017       Kentucky                       Taylor      24993   12128   12865
## 2018 North Carolina                   Montgomery      27601   13393   14208
## 2019      Louisiana                   Tangipahoa     125486   60890   64596
## 2020        Georgia                      Fayette     108655   52720   55935
## 2021      Tennessee                        Henry      32269   15657   16612
## 2022       Kentucky                      Madison      85838   41648   44190
## 2023          Maine                     Kennebec     121112   58759   62353
## 2024 South Carolina                     Cherokee      55863   27102   28761
## 2025        Indiana                        Wayne      67866   32924   34942
## 2026 North Carolina                        Macon      33919   16455   17464
## 2027        Florida                   Miami-Dade    2639042 1280221 1358821
## 2028           Ohio                        Stark     374979  181905  193074
## 2029       Virginia                   Shenandoah      42724   20725   21999
## 2030       Virginia                   Appomattox      15208    7377    7831
## 2031      Tennessee                     Anderson      75430   36585   38845
## 2032           Ohio                        Lucas     436261  211587  224674
## 2033        Alabama                    Talladega      81437   39494   41943
## 2034       Michigan                     Isabella      70669   34271   36398
## 2035        Georgia                      Houston     147570   71562   76008
## 2036   Rhode Island                   Washington     126405   61286   65119
## 2037       Maryland                     Caroline      32661   15835   16826
## 2038       New York                       Nassau    1354612  656738  697874
## 2039        Indiana                        Floyd      75900   36794   39106
## 2040 South Carolina                      Kershaw      62722   30405   32317
## 2041       Kentucky                      Daviess      98173   47590   50583
## 2042       New York                       Queens    2301139 1115459 1185680
## 2043          Texas                        Starr      62648   30368   32280
## 2044    Connecticut                     Hartford     896943  434784  462159
## 2045        Florida                        Duval     890673  431732  458941
## 2046       Michigan                      Saginaw     196479   95216  101263
## 2047           Ohio                    Jefferson      68053   32976   35077
## 2048       Illinois                         Cook    5236393 2537245 2699148
## 2049       Virginia                     Campbell      55012   26655   28357
## 2050       Virginia                   James City      70673   34241   36432
## 2051    Mississippi                      Winston      18697    9058    9639
## 2052          Texas                       Jasper      35768   17328   18440
## 2053 South Carolina                   Charleston     372904  180654  192250
## 2054       Illinois                       Peoria     187112   90646   96466
## 2055       Illinois                        Clark      16159    7828    8331
## 2056    Mississippi                        Scott      28293   13706   14587
## 2057        Indiana                   St. Joseph     267246  129457  137789
## 2058       Oklahoma                       Carter      48442   23465   24977
## 2059           Ohio                       Summit     541847  262467  279380
## 2060       Oklahoma                      Choctaw      15120    7324    7796
## 2061 North Carolina                       Gaston     209807  101627  108180
## 2062       Illinois                        White      14464    7006    7458
## 2063 North Carolina                    Henderson     109719   53143   56576
## 2064        Alabama                      Autauga      55221   26745   28476
## 2065   Pennsylvania                       Beaver     169785   82231   87554
## 2066      Louisiana                     Richland      20777   10062   10715
## 2067        Georgia                      Candler      11031    5342    5689
## 2068      Tennessee                     Campbell      40176   19455   20721
## 2069        Georgia                         Cobb     719133  348219  370914
## 2070        Georgia                        Floyd      96169   46567   49602
## 2071  Massachusetts                      Bristol     552763  267650  285113
## 2072      Tennessee                       Blount     125188   60612   64576
## 2073   Rhode Island                   Providence     630459  305247  325212
## 2074      Tennessee                     Sullivan     156752   75894   80858
## 2075       Virginia            Newport News city     181323   87790   93533
## 2076        Georgia                     Richmond     201291   97446  103845
## 2077      Tennessee                         Dyer      38054   18422   19632
## 2078       Missouri               Cape Girardeau      77606   37569   40037
## 2079        Georgia                        Peach      27086   13112   13974
## 2080    Mississippi                        Jones      68276   33051   35225
## 2081        Alabama                    Covington      37886   18339   19547
## 2082       New York                     Richmond     472481  228703  243778
## 2083       Illinois                   Stephenson      46625   22568   24057
## 2084       Kentucky                     Woodford      25317   12254   13063
## 2085       Kentucky                    Henderson      46396   22456   23940
## 2086        Alabama                     Franklin      31634   15311   16323
## 2087 North Carolina                      Haywood      59170   28638   30532
## 2088       Virginia                      Madison      13147    6363    6784
## 2089 North Carolina                      Robeson     134871   65273   69598
## 2090  West Virginia                      Hancock      30201   14616   15585
## 2091       Arkansas                      Garland      96954   46921   50033
## 2092           Ohio                        Clark     136827   66217   70610
## 2093 South Carolina                        Aiken     163908   79318   84590
## 2094       Missouri                       Howell      40326   19514   20812
## 2095       New York                       Albany     307463  148776  158687
## 2096        Alabama                       Etowah     103766   50207   53559
## 2097 South Carolina                      Calhoun      14958    7237    7721
## 2098       Delaware                   New Castle     549643  265915  283728
## 2099   Pennsylvania                      Montour      18508    8954    9554
## 2100     New Jersey                     Atlantic     275376  133217  142159
## 2101 North Carolina                  Northampton      21011   10164   10847
## 2102   Pennsylvania                      Dauphin     271094  131140  139954
## 2103        Florida                      Alachua     254218  122968  131250
## 2104          Texas                       Lavaca      19549    9456   10093
## 2105        Florida                     Seminole     437346  211546  225800
## 2106       New York                         Erie     921584  445761  475823
## 2107     New Jersey                       Bergen     926330  448049  478281
## 2108  West Virginia                        Mason      27177   13144   14033
## 2109       Missouri                      Jackson     680905  329297  351608
## 2110        Indiana                         Lake     491596  237712  253884
## 2111        Georgia                     Spalding      63873   30884   32989
## 2112        Indiana                  Vanderburgh     181305   87661   93644
## 2113        Georgia                    Jefferson      16374    7916    8458
## 2114        Florida                   Palm Beach    1378806  666577  712229
## 2115       Missouri                       Vernon      20878   10093   10785
## 2116  West Virginia                         Wood      86559   41844   44715
## 2117        Alabama                        Lamar      14133    6832    7301
## 2118        Florida                         Lake     310561  150126  160435
## 2119       Missouri               St. Louis city     317850  153641  164209
## 2120          Texas                        Smith     217552  105157  112395
## 2121       Arkansas                        Union      40633   19640   20993
## 2122       Missouri                        Boone     170770   82533   88237
## 2123  Massachusetts                      Suffolk     758919  366767  392152
## 2124       Oklahoma                   Washington      51760   25011   26749
## 2125       Missouri                       Butler      42951   20753   22198
## 2126     New Jersey                       Camden     511998  247370  264628
## 2127        Florida                      Manatee     343729  166067  177662
## 2128        Alabama                      Colbert      54444   26303   28141
## 2129       New York                       Monroe     749356  362021  387335
## 2130      Wisconsin                    Milwaukee     955939  461804  494135
## 2131      Louisiana                      Rapides     132225   63862   68363
## 2132      Louisiana                       Acadia      62163   30023   32140
## 2133       New York                     Onondaga     468304  226176  242128
## 2134       Arkansas                    Hempstead      22336   10787   11549
## 2135       Virginia         Charlottesville city      45084   21771   23313
## 2136   Rhode Island                         Kent     164958   79650   85308
## 2137   Pennsylvania                   Lackawanna     213459  103058  110401
## 2138        Alabama                   Tuscaloosa     200458   96781  103677
## 2139       New York                  Westchester     967315  466996  500319
## 2140       Kentucky                      Lincoln      24498   11827   12671
## 2141        Georgia                      Chatham     279290  134833  144457
## 2142   Pennsylvania                     Lawrence      89162   43042   46120
## 2143       Arkansas                       Baxter      41040   19811   21229
## 2144       Virginia                  Mecklenburg      31555   15232   16323
## 2145 North Carolina                   Rockingham      92300   44553   47747
## 2146       Kentucky                     Caldwell      12826    6191    6635
## 2147        Georgia                        Towns      10800    5213    5587
## 2148       Maryland                      Charles     152754   73731   79023
## 2149          Texas                       Panola      23900   11536   12364
## 2150 South Carolina                         York     240076  115873  124203
## 2151 South Carolina                     Anderson     191215   92287   98928
## 2152  Massachusetts                    Berkshire     129288   62393   66895
## 2153       Virginia              Alexandria city     149315   72057   77258
## 2154       Kentucky                    Jefferson     755809  364719  391090
## 2155      Louisiana                    Vermilion      59110   28523   30587
## 2156       Virginia               Northumberland      12304    5937    6367
## 2157 North Carolina                         Polk      20327    9808   10519
## 2158        Georgia                        Worth      21156   10208   10948
## 2159        Florida                       Citrus     139654   67381   72273
## 2160    Mississippi                       Rankin     146761   70809   75952
## 2161       Nebraska                 Scotts Bluff      36684   17699   18985
## 2162    Mississippi                       Panola      34373   16584   17789
## 2163          Texas                         Kerr      50149   24194   25955
## 2164        Alabama                      Calhoun     116648   56274   60374
## 2165       Illinois                       Massac      15016    7244    7772
## 2166      Tennessee                        Maury      84089   40561   43528
## 2167          Texas                      Cameron     417947  201583  216364
## 2168 North Carolina                  New Hanover     213091  102777  110314
## 2169       Kentucky                     Calloway      38106   18379   19727
## 2170        Georgia                         Tift      40787   19671   21116
## 2171        Indiana                       Howard      82765   39916   42849
## 2172       Virginia                 Hampton city     137081   66111   70970
## 2173      Tennessee                     Davidson     658506  317582  340924
## 2174       Missouri                      Dunklin      31562   15220   16342
## 2175     New Mexico                     McKinley      73998   35681   38317
## 2176     California                       Madera     153187   73863   79324
## 2177        Georgia                     Stephens      25620   12352   13268
## 2178    Mississippi                   Tishomingo      19539    9420   10119
## 2179    Connecticut                    New Haven     862224  415679  446545
## 2180       Illinois                    St. Clair     267029  128729  138300
## 2181       Delaware                         Kent     169509   81716   87793
## 2182       Michigan                      Genesee     415874  200473  215401
## 2183      Tennessee                     Hamilton     348121  167812  180309
## 2184    Mississippi                       Copiah      28921   13941   14980
## 2185  Massachusetts                      Hampden     468041  225611  242430
## 2186        Georgia                        Troup      68867   33189   35678
## 2187        Indiana                       Marion     926335  446372  479963
## 2188       Illinois                        Coles      53037   25550   27487
## 2189    Mississippi                      Calhoun      14789    7124    7665
## 2190       Virginia                 Suffolk city      86184   41511   44673
## 2191    Mississippi                        Smith      16257    7830    8427
## 2192      Louisiana                    St. James      21650   10426   11224
## 2193  Massachusetts                        Essex     763849  367791  396058
## 2194  West Virginia                      Kanawha     190781   91860   98921
## 2195       Maryland                   Montgomery    1017859  490093  527766
## 2196       Kentucky                       Shelby      44290   21324   22966
## 2197 North Carolina                  Mecklenburg     990288  476765  513523
## 2198          Maine                    Sagadahoc      35092   16894   18198
## 2199        Georgia                         Cook      17033    8200    8833
## 2200       Virginia                 Radford city      17057    8211    8846
## 2201       Virginia                      Amherst      32148   15475   16673
## 2202        Florida                 Indian River     142866   68768   74098
## 2203       Virginia                 Chesterfield     328176  157960  170216
## 2204   Pennsylvania                    Allegheny    1231145  592511  638634
## 2205        Georgia                        Burke      23007   11071   11936
## 2206       Virginia                 Roanoke city      98736   47511   51225
## 2207       Maryland              Prince George's     892816  429603  463213
## 2208           Iowa                   Montgomery      10465    5035    5430
## 2209    Mississippi                    Chickasaw      17391    8367    9024
## 2210   South Dakota                Oglala Lakota      14153    6809    7344
## 2211         Oregon                         Polk      77264   37171   40093
## 2212   Pennsylvania                     Columbia      66912   32181   34731
## 2213 North Carolina                     Buncombe     247336  118950  128386
## 2214           Ohio                   Montgomery     533763  256680  277083
## 2215   Rhode Island                      Bristol      49176   23646   25530
## 2216       Virginia              Portsmouth city      96135   46221   49914
## 2217        Georgia                      Douglas     136520   65637   70883
## 2218       Virginia                        Henry      52580   25278   27302
## 2219 North Carolina                        Swain      14163    6808    7355
## 2220           Ohio                     Hamilton     804194  386561  417633
## 2221       Michigan                        Wayne    1778969  855112  923857
## 2222      Tennessee                    Henderson      28013   13465   14548
## 2223 North Carolina                         Nash      94722   45521   49201
## 2224          Texas                       Morris      12700    6103    6597
## 2225          Texas                      Kendall      37361   17951   19410
## 2226           Iowa                    Poweshiek      18705    8987    9718
## 2227        Florida                     Hernando     174809   83978   90831
## 2228     New Jersey                        Ocean     583450  280258  303192
## 2229     New Jersey                        Essex     791609  380222  411387
## 2230      Minnesota                       Waseca      19076    9162    9914
## 2231 North Carolina                       Chowan      14656    7039    7617
## 2232 North Carolina                        Moore      91743   44062   47681
## 2233       Missouri                       Marion      28840   13850   14990
## 2234       Illinois                     Sangamon     199016   95552  103464
## 2235      Tennessee                        Obion      31129   14945   16184
## 2236        Florida                       Marion     336811  161696  175115
## 2237        Alabama                      Conecuh      12865    6176    6689
## 2238      Louisiana                      De Soto      26965   12944   14021
## 2239 South Carolina                       Sumter     107777   51732   56045
## 2240       Arkansas                        Cross      17467    8384    9083
## 2241    Mississippi                      Neshoba      29553   14185   15368
## 2242      Louisiana                   St. Landry      83613   40129   43484
## 2243   Pennsylvania                     Delaware     561683  269512  292171
## 2244        Indiana                     Delaware     117335   56294   61041
## 2245       Arkansas                       Ashley      21229   10185   11044
## 2246    Mississippi                      Noxubee      11143    5346    5797
## 2247    Mississippi                    Covington      19471    9341   10130
## 2248 North Carolina                      Halifax      53407   25620   27787
## 2249 North Carolina                       Durham     288817  138537  150280
## 2250       Arkansas                      Pulaski     390463  187292  203171
## 2251  West Virginia                         Ohio      43637   20931   22706
## 2252        Florida                     Pinellas     931477  446740  484737
## 2253         Kansas                     Atchison      16633    7977    8656
## 2254      Louisiana             East Baton Rouge     444690  213266  231424
## 2255       Missouri                   New Madrid      18411    8828    9583
## 2256 North Carolina                    Cleveland      97178   46587   50591
## 2257        Arizona                   Santa Cruz      47073   22566   24507
## 2258       Missouri                       Grundy      10256    4916    5340
## 2259       Kentucky                        Rowan      23608   11316   12292
## 2260     Washington                       Asotin      22040   10564   11476
## 2261          Texas                  Nacogdoches      65531   31409   34122
## 2262       Illinois                        Macon     109193   52336   56857
## 2263        Georgia                        Henry     211512  101340  110172
## 2264  Massachusetts                      Norfolk     687721  329431  358290
## 2265    Mississippi                      Lincoln      34765   16653   18112
## 2266     Washington                     San Juan      15956    7643    8313
## 2267      Louisiana                    Bienville      13977    6695    7282
## 2268        Alabama                   Lauderdale      92737   44419   48318
## 2269       Virginia            Falls Church city      13308    6374    6934
## 2270      Tennessee                      Chester      17324    8297    9027
## 2271        Alabama                       Monroe      22217   10639   11578
## 2272       Kentucky                    McCracken      65408   31320   34088
## 2273        Indiana                        Grant      68896   32990   35906
## 2274 South Carolina                        Union      28125   13467   14658
## 2275 South Carolina                     Colleton      38004   18196   19808
## 2276        Georgia                        Upson      26528   12701   13827
## 2277       Virginia                    Albemarle     103108   49365   53743
## 2278        Alabama                        Henry      17252    8259    8993
## 2279        Alabama                      Houston     103534   49563   53971
## 2280        Indiana                    Jefferson      32453   15535   16918
## 2281      Louisiana                     Ouachita     155769   74555   81214
## 2282 South Carolina                    Fairfield      23108   11060   12048
## 2283        Florida                      Flagler     100783   48229   52554
## 2284        Alabama                       Wilcox      11235    5376    5859
## 2285        Georgia                     Bleckley      12746    6099    6647
## 2286        Alabama                       Mobile     414251  198216  216035
## 2287           Iowa                        Union      12575    6015    6560
## 2288      Louisiana                      Orleans     376738  180183  196555
## 2289        Georgia                   Meriwether      21297   10185   11112
## 2290    Mississippi                      Madison     100202   47918   52284
## 2291          Texas                       Marion      10248    4900    5348
## 2292      Louisiana                Pointe Coupee      22498   10757   11741
## 2293       Virginia                      Roanoke      93633   44757   48876
## 2294       Maryland                   Dorchester      32534   15542   16992
## 2295      Louisiana                    Morehouse      27012   12904   14108
## 2296        Georgia                     Ben Hill      17477    8349    9128
## 2297        Georgia                        Lamar      18114    8653    9461
## 2298        Florida                     Sarasota     392038  187242  204796
## 2299    Mississippi                      Leflore      31516   15051   16465
## 2300    Mississippi                       Monroe      36175   17273   18902
## 2301 North Carolina                       Orange     138644   66199   72445
## 2302           Ohio                        Union      53470   25523   27947
## 2303      Louisiana                 Natchitoches      39330   18765   20565
## 2304          Texas                        Llano      19323    9219   10104
## 2305        Alabama                      Lowndes      10742    5125    5617
## 2306        Alabama                         Pike      33155   15818   17337
## 2307       Arkansas                     Arkansas      18731    8936    9795
## 2308        Alabama                     Chambers      34079   16258   17821
## 2309    Mississippi                      Lowndes      59699   28469   31230
## 2310 South Carolina                      Bamberg      15432    7359    8073
## 2311      Tennessee                       Shelby     937750  447161  490589
## 2312 North Carolina                         Clay      10656    5081    5575
## 2313  West Virginia                       Mercer      61891   29509   32382
## 2314 North Carolina                     Alamance     155258   74019   81239
## 2315       Virginia              Waynesboro city      21150   10083   11067
## 2316       Oklahoma                 Pottawatomie      71136   33906   37230
## 2317        Georgia                       DeKalb     716331  341362  374969
## 2318        Georgia                       Morgan      17900    8530    9370
## 2319    Mississippi                       Holmes      18772    8945    9827
## 2320        Alabama                      Choctaw      13395    6382    7013
## 2321       Maryland                     Wicomico     101182   48199   52983
## 2322        Georgia                      Laurens      47886   22810   25076
## 2323    Mississippi                          Lee      85036   40489   44547
## 2324 North Carolina                     Beaufort      47561   22645   24916
## 2325        Alabama                   Tallapoosa      41153   19593   21560
## 2326       Missouri                         Linn      12401    5904    6497
## 2327       Virginia                      Halifax      35506   16904   18602
## 2328  Massachusetts                   Barnstable     214766  102244  112522
## 2329        Georgia                     McDuffie      21582   10274   11308
## 2330       Virginia            Harrisonburg city      51388   24462   26926
## 2331    Mississippi              Jefferson Davis      11941    5684    6257
## 2332    Mississippi                      Grenada      21591   10275   11316
## 2333 North Carolina                       Lenoir      58782   27973   30809
## 2334        Georgia                      Clayton     267234  127166  140068
## 2335       Virginia                Richmond city     213735  101702  112033
## 2336 South Carolina                   Georgetown      60572   28819   31753
## 2337           Ohio                     Cuyahoga    1263189  600864  662325
## 2338    Mississippi                      Forrest      76267   36274   39993
## 2339      Tennessee                       Gibson      49572   23577   25995
## 2340    Mississippi                       Warren      48020   22838   25182
## 2341       Arkansas                     Columbia      24327   11569   12758
## 2342      Louisiana                        Caddo     254742  121129  133613
## 2343 North Carolina                       Wilson      81581   38776   42805
## 2344        Georgia                       Newton     102645   48786   53859
## 2345        Florida                         Leon     282940  134474  148466
## 2346       Missouri                        Adair      25560   12147   13413
## 2347        Georgia                       Clarke     120905   57448   63457
## 2348 North Carolina                     Guilford     506763  240753  266010
## 2349 North Carolina                       Bladen      34720   16491   18229
## 2350        Alabama                   Montgomery     228138  108296  119842
## 2351 North Carolina                      Forsyth     361684  171662  190022
## 2352 North Carolina                   Perquimans      13498    6402    7096
## 2353    Mississippi                         Pike      40075   19003   21072
## 2354       Arkansas                        Clark      22751   10786   11965
## 2355        Georgia                       Elbert      19537    9262   10275
## 2356        Georgia                       Sumter      31429   14894   16535
## 2357        Georgia                       Thomas      44824   21239   23585
## 2358        Florida                      Gadsden      46424   21995   24429
## 2359       Colorado                         Yuma      10185    4825    5360
## 2360       Virginia                   Salem city      25165   11919   13246
## 2361       New York                        Kings    2595259 1229001 1366258
## 2362      Tennessee                      Madison      98184   46495   51689
## 2363       Missouri                    St. Louis    1001327  474150  527177
## 2364        Georgia                     Rockdale      86901   41147   45754
## 2365       Missouri                     Pemiscot      17837    8445    9392
## 2366        Indiana                        Parke      17107    8098    9009
## 2367       Virginia                Hopewell city      22279   10546   11733
## 2368        Georgia                        Crisp      23314   11035   12279
## 2369 South Carolina                       Dillon      31435   14878   16557
## 2370       Maryland                    Baltimore     822959  389418  433541
## 2371    Mississippi                       Clarke      16362    7740    8622
## 2372      Tennessee                     Crockett      14600    6906    7694
## 2373        Alabama                    Jefferson     659026  311581  347445
## 2374        Georgia                        Glynn      81743   38647   43096
## 2375        Georgia                    Habersham      43527   20573   22954
## 2376 South Carolina                   Darlington      67922   32091   35831
## 2377   Pennsylvania                 Philadelphia    1555072  734521  820551
## 2378       New York                     New York    1629507  769434  860073
## 2379        Alabama                       Clarke      25070   11834   13236
## 2380       Virginia                      Henrico     318864  150466  168398
## 2381       Arkansas                   Crittenden      49765   23482   26283
## 2382 North Carolina                         Pitt     173798   82005   91793
## 2383    Mississippi                       Newton      21663   10217   11446
## 2384        Georgia                         Bibb     154816   73011   81805
## 2385 North Carolina                       Martin      23729   11186   12543
## 2386       Maryland               Baltimore city     622454  293366  329088
## 2387       Maryland                       Talbot      37799   17811   19988
## 2388 South Carolina                   Orangeburg      90575   42660   47915
## 2389       Maryland                         Kent      19923    9382   10541
## 2390       Virginia                    Lancaster      11129    5240    5889
## 2391        Alabama                         Hale      15256    7183    8073
## 2392       New York                        Bronx    1428357  672447  755910
## 2393       Arkansas                     Ouachita      25044   11790   13254
## 2394    Mississippi                       Attala      19175    9024   10151
## 2395          Texas                        Falls      17410    8192    9218
## 2396       Virginia               Lynchburg city      78158   36717   41441
## 2397    Mississippi                      Bolivar      33803   15874   17929
## 2398    Mississippi                       Tunica      10477    4916    5561
## 2399    Mississippi                   Montgomery      10464    4907    5557
## 2400    Mississippi                   Washington      49499   23204   26295
## 2401    Mississippi                        Hinds     245874  115233  130641
## 2402        Georgia                       Toombs      27210   12743   14467
## 2403  Massachusetts                    Hampshire     160759   75286   85473
## 2404      Tennessee                      Haywood      18248    8544    9704
## 2405 North Carolina                        Vance      44829   20984   23845
## 2406 South Carolina                     Florence     138330   64688   73642
## 2407       Arkansas                     Phillips      20391    9529   10862
## 2408        Alabama                       Butler      20354    9502   10852
## 2409 South Carolina                    Greenwood      69771   32567   37204
## 2410       Virginia          Fredericksburg city      27395   12783   14612
## 2411 South Carolina                     Barnwell      22098   10290   11808
## 2412       Virginia                 Bristol city      17524    8158    9366
## 2413        Alabama                      Marengo      20306    9452   10854
## 2414        Georgia                    Dougherty      93310   43432   49878
## 2415    Mississippi                         Clay      20279    9435   10844
## 2416       Virginia        Colonial Heights city      17515    8142    9373
## 2417 North Carolina                    Edgecombe      55280   25693   29587
## 2418        Georgia                        Evans      10814    5024    5790
## 2419       Arkansas                        Desha      12379    5741    6638
## 2420        Alabama                        Perry      10038    4651    5387
## 2421        Georgia                        Early      10579    4899    5680
## 2422       Virginia                        Essex      11151    5158    5993
## 2423    Mississippi                      Coahoma      25254   11681   13573
## 2424        Alabama                       Dallas      42154   19450   22704
## 2425 North Carolina                   Washington      12668    5828    6840
## 2426       Virginia              Petersburg city      32123   14773   17350
## 2427 South Carolina                       Marion      32167   14778   17389
## 2428       Virginia                     Fluvanna      26014   11924   14090
## 2429        Alabama                        Macon      20018    9166   10852
## 2430       Virginia                Danville city      42450   19411   23039
## 2431  West Virginia                      Summers      13544    6185    7359
## 2432       Missouri                      Audrain      25783   11730   14053
## 2433       Virginia            Martinsville city      13624    6155    7469
## 2434       Virginia            Williamsburg city      14754    6665    8089
## 2435       Missouri                   Livingston      15042    6787    8255
## 2436       Virginia                Staunton city      24193   10861   13332
## 2437        Alabama                       Sumter      13341    5905    7436
## 2438        Georgia                      Pulaski      11590    4866    6724
##      proportion_men
## 1         0.6852664
## 2         0.6683412
## 3         0.6664428
## 4         0.6635096
## 5         0.6470937
## 6         0.6332966
## 7         0.6321389
## 8         0.6249458
## 9         0.6210034
## 10        0.6124320
## 11        0.6112928
## 12        0.6038764
## 13        0.6023619
## 14        0.5998800
## 15        0.5954275
## 16        0.5951589
## 17        0.5913704
## 18        0.5902136
## 19        0.5850491
## 20        0.5847953
## 21        0.5839319
## 22        0.5788255
## 23        0.5768762
## 24        0.5763069
## 25        0.5761649
## 26        0.5742234
## 27        0.5723902
## 28        0.5695578
## 29        0.5686432
## 30        0.5676776
## 31        0.5664510
## 32        0.5663861
## 33        0.5651515
## 34        0.5644968
## 35        0.5634119
## 36        0.5609382
## 37        0.5600125
## 38        0.5578900
## 39        0.5574929
## 40        0.5568216
## 41        0.5565147
## 42        0.5560206
## 43        0.5557755
## 44        0.5551910
## 45        0.5548438
## 46        0.5547875
## 47        0.5546830
## 48        0.5539035
## 49        0.5514617
## 50        0.5509555
## 51        0.5506250
## 52        0.5500977
## 53        0.5498272
## 54        0.5497941
## 55        0.5496618
## 56        0.5488540
## 57        0.5484001
## 58        0.5482818
## 59        0.5479140
## 60        0.5472411
## 61        0.5468165
## 62        0.5467670
## 63        0.5466516
## 64        0.5466345
## 65        0.5465288
## 66        0.5459051
## 67        0.5446497
## 68        0.5445854
## 69        0.5435337
## 70        0.5431775
## 71        0.5429961
## 72        0.5424221
## 73        0.5421835
## 74        0.5421713
## 75        0.5418414
## 76        0.5416293
## 77        0.5414757
## 78        0.5413529
## 79        0.5409380
## 80        0.5408069
## 81        0.5406977
## 82        0.5404425
## 83        0.5402019
## 84        0.5399146
## 85        0.5394846
## 86        0.5394472
## 87        0.5391504
## 88        0.5391001
## 89        0.5387404
## 90        0.5384320
## 91        0.5384291
## 92        0.5382816
## 93        0.5380133
## 94        0.5378221
## 95        0.5376388
## 96        0.5374198
## 97        0.5371947
## 98        0.5371245
## 99        0.5368768
## 100       0.5365331
## 101       0.5363522
## 102       0.5360565
## 103       0.5359692
## 104       0.5359155
## 105       0.5352167
## 106       0.5351466
## 107       0.5350834
## 108       0.5349581
## 109       0.5348103
## 110       0.5344593
## 111       0.5344191
## 112       0.5341090
## 113       0.5340886
## 114       0.5335508
## 115       0.5333476
## 116       0.5332036
## 117       0.5331332
## 118       0.5327901
## 119       0.5323549
## 120       0.5310950
## 121       0.5310727
## 122       0.5309632
## 123       0.5305823
## 124       0.5304217
## 125       0.5304131
## 126       0.5303320
## 127       0.5302109
## 128       0.5301378
## 129       0.5301240
## 130       0.5300618
## 131       0.5297929
## 132       0.5296249
## 133       0.5295653
## 134       0.5295090
## 135       0.5294619
## 136       0.5294237
## 137       0.5291094
## 138       0.5291009
## 139       0.5290971
## 140       0.5289940
## 141       0.5288492
## 142       0.5285089
## 143       0.5284615
## 144       0.5283722
## 145       0.5277350
## 146       0.5271739
## 147       0.5270030
## 148       0.5269368
## 149       0.5265921
## 150       0.5264831
## 151       0.5261470
## 152       0.5260464
## 153       0.5260160
## 154       0.5256343
## 155       0.5255952
## 156       0.5255599
## 157       0.5254566
## 158       0.5254414
## 159       0.5253446
## 160       0.5252918
## 161       0.5252822
## 162       0.5250271
## 163       0.5248528
## 164       0.5247691
## 165       0.5246308
## 166       0.5246058
## 167       0.5245973
## 168       0.5244272
## 169       0.5242114
## 170       0.5239756
## 171       0.5239334
## 172       0.5238287
## 173       0.5238276
## 174       0.5235329
## 175       0.5232634
## 176       0.5231517
## 177       0.5229947
## 178       0.5229878
## 179       0.5228763
## 180       0.5228104
## 181       0.5226911
## 182       0.5226050
## 183       0.5223119
## 184       0.5222401
## 185       0.5222361
## 186       0.5220009
## 187       0.5218087
## 188       0.5217922
## 189       0.5217323
## 190       0.5217126
## 191       0.5215514
## 192       0.5213063
## 193       0.5212531
## 194       0.5212382
## 195       0.5211758
## 196       0.5211168
## 197       0.5210358
## 198       0.5209993
## 199       0.5209109
## 200       0.5207413
## 201       0.5207169
## 202       0.5206983
## 203       0.5206695
## 204       0.5204873
## 205       0.5201897
## 206       0.5201440
## 207       0.5200542
## 208       0.5196154
## 209       0.5195396
## 210       0.5193536
## 211       0.5191634
## 212       0.5190653
## 213       0.5190373
## 214       0.5189011
## 215       0.5187897
## 216       0.5187849
## 217       0.5186055
## 218       0.5185934
## 219       0.5185811
## 220       0.5183810
## 221       0.5183115
## 222       0.5182827
## 223       0.5181774
## 224       0.5181150
## 225       0.5180965
## 226       0.5180031
## 227       0.5179666
## 228       0.5178788
## 229       0.5178339
## 230       0.5177893
## 231       0.5177740
## 232       0.5177735
## 233       0.5176723
## 234       0.5174872
## 235       0.5173886
## 236       0.5173870
## 237       0.5173583
## 238       0.5171760
## 239       0.5170992
## 240       0.5169967
## 241       0.5169474
## 242       0.5168245
## 243       0.5166734
## 244       0.5166505
## 245       0.5166247
## 246       0.5165851
## 247       0.5164770
## 248       0.5162673
## 249       0.5162373
## 250       0.5161588
## 251       0.5161032
## 252       0.5159773
## 253       0.5159275
## 254       0.5158501
## 255       0.5158149
## 256       0.5158096
## 257       0.5157976
## 258       0.5157361
## 259       0.5156958
## 260       0.5156585
## 261       0.5155546
## 262       0.5155197
## 263       0.5154802
## 264       0.5154534
## 265       0.5154138
## 266       0.5154048
## 267       0.5153124
## 268       0.5152596
## 269       0.5152477
## 270       0.5152178
## 271       0.5152062
## 272       0.5150806
## 273       0.5149438
## 274       0.5148590
## 275       0.5147941
## 276       0.5146698
## 277       0.5146515
## 278       0.5146266
## 279       0.5145922
## 280       0.5145409
## 281       0.5144950
## 282       0.5143199
## 283       0.5139574
## 284       0.5137651
## 285       0.5137601
## 286       0.5136881
## 287       0.5136866
## 288       0.5136833
## 289       0.5136737
## 290       0.5136165
## 291       0.5135481
## 292       0.5134903
## 293       0.5134066
## 294       0.5133379
## 295       0.5132464
## 296       0.5132111
## 297       0.5131787
## 298       0.5129367
## 299       0.5128904
## 300       0.5128054
## 301       0.5128005
## 302       0.5124979
## 303       0.5124603
## 304       0.5122870
## 305       0.5120400
## 306       0.5119475
## 307       0.5119305
## 308       0.5118534
## 309       0.5118022
## 310       0.5117515
## 311       0.5117336
## 312       0.5116785
## 313       0.5116658
## 314       0.5116365
## 315       0.5116082
## 316       0.5115821
## 317       0.5113533
## 318       0.5112943
## 319       0.5112593
## 320       0.5112399
## 321       0.5112366
## 322       0.5112236
## 323       0.5111366
## 324       0.5111271
## 325       0.5110881
## 326       0.5110526
## 327       0.5110230
## 328       0.5109670
## 329       0.5109421
## 330       0.5109401
## 331       0.5109145
## 332       0.5109131
## 333       0.5108860
## 334       0.5108345
## 335       0.5108102
## 336       0.5106574
## 337       0.5105767
## 338       0.5105374
## 339       0.5104448
## 340       0.5104241
## 341       0.5103516
## 342       0.5103273
## 343       0.5103000
## 344       0.5102475
## 345       0.5101043
## 346       0.5101024
## 347       0.5100665
## 348       0.5100553
## 349       0.5100489
## 350       0.5099786
## 351       0.5098862
## 352       0.5098460
## 353       0.5097759
## 354       0.5097444
## 355       0.5097314
## 356       0.5095779
## 357       0.5095331
## 358       0.5095298
## 359       0.5094756
## 360       0.5094464
## 361       0.5094452
## 362       0.5094304
## 363       0.5094242
## 364       0.5094170
## 365       0.5094111
## 366       0.5093015
## 367       0.5092691
## 368       0.5092287
## 369       0.5091699
## 370       0.5091623
## 371       0.5091308
## 372       0.5091231
## 373       0.5091041
## 374       0.5089959
## 375       0.5089532
## 376       0.5089411
## 377       0.5088537
## 378       0.5087705
## 379       0.5086792
## 380       0.5086715
## 381       0.5086589
## 382       0.5086005
## 383       0.5085503
## 384       0.5085360
## 385       0.5085348
## 386       0.5085194
## 387       0.5085175
## 388       0.5084354
## 389       0.5084184
## 390       0.5083992
## 391       0.5083696
## 392       0.5082854
## 393       0.5082543
## 394       0.5081701
## 395       0.5081005
## 396       0.5080613
## 397       0.5080311
## 398       0.5079346
## 399       0.5079184
## 400       0.5078866
## 401       0.5078786
## 402       0.5078781
## 403       0.5077948
## 404       0.5077667
## 405       0.5077371
## 406       0.5077287
## 407       0.5076558
## 408       0.5076496
## 409       0.5076346
## 410       0.5076263
## 411       0.5075553
## 412       0.5075441
## 413       0.5075341
## 414       0.5074673
## 415       0.5073644
## 416       0.5073598
## 417       0.5072705
## 418       0.5072343
## 419       0.5072106
## 420       0.5071908
## 421       0.5071643
## 422       0.5071618
## 423       0.5071256
## 424       0.5070698
## 425       0.5070202
## 426       0.5070200
## 427       0.5070100
## 428       0.5069612
## 429       0.5069419
## 430       0.5069396
## 431       0.5069327
## 432       0.5069131
## 433       0.5069008
## 434       0.5068722
## 435       0.5068372
## 436       0.5068087
## 437       0.5067136
## 438       0.5066906
## 439       0.5066838
## 440       0.5065907
## 441       0.5065482
## 442       0.5064425
## 443       0.5064271
## 444       0.5063655
## 445       0.5062918
## 446       0.5062576
## 447       0.5062165
## 448       0.5061954
## 449       0.5061937
## 450       0.5061853
## 451       0.5061479
## 452       0.5061371
## 453       0.5061282
## 454       0.5061216
## 455       0.5061179
## 456       0.5060972
## 457       0.5060965
## 458       0.5060788
## 459       0.5060666
## 460       0.5060528
## 461       0.5059973
## 462       0.5059815
## 463       0.5059400
## 464       0.5059344
## 465       0.5059151
## 466       0.5058097
## 467       0.5057832
## 468       0.5056773
## 469       0.5056408
## 470       0.5055937
## 471       0.5055838
## 472       0.5055762
## 473       0.5055528
## 474       0.5055215
## 475       0.5055056
## 476       0.5054765
## 477       0.5054336
## 478       0.5054121
## 479       0.5053508
## 480       0.5053316
## 481       0.5053121
## 482       0.5052778
## 483       0.5051808
## 484       0.5051435
## 485       0.5051265
## 486       0.5051137
## 487       0.5050864
## 488       0.5050742
## 489       0.5050641
## 490       0.5050152
## 491       0.5048626
## 492       0.5047981
## 493       0.5047930
## 494       0.5047869
## 495       0.5047856
## 496       0.5047776
## 497       0.5047775
## 498       0.5047707
## 499       0.5047669
## 500       0.5047577
## 501       0.5047498
## 502       0.5046857
## 503       0.5046607
## 504       0.5046016
## 505       0.5045936
## 506       0.5045834
## 507       0.5045818
## 508       0.5044663
## 509       0.5043851
## 510       0.5043529
## 511       0.5043322
## 512       0.5042802
## 513       0.5042754
## 514       0.5042271
## 515       0.5041870
## 516       0.5041789
## 517       0.5041551
## 518       0.5041136
## 519       0.5040743
## 520       0.5040456
## 521       0.5040308
## 522       0.5040063
## 523       0.5040051
## 524       0.5039691
## 525       0.5039688
## 526       0.5039472
## 527       0.5039466
## 528       0.5038691
## 529       0.5038411
## 530       0.5037218
## 531       0.5037115
## 532       0.5037057
## 533       0.5036992
## 534       0.5036839
## 535       0.5036718
## 536       0.5036606
## 537       0.5036380
## 538       0.5036374
## 539       0.5036373
## 540       0.5036028
## 541       0.5035938
## 542       0.5035843
## 543       0.5035225
## 544       0.5035182
## 545       0.5035137
## 546       0.5034782
## 547       0.5034655
## 548       0.5034644
## 549       0.5034349
## 550       0.5034268
## 551       0.5034125
## 552       0.5034113
## 553       0.5034069
## 554       0.5034039
## 555       0.5033945
## 556       0.5033930
## 557       0.5033874
## 558       0.5033651
## 559       0.5033393
## 560       0.5033192
## 561       0.5033175
## 562       0.5032969
## 563       0.5032880
## 564       0.5032553
## 565       0.5032303
## 566       0.5031918
## 567       0.5031419
## 568       0.5031385
## 569       0.5031197
## 570       0.5031113
## 571       0.5031037
## 572       0.5030966
## 573       0.5030961
## 574       0.5030953
## 575       0.5030831
## 576       0.5030813
## 577       0.5030626
## 578       0.5030591
## 579       0.5030326
## 580       0.5030209
## 581       0.5030115
## 582       0.5029550
## 583       0.5029461
## 584       0.5029432
## 585       0.5029253
## 586       0.5029166
## 587       0.5027902
## 588       0.5027857
## 589       0.5027824
## 590       0.5027785
## 591       0.5027567
## 592       0.5027070
## 593       0.5026980
## 594       0.5026967
## 595       0.5026859
## 596       0.5026832
## 597       0.5026738
## 598       0.5026556
## 599       0.5026470
## 600       0.5026387
## 601       0.5026178
## 602       0.5026145
## 603       0.5026120
## 604       0.5026117
## 605       0.5025321
## 606       0.5025168
## 607       0.5025126
## 608       0.5024780
## 609       0.5024649
## 610       0.5024597
## 611       0.5024341
## 612       0.5024334
## 613       0.5024060
## 614       0.5023555
## 615       0.5023530
## 616       0.5023492
## 617       0.5023326
## 618       0.5023252
## 619       0.5023214
## 620       0.5022950
## 621       0.5022929
## 622       0.5022679
## 623       0.5022392
## 624       0.5022286
## 625       0.5022135
## 626       0.5022090
## 627       0.5022052
## 628       0.5021799
## 629       0.5021494
## 630       0.5021167
## 631       0.5021128
## 632       0.5020609
## 633       0.5020518
## 634       0.5020486
## 635       0.5020364
## 636       0.5020240
## 637       0.5019765
## 638       0.5019727
## 639       0.5019647
## 640       0.5019646
## 641       0.5019485
## 642       0.5019339
## 643       0.5019322
## 644       0.5019066
## 645       0.5018962
## 646       0.5018915
## 647       0.5018868
## 648       0.5018817
## 649       0.5018757
## 650       0.5018008
## 651       0.5017847
## 652       0.5017583
## 653       0.5017397
## 654       0.5017389
## 655       0.5017164
## 656       0.5017071
## 657       0.5017052
## 658       0.5016477
## 659       0.5016276
## 660       0.5015938
## 661       0.5015899
## 662       0.5015881
## 663       0.5015842
## 664       0.5015052
## 665       0.5014968
## 666       0.5014751
## 667       0.5014693
## 668       0.5014643
## 669       0.5014491
## 670       0.5014079
## 671       0.5013998
## 672       0.5013893
## 673       0.5013889
## 674       0.5013864
## 675       0.5013752
## 676       0.5013570
## 677       0.5013450
## 678       0.5013018
## 679       0.5012803
## 680       0.5012484
## 681       0.5012294
## 682       0.5011989
## 683       0.5011756
## 684       0.5011515
## 685       0.5011275
## 686       0.5011147
## 687       0.5010829
## 688       0.5010571
## 689       0.5010530
## 690       0.5010518
## 691       0.5010450
## 692       0.5010443
## 693       0.5010340
## 694       0.5010326
## 695       0.5010066
## 696       0.5009772
## 697       0.5009745
## 698       0.5009657
## 699       0.5009462
## 700       0.5009338
## 701       0.5009043
## 702       0.5009019
## 703       0.5008982
## 704       0.5008972
## 705       0.5008916
## 706       0.5008848
## 707       0.5008788
## 708       0.5008532
## 709       0.5008211
## 710       0.5008205
## 711       0.5008176
## 712       0.5007753
## 713       0.5007429
## 714       0.5007305
## 715       0.5007302
## 716       0.5007225
## 717       0.5007112
## 718       0.5007082
## 719       0.5006957
## 720       0.5006456
## 721       0.5006446
## 722       0.5006438
## 723       0.5006395
## 724       0.5006376
## 725       0.5006292
## 726       0.5006283
## 727       0.5006246
## 728       0.5006237
## 729       0.5006128
## 730       0.5006064
## 731       0.5006022
## 732       0.5005843
## 733       0.5005735
## 734       0.5005687
## 735       0.5005593
## 736       0.5005514
## 737       0.5005266
## 738       0.5005220
## 739       0.5005117
## 740       0.5005062
## 741       0.5004967
## 742       0.5004858
## 743       0.5004856
## 744       0.5004822
## 745       0.5004638
## 746       0.5004316
## 747       0.5003925
## 748       0.5003831
## 749       0.5003231
## 750       0.5003182
## 751       0.5003169
## 752       0.5003078
## 753       0.5002901
## 754       0.5002533
## 755       0.5002498
## 756       0.5002357
## 757       0.5002185
## 758       0.5001786
## 759       0.5001719
## 760       0.5001496
## 761       0.5001419
## 762       0.5001300
## 763       0.5001235
## 764       0.5001207
## 765       0.5001185
## 766       0.5000605
## 767       0.5000600
## 768       0.5000565
## 769       0.5000523
## 770       0.5000000
## 771       0.4999899
## 772       0.4999790
## 773       0.4999687
## 774       0.4999507
## 775       0.4999468
## 776       0.4999396
## 777       0.4999301
## 778       0.4999235
## 779       0.4999230
## 780       0.4999176
## 781       0.4999143
## 782       0.4999044
## 783       0.4998700
## 784       0.4998570
## 785       0.4998507
## 786       0.4997968
## 787       0.4997899
## 788       0.4997580
## 789       0.4997570
## 790       0.4997552
## 791       0.4997356
## 792       0.4997320
## 793       0.4996710
## 794       0.4996673
## 795       0.4996505
## 796       0.4996454
## 797       0.4996346
## 798       0.4996335
## 799       0.4996269
## 800       0.4996069
## 801       0.4995799
## 802       0.4995502
## 803       0.4995360
## 804       0.4995215
## 805       0.4995089
## 806       0.4994999
## 807       0.4994824
## 808       0.4994768
## 809       0.4994749
## 810       0.4994711
## 811       0.4994411
## 812       0.4993978
## 813       0.4993395
## 814       0.4993258
## 815       0.4992801
## 816       0.4992602
## 817       0.4992368
## 818       0.4992315
## 819       0.4992055
## 820       0.4991922
## 821       0.4991895
## 822       0.4991557
## 823       0.4991518
## 824       0.4991205
## 825       0.4991142
## 826       0.4991054
## 827       0.4991006
## 828       0.4990732
## 829       0.4990634
## 830       0.4990626
## 831       0.4990343
## 832       0.4990336
## 833       0.4990323
## 834       0.4989690
## 835       0.4989630
## 836       0.4989445
## 837       0.4989319
## 838       0.4989144
## 839       0.4988961
## 840       0.4988957
## 841       0.4988936
## 842       0.4988902
## 843       0.4988871
## 844       0.4988850
## 845       0.4988747
## 846       0.4988718
## 847       0.4988231
## 848       0.4988169
## 849       0.4987743
## 850       0.4987716
## 851       0.4987344
## 852       0.4987148
## 853       0.4987101
## 854       0.4986945
## 855       0.4986772
## 856       0.4986318
## 857       0.4986224
## 858       0.4986220
## 859       0.4986212
## 860       0.4985887
## 861       0.4985660
## 862       0.4985564
## 863       0.4985476
## 864       0.4985224
## 865       0.4985205
## 866       0.4985099
## 867       0.4984845
## 868       0.4984651
## 869       0.4984592
## 870       0.4984586
## 871       0.4984586
## 872       0.4984562
## 873       0.4984420
## 874       0.4984306
## 875       0.4984251
## 876       0.4984092
## 877       0.4984088
## 878       0.4983733
## 879       0.4983729
## 880       0.4983495
## 881       0.4983380
## 882       0.4983259
## 883       0.4983205
## 884       0.4983203
## 885       0.4983192
## 886       0.4983170
## 887       0.4982993
## 888       0.4982838
## 889       0.4982544
## 890       0.4982520
## 891       0.4982515
## 892       0.4982257
## 893       0.4982117
## 894       0.4981924
## 895       0.4981774
## 896       0.4981627
## 897       0.4981454
## 898       0.4981309
## 899       0.4981303
## 900       0.4981163
## 901       0.4981079
## 902       0.4981010
## 903       0.4980947
## 904       0.4980912
## 905       0.4980853
## 906       0.4980770
## 907       0.4980667
## 908       0.4980578
## 909       0.4980524
## 910       0.4980486
## 911       0.4980267
## 912       0.4980015
## 913       0.4979955
## 914       0.4979845
## 915       0.4979757
## 916       0.4979669
## 917       0.4979565
## 918       0.4979471
## 919       0.4979426
## 920       0.4978980
## 921       0.4978680
## 922       0.4978663
## 923       0.4978479
## 924       0.4978217
## 925       0.4978209
## 926       0.4978094
## 927       0.4978008
## 928       0.4977922
## 929       0.4977666
## 930       0.4977650
## 931       0.4977593
## 932       0.4977571
## 933       0.4977543
## 934       0.4976792
## 935       0.4976618
## 936       0.4976542
## 937       0.4976524
## 938       0.4976195
## 939       0.4976194
## 940       0.4976109
## 941       0.4976069
## 942       0.4975897
## 943       0.4975793
## 944       0.4975764
## 945       0.4975388
## 946       0.4975373
## 947       0.4975309
## 948       0.4975107
## 949       0.4975088
## 950       0.4975032
## 951       0.4974921
## 952       0.4974878
## 953       0.4974663
## 954       0.4974549
## 955       0.4974371
## 956       0.4974347
## 957       0.4974080
## 958       0.4973877
## 959       0.4973707
## 960       0.4973608
## 961       0.4973581
## 962       0.4973509
## 963       0.4973469
## 964       0.4973469
## 965       0.4973225
## 966       0.4973194
## 967       0.4973185
## 968       0.4973091
## 969       0.4973024
## 970       0.4972920
## 971       0.4972795
## 972       0.4972775
## 973       0.4972606
## 974       0.4972467
## 975       0.4972444
## 976       0.4972092
## 977       0.4971826
## 978       0.4971500
## 979       0.4971297
## 980       0.4971180
## 981       0.4970825
## 982       0.4970470
## 983       0.4970396
## 984       0.4970265
## 985       0.4970232
## 986       0.4970182
## 987       0.4970146
## 988       0.4970064
## 989       0.4970021
## 990       0.4969934
## 991       0.4969926
## 992       0.4969832
## 993       0.4969485
## 994       0.4969434
## 995       0.4969290
## 996       0.4969166
## 997       0.4969097
## 998       0.4969068
## 999       0.4968929
## 1000      0.4968911
## 1001      0.4968806
## 1002      0.4968762
## 1003      0.4968680
## 1004      0.4968465
## 1005      0.4968416
## 1006      0.4968225
## 1007      0.4967843
## 1008      0.4967805
## 1009      0.4967792
## 1010      0.4967735
## 1011      0.4967516
## 1012      0.4967445
## 1013      0.4967320
## 1014      0.4967298
## 1015      0.4967046
## 1016      0.4966631
## 1017      0.4966336
## 1018      0.4966133
## 1019      0.4965820
## 1020      0.4965712
## 1021      0.4965649
## 1022      0.4965478
## 1023      0.4965417
## 1024      0.4965366
## 1025      0.4965235
## 1026      0.4965084
## 1027      0.4965080
## 1028      0.4965051
## 1029      0.4964999
## 1030      0.4964936
## 1031      0.4964912
## 1032      0.4964787
## 1033      0.4964616
## 1034      0.4964402
## 1035      0.4964327
## 1036      0.4964236
## 1037      0.4963920
## 1038      0.4963918
## 1039      0.4963802
## 1040      0.4963798
## 1041      0.4963750
## 1042      0.4963723
## 1043      0.4963603
## 1044      0.4963358
## 1045      0.4963079
## 1046      0.4963060
## 1047      0.4963052
## 1048      0.4962919
## 1049      0.4962603
## 1050      0.4962548
## 1051      0.4962241
## 1052      0.4962011
## 1053      0.4961981
## 1054      0.4961580
## 1055      0.4961477
## 1056      0.4961191
## 1057      0.4961086
## 1058      0.4960974
## 1059      0.4960970
## 1060      0.4960965
## 1061      0.4960761
## 1062      0.4960749
## 1063      0.4960686
## 1064      0.4960666
## 1065      0.4960526
## 1066      0.4960268
## 1067      0.4960237
## 1068      0.4960075
## 1069      0.4959953
## 1070      0.4959921
## 1071      0.4959740
## 1072      0.4959704
## 1073      0.4959696
## 1074      0.4959574
## 1075      0.4959508
## 1076      0.4959498
## 1077      0.4959203
## 1078      0.4959197
## 1079      0.4958998
## 1080      0.4958883
## 1081      0.4958775
## 1082      0.4958762
## 1083      0.4958299
## 1084      0.4958278
## 1085      0.4958255
## 1086      0.4958254
## 1087      0.4958098
## 1088      0.4957380
## 1089      0.4957280
## 1090      0.4957194
## 1091      0.4956882
## 1092      0.4956834
## 1093      0.4956756
## 1094      0.4956646
## 1095      0.4956327
## 1096      0.4956107
## 1097      0.4956065
## 1098      0.4955980
## 1099      0.4955919
## 1100      0.4955915
## 1101      0.4955861
## 1102      0.4955814
## 1103      0.4955564
## 1104      0.4955537
## 1105      0.4955478
## 1106      0.4955390
## 1107      0.4955045
## 1108      0.4954986
## 1109      0.4954879
## 1110      0.4954847
## 1111      0.4954812
## 1112      0.4954777
## 1113      0.4954766
## 1114      0.4954757
## 1115      0.4954689
## 1116      0.4954592
## 1117      0.4954540
## 1118      0.4954254
## 1119      0.4954112
## 1120      0.4954102
## 1121      0.4954080
## 1122      0.4953937
## 1123      0.4953892
## 1124      0.4953877
## 1125      0.4953867
## 1126      0.4953858
## 1127      0.4953840
## 1128      0.4953837
## 1129      0.4953790
## 1130      0.4953622
## 1131      0.4953537
## 1132      0.4953523
## 1133      0.4953491
## 1134      0.4953452
## 1135      0.4953370
## 1136      0.4953232
## 1137      0.4953231
## 1138      0.4952918
## 1139      0.4952911
## 1140      0.4952807
## 1141      0.4952788
## 1142      0.4952764
## 1143      0.4952659
## 1144      0.4952537
## 1145      0.4952497
## 1146      0.4952487
## 1147      0.4952458
## 1148      0.4952225
## 1149      0.4952194
## 1150      0.4952181
## 1151      0.4952033
## 1152      0.4952009
## 1153      0.4951608
## 1154      0.4951583
## 1155      0.4951504
## 1156      0.4951306
## 1157      0.4951154
## 1158      0.4951106
## 1159      0.4950844
## 1160      0.4950817
## 1161      0.4950671
## 1162      0.4950544
## 1163      0.4950468
## 1164      0.4950411
## 1165      0.4950396
## 1166      0.4950356
## 1167      0.4950195
## 1168      0.4950141
## 1169      0.4950015
## 1170      0.4949968
## 1171      0.4949964
## 1172      0.4949880
## 1173      0.4949690
## 1174      0.4949501
## 1175      0.4949479
## 1176      0.4949377
## 1177      0.4949229
## 1178      0.4949013
## 1179      0.4948855
## 1180      0.4948552
## 1181      0.4948526
## 1182      0.4948441
## 1183      0.4948394
## 1184      0.4948019
## 1185      0.4947902
## 1186      0.4947845
## 1187      0.4947667
## 1188      0.4947625
## 1189      0.4947566
## 1190      0.4947497
## 1191      0.4947447
## 1192      0.4947422
## 1193      0.4947331
## 1194      0.4947280
## 1195      0.4947159
## 1196      0.4947119
## 1197      0.4947058
## 1198      0.4946929
## 1199      0.4946895
## 1200      0.4946873
## 1201      0.4946833
## 1202      0.4946786
## 1203      0.4946626
## 1204      0.4946566
## 1205      0.4946548
## 1206      0.4946471
## 1207      0.4946331
## 1208      0.4946277
## 1209      0.4945978
## 1210      0.4945886
## 1211      0.4945759
## 1212      0.4945753
## 1213      0.4945747
## 1214      0.4945597
## 1215      0.4945495
## 1216      0.4945400
## 1217      0.4944973
## 1218      0.4944679
## 1219      0.4944668
## 1220      0.4944639
## 1221      0.4944619
## 1222      0.4944543
## 1223      0.4944516
## 1224      0.4944500
## 1225      0.4944420
## 1226      0.4944333
## 1227      0.4944282
## 1228      0.4944206
## 1229      0.4943985
## 1230      0.4943963
## 1231      0.4943933
## 1232      0.4943474
## 1233      0.4943438
## 1234      0.4943351
## 1235      0.4943258
## 1236      0.4943210
## 1237      0.4942979
## 1238      0.4942975
## 1239      0.4942798
## 1240      0.4942779
## 1241      0.4942701
## 1242      0.4942682
## 1243      0.4942633
## 1244      0.4942444
## 1245      0.4942316
## 1246      0.4942276
## 1247      0.4942209
## 1248      0.4942190
## 1249      0.4942190
## 1250      0.4942162
## 1251      0.4942116
## 1252      0.4942052
## 1253      0.4941966
## 1254      0.4941824
## 1255      0.4941778
## 1256      0.4941772
## 1257      0.4941772
## 1258      0.4941702
## 1259      0.4941661
## 1260      0.4941646
## 1261      0.4941489
## 1262      0.4941361
## 1263      0.4941358
## 1264      0.4941320
## 1265      0.4941218
## 1266      0.4941082
## 1267      0.4941066
## 1268      0.4940849
## 1269      0.4940841
## 1270      0.4940757
## 1271      0.4940626
## 1272      0.4940565
## 1273      0.4940558
## 1274      0.4940425
## 1275      0.4940315
## 1276      0.4940099
## 1277      0.4940037
## 1278      0.4939874
## 1279      0.4939847
## 1280      0.4939652
## 1281      0.4939634
## 1282      0.4939537
## 1283      0.4939440
## 1284      0.4939387
## 1285      0.4939212
## 1286      0.4939142
## 1287      0.4938962
## 1288      0.4938940
## 1289      0.4938596
## 1290      0.4938575
## 1291      0.4938483
## 1292      0.4938426
## 1293      0.4938123
## 1294      0.4938039
## 1295      0.4937963
## 1296      0.4937638
## 1297      0.4937635
## 1298      0.4937623
## 1299      0.4937610
## 1300      0.4937529
## 1301      0.4937480
## 1302      0.4937445
## 1303      0.4937407
## 1304      0.4937405
## 1305      0.4937404
## 1306      0.4937377
## 1307      0.4937246
## 1308      0.4937213
## 1309      0.4937115
## 1310      0.4937086
## 1311      0.4936997
## 1312      0.4936988
## 1313      0.4936951
## 1314      0.4936782
## 1315      0.4936762
## 1316      0.4936752
## 1317      0.4936734
## 1318      0.4936627
## 1319      0.4936617
## 1320      0.4936343
## 1321      0.4936119
## 1322      0.4936056
## 1323      0.4935959
## 1324      0.4935892
## 1325      0.4935835
## 1326      0.4935792
## 1327      0.4935624
## 1328      0.4935482
## 1329      0.4935118
## 1330      0.4935115
## 1331      0.4935100
## 1332      0.4935061
## 1333      0.4934920
## 1334      0.4934606
## 1335      0.4934566
## 1336      0.4934537
## 1337      0.4934533
## 1338      0.4934493
## 1339      0.4934487
## 1340      0.4934482
## 1341      0.4934331
## 1342      0.4934328
## 1343      0.4934312
## 1344      0.4934140
## 1345      0.4933930
## 1346      0.4933906
## 1347      0.4933705
## 1348      0.4933658
## 1349      0.4933472
## 1350      0.4933355
## 1351      0.4933346
## 1352      0.4933031
## 1353      0.4932599
## 1354      0.4932477
## 1355      0.4932422
## 1356      0.4932393
## 1357      0.4932346
## 1358      0.4932227
## 1359      0.4932149
## 1360      0.4932091
## 1361      0.4932005
## 1362      0.4931851
## 1363      0.4931806
## 1364      0.4931703
## 1365      0.4931671
## 1366      0.4931659
## 1367      0.4931434
## 1368      0.4931287
## 1369      0.4931192
## 1370      0.4931084
## 1371      0.4930766
## 1372      0.4930755
## 1373      0.4930695
## 1374      0.4930635
## 1375      0.4930402
## 1376      0.4930198
## 1377      0.4929945
## 1378      0.4929904
## 1379      0.4929736
## 1380      0.4929701
## 1381      0.4929499
## 1382      0.4929407
## 1383      0.4929295
## 1384      0.4929294
## 1385      0.4929273
## 1386      0.4929248
## 1387      0.4929192
## 1388      0.4929088
## 1389      0.4929061
## 1390      0.4929053
## 1391      0.4928940
## 1392      0.4928925
## 1393      0.4928858
## 1394      0.4928562
## 1395      0.4928556
## 1396      0.4928459
## 1397      0.4928402
## 1398      0.4928274
## 1399      0.4928268
## 1400      0.4928057
## 1401      0.4928037
## 1402      0.4928007
## 1403      0.4927975
## 1404      0.4927858
## 1405      0.4927749
## 1406      0.4927657
## 1407      0.4927637
## 1408      0.4927311
## 1409      0.4927273
## 1410      0.4927218
## 1411      0.4927193
## 1412      0.4927097
## 1413      0.4926648
## 1414      0.4926439
## 1415      0.4926391
## 1416      0.4926270
## 1417      0.4926250
## 1418      0.4925900
## 1419      0.4925896
## 1420      0.4925669
## 1421      0.4925478
## 1422      0.4925470
## 1423      0.4925248
## 1424      0.4925056
## 1425      0.4925021
## 1426      0.4924901
## 1427      0.4924899
## 1428      0.4924836
## 1429      0.4924784
## 1430      0.4924717
## 1431      0.4924652
## 1432      0.4924557
## 1433      0.4924485
## 1434      0.4924482
## 1435      0.4924437
## 1436      0.4924210
## 1437      0.4924199
## 1438      0.4924188
## 1439      0.4924152
## 1440      0.4923967
## 1441      0.4923949
## 1442      0.4923935
## 1443      0.4923878
## 1444      0.4923791
## 1445      0.4923746
## 1446      0.4923560
## 1447      0.4923482
## 1448      0.4923356
## 1449      0.4923103
## 1450      0.4923100
## 1451      0.4923030
## 1452      0.4922965
## 1453      0.4922815
## 1454      0.4922729
## 1455      0.4922558
## 1456      0.4922393
## 1457      0.4922377
## 1458      0.4922339
## 1459      0.4922172
## 1460      0.4922100
## 1461      0.4921783
## 1462      0.4921757
## 1463      0.4921674
## 1464      0.4921648
## 1465      0.4921576
## 1466      0.4921478
## 1467      0.4921361
## 1468      0.4921277
## 1469      0.4921161
## 1470      0.4920912
## 1471      0.4920869
## 1472      0.4920754
## 1473      0.4920676
## 1474      0.4920643
## 1475      0.4920564
## 1476      0.4920564
## 1477      0.4920539
## 1478      0.4920482
## 1479      0.4920387
## 1480      0.4920334
## 1481      0.4920316
## 1482      0.4920246
## 1483      0.4920065
## 1484      0.4919988
## 1485      0.4919983
## 1486      0.4919977
## 1487      0.4919917
## 1488      0.4919852
## 1489      0.4919702
## 1490      0.4919676
## 1491      0.4919625
## 1492      0.4919593
## 1493      0.4919474
## 1494      0.4919013
## 1495      0.4918911
## 1496      0.4918887
## 1497      0.4918836
## 1498      0.4918439
## 1499      0.4918433
## 1500      0.4918391
## 1501      0.4918322
## 1502      0.4918275
## 1503      0.4918205
## 1504      0.4918108
## 1505      0.4918063
## 1506      0.4918043
## 1507      0.4918039
## 1508      0.4917676
## 1509      0.4917497
## 1510      0.4917455
## 1511      0.4917336
## 1512      0.4917335
## 1513      0.4917306
## 1514      0.4917289
## 1515      0.4917260
## 1516      0.4917197
## 1517      0.4917127
## 1518      0.4917104
## 1519      0.4917075
## 1520      0.4916950
## 1521      0.4916752
## 1522      0.4916651
## 1523      0.4916124
## 1524      0.4915808
## 1525      0.4915692
## 1526      0.4915692
## 1527      0.4915657
## 1528      0.4915420
## 1529      0.4915265
## 1530      0.4915022
## 1531      0.4915015
## 1532      0.4914702
## 1533      0.4914563
## 1534      0.4914413
## 1535      0.4914357
## 1536      0.4914300
## 1537      0.4914259
## 1538      0.4914173
## 1539      0.4914030
## 1540      0.4913996
## 1541      0.4913969
## 1542      0.4913916
## 1543      0.4913835
## 1544      0.4913787
## 1545      0.4913672
## 1546      0.4913605
## 1547      0.4913580
## 1548      0.4913455
## 1549      0.4913197
## 1550      0.4913173
## 1551      0.4913164
## 1552      0.4913115
## 1553      0.4913095
## 1554      0.4912997
## 1555      0.4912928
## 1556      0.4912922
## 1557      0.4912900
## 1558      0.4912793
## 1559      0.4912619
## 1560      0.4912549
## 1561      0.4912541
## 1562      0.4912513
## 1563      0.4912459
## 1564      0.4912358
## 1565      0.4912086
## 1566      0.4912029
## 1567      0.4912027
## 1568      0.4911923
## 1569      0.4911301
## 1570      0.4911253
## 1571      0.4911243
## 1572      0.4911192
## 1573      0.4911157
## 1574      0.4911022
## 1575      0.4911012
## 1576      0.4910999
## 1577      0.4910872
## 1578      0.4910731
## 1579      0.4910629
## 1580      0.4910530
## 1581      0.4910476
## 1582      0.4910125
## 1583      0.4910036
## 1584      0.4909962
## 1585      0.4909923
## 1586      0.4909878
## 1587      0.4909747
## 1588      0.4909562
## 1589      0.4909468
## 1590      0.4909436
## 1591      0.4909055
## 1592      0.4908984
## 1593      0.4908972
## 1594      0.4908836
## 1595      0.4908708
## 1596      0.4908575
## 1597      0.4908508
## 1598      0.4908462
## 1599      0.4908445
## 1600      0.4908278
## 1601      0.4908272
## 1602      0.4908232
## 1603      0.4908151
## 1604      0.4908106
## 1605      0.4907991
## 1606      0.4907979
## 1607      0.4907703
## 1608      0.4907556
## 1609      0.4907479
## 1610      0.4907223
## 1611      0.4907009
## 1612      0.4906741
## 1613      0.4906554
## 1614      0.4906548
## 1615      0.4906418
## 1616      0.4906403
## 1617      0.4906399
## 1618      0.4906087
## 1619      0.4906013
## 1620      0.4905953
## 1621      0.4905883
## 1622      0.4905334
## 1623      0.4905304
## 1624      0.4905299
## 1625      0.4905194
## 1626      0.4905133
## 1627      0.4905128
## 1628      0.4905034
## 1629      0.4904736
## 1630      0.4904707
## 1631      0.4904675
## 1632      0.4904625
## 1633      0.4904621
## 1634      0.4904368
## 1635      0.4904310
## 1636      0.4904210
## 1637      0.4904166
## 1638      0.4904000
## 1639      0.4903854
## 1640      0.4903543
## 1641      0.4903520
## 1642      0.4903447
## 1643      0.4903415
## 1644      0.4903388
## 1645      0.4903330
## 1646      0.4903000
## 1647      0.4902957
## 1648      0.4902847
## 1649      0.4902350
## 1650      0.4902141
## 1651      0.4902021
## 1652      0.4902013
## 1653      0.4901623
## 1654      0.4901601
## 1655      0.4901520
## 1656      0.4901398
## 1657      0.4901266
## 1658      0.4901191
## 1659      0.4901190
## 1660      0.4901072
## 1661      0.4900995
## 1662      0.4900832
## 1663      0.4900804
## 1664      0.4900717
## 1665      0.4900608
## 1666      0.4900511
## 1667      0.4900504
## 1668      0.4900387
## 1669      0.4900362
## 1670      0.4900341
## 1671      0.4900196
## 1672      0.4900130
## 1673      0.4900117
## 1674      0.4899836
## 1675      0.4899356
## 1676      0.4899324
## 1677      0.4899311
## 1678      0.4899195
## 1679      0.4899134
## 1680      0.4899062
## 1681      0.4898934
## 1682      0.4898592
## 1683      0.4898334
## 1684      0.4898246
## 1685      0.4898221
## 1686      0.4898136
## 1687      0.4898109
## 1688      0.4898047
## 1689      0.4898024
## 1690      0.4897776
## 1691      0.4897775
## 1692      0.4897664
## 1693      0.4897628
## 1694      0.4897619
## 1695      0.4897466
## 1696      0.4897386
## 1697      0.4897370
## 1698      0.4897314
## 1699      0.4897018
## 1700      0.4896879
## 1701      0.4896842
## 1702      0.4896812
## 1703      0.4896761
## 1704      0.4896680
## 1705      0.4896397
## 1706      0.4896326
## 1707      0.4896273
## 1708      0.4896236
## 1709      0.4895823
## 1710      0.4895642
## 1711      0.4895635
## 1712      0.4895551
## 1713      0.4895486
## 1714      0.4895459
## 1715      0.4895347
## 1716      0.4895262
## 1717      0.4895255
## 1718      0.4895012
## 1719      0.4894990
## 1720      0.4894980
## 1721      0.4894929
## 1722      0.4894662
## 1723      0.4894524
## 1724      0.4894411
## 1725      0.4894237
## 1726      0.4894216
## 1727      0.4894040
## 1728      0.4893971
## 1729      0.4893617
## 1730      0.4893590
## 1731      0.4893514
## 1732      0.4893358
## 1733      0.4893285
## 1734      0.4893277
## 1735      0.4893154
## 1736      0.4892961
## 1737      0.4892815
## 1738      0.4892787
## 1739      0.4892780
## 1740      0.4892763
## 1741      0.4892270
## 1742      0.4892218
## 1743      0.4892214
## 1744      0.4892184
## 1745      0.4892154
## 1746      0.4892096
## 1747      0.4892081
## 1748      0.4891989
## 1749      0.4891706
## 1750      0.4891317
## 1751      0.4891245
## 1752      0.4891137
## 1753      0.4890859
## 1754      0.4890706
## 1755      0.4890500
## 1756      0.4890399
## 1757      0.4890270
## 1758      0.4890244
## 1759      0.4890146
## 1760      0.4889964
## 1761      0.4889875
## 1762      0.4889623
## 1763      0.4889354
## 1764      0.4889312
## 1765      0.4889250
## 1766      0.4889240
## 1767      0.4888935
## 1768      0.4888682
## 1769      0.4888663
## 1770      0.4888608
## 1771      0.4888504
## 1772      0.4888413
## 1773      0.4888239
## 1774      0.4888133
## 1775      0.4888079
## 1776      0.4887613
## 1777      0.4887514
## 1778      0.4887418
## 1779      0.4887405
## 1780      0.4887398
## 1781      0.4887380
## 1782      0.4887132
## 1783      0.4887089
## 1784      0.4887074
## 1785      0.4886883
## 1786      0.4886529
## 1787      0.4886520
## 1788      0.4886349
## 1789      0.4886268
## 1790      0.4886255
## 1791      0.4886181
## 1792      0.4886060
## 1793      0.4885993
## 1794      0.4885725
## 1795      0.4885045
## 1796      0.4885006
## 1797      0.4884921
## 1798      0.4884910
## 1799      0.4884866
## 1800      0.4884866
## 1801      0.4884833
## 1802      0.4884797
## 1803      0.4884724
## 1804      0.4884671
## 1805      0.4884656
## 1806      0.4884626
## 1807      0.4884507
## 1808      0.4884368
## 1809      0.4884354
## 1810      0.4884189
## 1811      0.4883949
## 1812      0.4883695
## 1813      0.4883476
## 1814      0.4883039
## 1815      0.4882897
## 1816      0.4882569
## 1817      0.4882561
## 1818      0.4882122
## 1819      0.4882062
## 1820      0.4881867
## 1821      0.4881520
## 1822      0.4881514
## 1823      0.4881438
## 1824      0.4881426
## 1825      0.4881349
## 1826      0.4881322
## 1827      0.4881301
## 1828      0.4881239
## 1829      0.4881232
## 1830      0.4881010
## 1831      0.4880479
## 1832      0.4880306
## 1833      0.4880106
## 1834      0.4880059
## 1835      0.4880047
## 1836      0.4880029
## 1837      0.4879696
## 1838      0.4879573
## 1839      0.4879539
## 1840      0.4879465
## 1841      0.4879033
## 1842      0.4878991
## 1843      0.4878714
## 1844      0.4878680
## 1845      0.4878654
## 1846      0.4878497
## 1847      0.4878443
## 1848      0.4878243
## 1849      0.4878237
## 1850      0.4877756
## 1851      0.4877667
## 1852      0.4877599
## 1853      0.4877556
## 1854      0.4877514
## 1855      0.4877497
## 1856      0.4877495
## 1857      0.4877388
## 1858      0.4877378
## 1859      0.4877208
## 1860      0.4877102
## 1861      0.4876945
## 1862      0.4876565
## 1863      0.4876368
## 1864      0.4876362
## 1865      0.4876163
## 1866      0.4875849
## 1867      0.4875802
## 1868      0.4875800
## 1869      0.4875761
## 1870      0.4875479
## 1871      0.4875298
## 1872      0.4875132
## 1873      0.4874857
## 1874      0.4874706
## 1875      0.4874585
## 1876      0.4874401
## 1877      0.4873936
## 1878      0.4873766
## 1879      0.4873686
## 1880      0.4873671
## 1881      0.4873607
## 1882      0.4873558
## 1883      0.4873372
## 1884      0.4873361
## 1885      0.4873118
## 1886      0.4872867
## 1887      0.4872774
## 1888      0.4872270
## 1889      0.4872201
## 1890      0.4871982
## 1891      0.4871958
## 1892      0.4871764
## 1893      0.4871665
## 1894      0.4871243
## 1895      0.4871126
## 1896      0.4870881
## 1897      0.4870732
## 1898      0.4870640
## 1899      0.4870503
## 1900      0.4870482
## 1901      0.4870450
## 1902      0.4870271
## 1903      0.4870217
## 1904      0.4869941
## 1905      0.4869874
## 1906      0.4869729
## 1907      0.4869193
## 1908      0.4868907
## 1909      0.4868895
## 1910      0.4868748
## 1911      0.4868657
## 1912      0.4868535
## 1913      0.4868436
## 1914      0.4868353
## 1915      0.4868313
## 1916      0.4867793
## 1917      0.4867781
## 1918      0.4867615
## 1919      0.4867403
## 1920      0.4867205
## 1921      0.4866979
## 1922      0.4866872
## 1923      0.4866798
## 1924      0.4866757
## 1925      0.4866613
## 1926      0.4866507
## 1927      0.4866416
## 1928      0.4866370
## 1929      0.4866267
## 1930      0.4866101
## 1931      0.4866062
## 1932      0.4866032
## 1933      0.4865821
## 1934      0.4865687
## 1935      0.4865389
## 1936      0.4865317
## 1937      0.4865030
## 1938      0.4864694
## 1939      0.4864541
## 1940      0.4864488
## 1941      0.4864359
## 1942      0.4864065
## 1943      0.4863710
## 1944      0.4863652
## 1945      0.4863517
## 1946      0.4863184
## 1947      0.4863099
## 1948      0.4862876
## 1949      0.4862774
## 1950      0.4862590
## 1951      0.4862473
## 1952      0.4862175
## 1953      0.4862097
## 1954      0.4862076
## 1955      0.4862067
## 1956      0.4861778
## 1957      0.4861698
## 1958      0.4861636
## 1959      0.4861603
## 1960      0.4861480
## 1961      0.4861311
## 1962      0.4861077
## 1963      0.4861000
## 1964      0.4860965
## 1965      0.4860917
## 1966      0.4860719
## 1967      0.4860129
## 1968      0.4860118
## 1969      0.4859841
## 1970      0.4859841
## 1971      0.4859839
## 1972      0.4859519
## 1973      0.4859403
## 1974      0.4859377
## 1975      0.4859360
## 1976      0.4859212
## 1977      0.4859096
## 1978      0.4859064
## 1979      0.4859048
## 1980      0.4859034
## 1981      0.4858979
## 1982      0.4858904
## 1983      0.4858653
## 1984      0.4858603
## 1985      0.4857728
## 1986      0.4857350
## 1987      0.4856794
## 1988      0.4856747
## 1989      0.4856720
## 1990      0.4856555
## 1991      0.4856409
## 1992      0.4856369
## 1993      0.4856236
## 1994      0.4856126
## 1995      0.4856120
## 1996      0.4856073
## 1997      0.4855697
## 1998      0.4855654
## 1999      0.4855602
## 2000      0.4855532
## 2001      0.4855515
## 2002      0.4855504
## 2003      0.4854965
## 2004      0.4854857
## 2005      0.4854840
## 2006      0.4854836
## 2007      0.4854514
## 2008      0.4854449
## 2009      0.4854263
## 2010      0.4853983
## 2011      0.4853960
## 2012      0.4853853
## 2013      0.4853529
## 2014      0.4853316
## 2015      0.4852611
## 2016      0.4852605
## 2017      0.4852559
## 2018      0.4852360
## 2019      0.4852334
## 2020      0.4852055
## 2021      0.4852025
## 2022      0.4851930
## 2023      0.4851625
## 2024      0.4851512
## 2025      0.4851325
## 2026      0.4851263
## 2027      0.4851082
## 2028      0.4851072
## 2029      0.4850903
## 2030      0.4850736
## 2031      0.4850192
## 2032      0.4850010
## 2033      0.4849638
## 2034      0.4849510
## 2035      0.4849360
## 2036      0.4848384
## 2037      0.4848290
## 2038      0.4848163
## 2039      0.4847694
## 2040      0.4847581
## 2041      0.4847565
## 2042      0.4847421
## 2043      0.4847401
## 2044      0.4847398
## 2045      0.4847256
## 2046      0.4846116
## 2047      0.4845635
## 2048      0.4845406
## 2049      0.4845306
## 2050      0.4844990
## 2051      0.4844627
## 2052      0.4844554
## 2053      0.4844518
## 2054      0.4844478
## 2055      0.4844359
## 2056      0.4844308
## 2057      0.4844114
## 2058      0.4843937
## 2059      0.4843932
## 2060      0.4843915
## 2061      0.4843833
## 2062      0.4843750
## 2063      0.4843555
## 2064      0.4843266
## 2065      0.4843243
## 2066      0.4842855
## 2067      0.4842716
## 2068      0.4842443
## 2069      0.4842206
## 2070      0.4842205
## 2071      0.4842039
## 2072      0.4841678
## 2073      0.4841663
## 2074      0.4841661
## 2075      0.4841636
## 2076      0.4841051
## 2077      0.4841015
## 2078      0.4840992
## 2079      0.4840877
## 2080      0.4840793
## 2081      0.4840574
## 2082      0.4840470
## 2083      0.4840322
## 2084      0.4840226
## 2085      0.4840072
## 2086      0.4840046
## 2087      0.4839953
## 2088      0.4839887
## 2089      0.4839662
## 2090      0.4839575
## 2091      0.4839512
## 2092      0.4839469
## 2093      0.4839178
## 2094      0.4839062
## 2095      0.4838826
## 2096      0.4838483
## 2097      0.4838214
## 2098      0.4837958
## 2099      0.4837908
## 2100      0.4837640
## 2101      0.4837466
## 2102      0.4837436
## 2103      0.4837108
## 2104      0.4837076
## 2105      0.4837040
## 2106      0.4836900
## 2107      0.4836818
## 2108      0.4836443
## 2109      0.4836167
## 2110      0.4835515
## 2111      0.4835220
## 2112      0.4835002
## 2113      0.4834494
## 2114      0.4834451
## 2115      0.4834275
## 2116      0.4834159
## 2117      0.4834076
## 2118      0.4834026
## 2119      0.4833758
## 2120      0.4833649
## 2121      0.4833510
## 2122      0.4832992
## 2123      0.4832756
## 2124      0.4832110
## 2125      0.4831785
## 2126      0.4831464
## 2127      0.4831335
## 2128      0.4831203
## 2129      0.4831095
## 2130      0.4830894
## 2131      0.4829798
## 2132      0.4829722
## 2133      0.4829683
## 2134      0.4829423
## 2135      0.4828986
## 2136      0.4828502
## 2137      0.4828000
## 2138      0.4827994
## 2139      0.4827755
## 2140      0.4827741
## 2141      0.4827706
## 2142      0.4827393
## 2143      0.4827242
## 2144      0.4827127
## 2145      0.4826977
## 2146      0.4826914
## 2147      0.4826852
## 2148      0.4826780
## 2149      0.4826778
## 2150      0.4826513
## 2151      0.4826347
## 2152      0.4825893
## 2153      0.4825838
## 2154      0.4825545
## 2155      0.4825410
## 2156      0.4825260
## 2157      0.4825109
## 2158      0.4825109
## 2159      0.4824853
## 2160      0.4824783
## 2161      0.4824719
## 2162      0.4824717
## 2163      0.4824423
## 2164      0.4824258
## 2165      0.4824188
## 2166      0.4823580
## 2167      0.4823171
## 2168      0.4823151
## 2169      0.4823125
## 2170      0.4822860
## 2171      0.4822812
## 2172      0.4822769
## 2173      0.4822765
## 2174      0.4822255
## 2175      0.4821887
## 2176      0.4821754
## 2177      0.4821233
## 2178      0.4821127
## 2179      0.4821009
## 2180      0.4820787
## 2181      0.4820747
## 2182      0.4820523
## 2183      0.4820508
## 2184      0.4820373
## 2185      0.4820326
## 2186      0.4819289
## 2187      0.4818689
## 2188      0.4817392
## 2189      0.4817094
## 2190      0.4816555
## 2191      0.4816387
## 2192      0.4815704
## 2193      0.4814970
## 2194      0.4814945
## 2195      0.4814940
## 2196      0.4814631
## 2197      0.4814408
## 2198      0.4814203
## 2199      0.4814184
## 2200      0.4813859
## 2201      0.4813674
## 2202      0.4813462
## 2203      0.4813271
## 2204      0.4812683
## 2205      0.4812014
## 2206      0.4811923
## 2207      0.4811775
## 2208      0.4811276
## 2209      0.4811109
## 2210      0.4810994
## 2211      0.4810908
## 2212      0.4809451
## 2213      0.4809247
## 2214      0.4808876
## 2215      0.4808443
## 2216      0.4807926
## 2217      0.4807867
## 2218      0.4807531
## 2219      0.4806891
## 2220      0.4806813
## 2221      0.4806784
## 2222      0.4806697
## 2223      0.4805747
## 2224      0.4805512
## 2225      0.4804743
## 2226      0.4804598
## 2227      0.4803986
## 2228      0.4803462
## 2229      0.4803154
## 2230      0.4802894
## 2231      0.4802811
## 2232      0.4802764
## 2233      0.4802358
## 2234      0.4801222
## 2235      0.4800989
## 2236      0.4800793
## 2237      0.4800622
## 2238      0.4800297
## 2239      0.4799911
## 2240      0.4799908
## 2241      0.4799851
## 2242      0.4799373
## 2243      0.4798294
## 2244      0.4797716
## 2245      0.4797682
## 2246      0.4797631
## 2247      0.4797391
## 2248      0.4797124
## 2249      0.4796705
## 2250      0.4796664
## 2251      0.4796618
## 2252      0.4796039
## 2253      0.4795888
## 2254      0.4795835
## 2255      0.4794960
## 2256      0.4793986
## 2257      0.4793831
## 2258      0.4793292
## 2259      0.4793290
## 2260      0.4793103
## 2261      0.4792999
## 2262      0.4792981
## 2263      0.4791218
## 2264      0.4790184
## 2265      0.4790163
## 2266      0.4790048
## 2267      0.4790012
## 2268      0.4789782
## 2269      0.4789600
## 2270      0.4789310
## 2271      0.4788675
## 2272      0.4788405
## 2273      0.4788377
## 2274      0.4788267
## 2275      0.4787917
## 2276      0.4787771
## 2277      0.4787698
## 2278      0.4787271
## 2279      0.4787123
## 2280      0.4786923
## 2281      0.4786254
## 2282      0.4786221
## 2283      0.4785430
## 2284      0.4785047
## 2285      0.4785031
## 2286      0.4784925
## 2287      0.4783300
## 2288      0.4782714
## 2289      0.4782364
## 2290      0.4782140
## 2291      0.4781421
## 2292      0.4781314
## 2293      0.4780045
## 2294      0.4777156
## 2295      0.4777136
## 2296      0.4777136
## 2297      0.4776968
## 2298      0.4776119
## 2299      0.4775670
## 2300      0.4774845
## 2301      0.4774747
## 2302      0.4773331
## 2303      0.4771167
## 2304      0.4770998
## 2305      0.4770992
## 2306      0.4770924
## 2307      0.4770701
## 2308      0.4770680
## 2309      0.4768757
## 2310      0.4768663
## 2311      0.4768446
## 2312      0.4768206
## 2313      0.4767898
## 2314      0.4767484
## 2315      0.4767376
## 2316      0.4766363
## 2317      0.4765423
## 2318      0.4765363
## 2319      0.4765076
## 2320      0.4764464
## 2321      0.4763594
## 2322      0.4763396
## 2323      0.4761395
## 2324      0.4761254
## 2325      0.4761014
## 2326      0.4760906
## 2327      0.4760885
## 2328      0.4760716
## 2329      0.4760449
## 2330      0.4760255
## 2331      0.4760070
## 2332      0.4758927
## 2333      0.4758770
## 2334      0.4758601
## 2335      0.4758322
## 2336      0.4757809
## 2337      0.4756723
## 2338      0.4756186
## 2339      0.4756112
## 2340      0.4755935
## 2341      0.4755621
## 2342      0.4754968
## 2343      0.4753068
## 2344      0.4752886
## 2345      0.4752739
## 2346      0.4752347
## 2347      0.4751499
## 2348      0.4750801
## 2349      0.4749712
## 2350      0.4746951
## 2351      0.4746187
## 2352      0.4742925
## 2353      0.4741859
## 2354      0.4740891
## 2355      0.4740748
## 2356      0.4738935
## 2357      0.4738310
## 2358      0.4737851
## 2359      0.4737359
## 2360      0.4736340
## 2361      0.4735562
## 2362      0.4735497
## 2363      0.4735216
## 2364      0.4734928
## 2365      0.4734541
## 2366      0.4733735
## 2367      0.4733606
## 2368      0.4733208
## 2369      0.4732941
## 2370      0.4731925
## 2371      0.4730473
## 2372      0.4730137
## 2373      0.4727901
## 2374      0.4727867
## 2375      0.4726492
## 2376      0.4724684
## 2377      0.4723389
## 2378      0.4721882
## 2379      0.4720383
## 2380      0.4718814
## 2381      0.4718577
## 2382      0.4718409
## 2383      0.4716337
## 2384      0.4715985
## 2385      0.4714063
## 2386      0.4713055
## 2387      0.4712029
## 2388      0.4709909
## 2389      0.4709130
## 2390      0.4708419
## 2391      0.4708311
## 2392      0.4707836
## 2393      0.4707714
## 2394      0.4706128
## 2395      0.4705342
## 2396      0.4697792
## 2397      0.4696033
## 2398      0.4692183
## 2399      0.4689411
## 2400      0.4687771
## 2401      0.4686669
## 2402      0.4683205
## 2403      0.4683159
## 2404      0.4682157
## 2405      0.4680899
## 2406      0.4676354
## 2407      0.4673140
## 2408      0.4668370
## 2409      0.4667699
## 2410      0.4666180
## 2411      0.4656530
## 2412      0.4655330
## 2413      0.4654782
## 2414      0.4654592
## 2415      0.4652596
## 2416      0.4648587
## 2417      0.4647793
## 2418      0.4645829
## 2419      0.4637693
## 2420      0.4633393
## 2421      0.4630872
## 2422      0.4625594
## 2423      0.4625406
## 2424      0.4614034
## 2425      0.4600568
## 2426      0.4598886
## 2427      0.4594149
## 2428      0.4583686
## 2429      0.4578879
## 2430      0.4572674
## 2431      0.4566598
## 2432      0.4549509
## 2433      0.4517763
## 2434      0.4517419
## 2435      0.4512033
## 2436      0.4489315
## 2437      0.4426205
## 2438      0.4198447

Notice Sussex County in Virginia is more than two thirds male: this is because of two men’s prisons in the county.


2: Aggregating Data

Now that you know how to transform your data, you’ll want to know more about how to aggregate your data to make it more interpretable. You’ll learn a number of functions you can use to take many observations in your data and summarize them, including count, group_by, summarize, ungroup, and top_n.

Video: The count verb

Counting by region

The counties dataset contains columns for region, state, population, and the number of citizens, which we selected and saved as the counties_selected table. In this exercise, you’ll focus on the region column.

counties_selected <- counties %>%
  select(region, state, population, citizens)

Use count to find the number of counties in each region

counties_selected %>%
  count(region, sort=TRUE)
## # A tibble: 4 x 2
##   region            n
##   <fct>         <int>
## 1 South          1420
## 2 North Central  1055
## 3 West            448
## 4 Northeast       218

Since the results have been arranged, you can see that the South has the greatest number of counties.
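
As a rough illustration of what count() does, the sketch below compares it with the longer group_by() and summarize() form. The toy table and its values are hypothetical and only stand in for counties_selected.

```r
library(dplyr)

# Hypothetical toy data standing in for counties_selected
toy <- tibble(
  region = c("South", "South", "West", "South", "West", "Northeast"),
  county = c("A", "B", "C", "D", "E", "F")
)

# count() with sort = TRUE ...
by_count <- toy %>%
  count(region, sort = TRUE)

# ... gives the same rows as grouping, counting rows, and arranging by hand
by_hand <- toy %>%
  group_by(region) %>%
  summarize(n = n()) %>%
  arrange(desc(n))

by_count
by_hand
```

Both results list the regions from most to fewest rows, so count() can be read as a shortcut for this common pattern.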

Counting citizens by state

You can weight your count by particular variables rather than finding the number of counties. In this case, you’ll find the number of citizens in each state.

counties_selected <- counties %>%
  select(region, state, population, citizens)

Find number of counties per state, weighted by citizens

counties_selected %>%
  count(state, wt = citizens, sort = TRUE)
## # A tibble: 50 x 2
##    state                 n
##    <fct>             <int>
##  1 California     24280349
##  2 Texas          16864962
##  3 Florida        13933052
##  4 New York       13531404
##  5 Pennsylvania    9710416
##  6 Illinois        8979999
##  7 Ohio            8709050
##  8 Michigan        7380136
##  9 North Carolina  7107998
## 10 Georgia         6978660
## # ... with 40 more rows

From our result, we can see that California is the state with the most citizens.
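
A weighted count adds up the wt column within each group instead of counting rows. The sketch below, on hypothetical toy values, shows that it matches an explicit sum().

```r
library(dplyr)

# Hypothetical toy data: two states with made-up citizen counts
toy <- tibble(
  state    = c("X", "X", "Y", "Y", "Y"),
  citizens = c(100, 250, 40, 60, 30)
)

# Weighted count: n is the sum of citizens per state
weighted <- toy %>%
  count(state, wt = citizens, sort = TRUE)

# The same totals computed explicitly
summed <- toy %>%
  group_by(state) %>%
  summarize(n = sum(citizens)) %>%
  arrange(desc(n))

weighted
summed
```

Note that state Y has more rows but state X has the larger weighted total, which is why the weighted and unweighted rankings can differ.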

Mutating and counting

You can combine multiple verbs together to answer increasingly complicated questions of your data. For example: “What are the US states where the most people walk to work?”

You’ll use the walk column, which offers a percentage of people in each county that walk to work, to add a new column and count based on it.

counties_selected <- counties %>%
  select(state, population, walk)
counties_selected %>%
  # Add population_walk containing the total number of people who walk to work 
  mutate(population_walk = population * walk / 100) %>%
  # Count weighted by the new column
  count(state, wt = population_walk, sort=TRUE)
## # A tibble: 50 x 2
##    state                n
##    <fct>            <dbl>
##  1 New York      1237938.
##  2 California    1017964.
##  3 Pennsylvania   505397.
##  4 Texas          430793.
##  5 Illinois       400346.
##  6 Massachusetts  316765.
##  7 Florida        284723.
##  8 New Jersey     273047.
##  9 Ohio           266911.
## 10 Washington     239764.
## # ... with 40 more rows

We can see that while California had the largest total population, New York state has the largest number of people who walk to work.

Video: The group by, summarize and ungroup verbs

Summarizing

The summarize() verb is very useful for collapsing a large dataset into a single observation.

counties_selected <- counties %>%
  select(county, population, income, unemployment)

Summarize to find minimum population, maximum unemployment, and average income

counties_selected %>%
  summarise(min_population = min(population),
            max_unemployment = max(unemployment),
            average_income = mean(income))
##   min_population max_unemployment average_income
## 1             85             29.4             NA
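
The NA reported for average_income is what mean() returns when its input contains at least one missing value. A minimal sketch on hypothetical toy data (not the counties table) shows the effect of the na.rm argument:

```r
library(dplyr)

# Hypothetical toy data with one missing income value
toy <- tibble(
  county = c("A", "B", "C"),
  income = c(40000, NA, 60000)
)

# Without na.rm, a single NA makes the whole mean NA
with_na <- toy %>%
  summarize(average_income = mean(income))

# With na.rm = TRUE, missing values are dropped before averaging
without_na <- toy %>%
  summarize(average_income = mean(income, na.rm = TRUE))

with_na
without_na
```

The same argument is accepted by min(), max(), sum(), and median().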

If we wanted to take this a step further, we could use filter() to determine the specific counties that returned the value for min_population and max_unemployment.

counties_selected %>% 
  filter(population == min(population))
##    county population income unemployment
## 1 Kalawao         85  66250            0
counties_selected %>% 
  filter(unemployment == max(unemployment))
##   county population income unemployment
## 1 Corson       4149  31676         29.4

Summarizing by state

Another interesting column is land_area, which shows the land area in square miles. Here, you’ll summarize both population and land area by state, with the purpose of finding the density (in people per square mile).

counties_selected <- counties %>%
  select(state, county, population, land_area)

Add a density column, then sort in descending order

counties_selected %>%
  group_by(state) %>%
  summarize(total_area = sum(land_area),
            total_population = sum(population)) %>%
  mutate(density = total_population / total_area) %>%
  arrange(desc(density))
## # A tibble: 50 x 4
##    state         total_area total_population density
##    <fct>              <dbl>            <int>   <dbl>
##  1 New Jersey         7354.          8904413   1211.
##  2 Rhode Island       1034.          1053661   1019.
##  3 Massachusetts      7800.          6705586    860.
##  4 Connecticut        4842.          3593222    742.
##  5 Maryland           9707.          5930538    611.
##  6 Delaware           1949.           926454    475.
##  7 New York          47126.         19673174    417.
##  8 Florida           53625.         19645772    366.
##  9 Pennsylvania      44743.         12779559    286.
## 10 Ohio              40861.         11575977    283.
## # ... with 40 more rows

Looks like New Jersey and Rhode Island are the “most crowded” of the US states, with more than a thousand people per square mile.
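
The pipeline above sums area and population per state before dividing. The sketch below, using hypothetical numbers, suggests why that order matters: averaging per-county densities gives a different (and usually misleading) figure.

```r
library(dplyr)

# Hypothetical state with one small dense county and one large sparse county
toy <- tibble(
  state      = c("X", "X"),
  county     = c("Dense", "Sparse"),
  population = c(1000, 900),
  land_area  = c(10, 90)
)

comparison <- toy %>%
  group_by(state) %>%
  summarize(
    # Density of the state as a whole: total people over total area
    density_from_totals = sum(population) / sum(land_area),
    # Unweighted average of the two county densities
    mean_county_density = mean(population / land_area)
  )

comparison
```

Here the state-level density is 19 people per square mile, while the average of the county densities is 55, because the small county counts as much as the large one.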

Summarizing by state and region

You can group by multiple columns instead of grouping by one. Here, you’ll practice aggregating by state and region, and notice how useful it is for performing multiple aggregations in a row.

counties_selected <- counties %>%
  select(region, state, county, population)

Calculate the average_pop and median_pop columns

counties_selected %>%
  group_by(region, state) %>%
  summarize(total_pop = sum(population)) %>%
  summarize(average_pop = mean(total_pop), median_pop = median(total_pop))
## # A tibble: 4 x 3
##   region        average_pop median_pop
##   <fct>               <dbl>      <dbl>
## 1 North Central    5628866.    5580644
## 2 Northeast        5600438.    2461161
## 3 South            7369565.    4804098
## 4 West             5723364.    2798636

It looks like the South has the highest average_pop of 7369565, while the North Central region has the highest median_pop of 5580644.

Video: The top_n verb

Selecting a county from each region

Previously, you used the walk column, which offers a percentage of people in each county that walk to work, to add a new column and count to find the total number of people who walk to work in each state.

Now, you’re interested in finding the county within each region with the highest percentage of citizens who walk to work.

counties_selected <- counties %>%
  select(region, state, county, metro, population, walk)

Group by region and find the county with the highest percentage of citizens who walk to work

counties_selected %>%
  group_by(region) %>%
  top_n(1, walk)
## # A tibble: 4 x 6
## # Groups:   region [4]
##   region        state        county                 metro    population  walk
##   <fct>         <fct>        <fct>                  <fct>         <int> <dbl>
## 1 West          Alaska       Aleutians East Borough Nonmetro       3304  71.2
## 2 Northeast     New York     New York               Metro       1629507  20.7
## 3 North Central North Dakota McIntosh               Nonmetro       2759  17.5
## 4 South         Virginia     Lexington city         Nonmetro       7071  31.7

Notice that three of these four counties are low-population nonmetro counties, but that New York County (Manhattan) also shows up!
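
In dplyr 1.0.0 and later, top_n() is marked as superseded in favor of slice_max() and slice_min(). The sketch below, on hypothetical toy data, shows the two forms side by side. It assumes a dplyr version that provides slice_max().

```r
library(dplyr)

# Hypothetical toy data: two counties per region with made-up walk percentages
toy <- tibble(
  region = c("West", "West", "South", "South"),
  county = c("A", "B", "C", "D"),
  walk   = c(3.5, 12.0, 1.2, 0.8)
)

# The form used in this chapter
with_top_n <- toy %>%
  group_by(region) %>%
  top_n(1, walk)

# The newer equivalent: keep the row with the largest walk value per group
with_slice <- toy %>%
  group_by(region) %>%
  slice_max(walk, n = 1)

with_top_n
with_slice
```

Both keep county B for the West and county C for the South. The row order may differ between the two results.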

Finding the highest-income state in each region

You’ve been learning to combine multiple dplyr verbs together. Here, you’ll combine group_by(), summarize(), and top_n() to find the state in each region with the highest income.

When you group by multiple columns and then summarize, it’s important to remember that the summarize “peels off” one of the groups, but leaves the rest on. For example, if you group_by(X, Y) then summarize, the result will still be grouped by X.
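
The "peeling" behavior can be checked with group_vars(), which reports the current grouping columns. The toy table below is hypothetical.

```r
library(dplyr)

# Hypothetical toy data: counties nested in states nested in regions
toy <- tibble(
  region     = c("West", "West", "West", "South", "South"),
  state      = c("CA", "CA", "NV", "TX", "GA"),
  population = c(10, 20, 5, 30, 8)
)

# First summarize: one row per state, still grouped by region
by_state <- toy %>%
  group_by(region, state) %>%
  summarize(total_pop = sum(population))

group_vars(by_state)

# Second summarize: one row per region, no grouping left
by_region <- by_state %>%
  summarize(average_pop = mean(total_pop))

group_vars(by_region)
by_region
```

Recent dplyr versions print a message about the remaining grouping after the first summarize(). The .groups argument of summarize() can be used to control it explicitly.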

counties_selected <- counties %>%
  select(region, state, county, population, income)
counties_selected %>%
  group_by(region, state) %>%
  # Calculate average income
  summarise(average_income = mean(income)) %>%
  # Find the highest income state in each region
  top_n(1, average_income)
## # A tibble: 4 x 3
## # Groups:   region [4]
##   region        state        average_income
##   <fct>         <fct>                 <dbl>
## 1 North Central North Dakota         55575.
## 2 Northeast     New Jersey           73014.
## 3 South         Maryland             69200.
## 4 West          Hawaii               64879

From our results, we can see that New Jersey in the Northeast is the state with the highest average_income of 73014.

Using summarize, top_n, and count together

In this chapter, you’ve learned to use five dplyr verbs related to aggregation: count(), group_by(), summarize(), ungroup(), and top_n(). In this exercise, you’ll use all of them to answer a question: In how many states do more people live in metro areas than non-metro areas?

Recall that the metro column has one of the two values “Metro” (for high-density city areas) or “Nonmetro” (for suburban and country areas).

counties_selected <- counties %>%
  select(state, metro, population)

Count the states with more people in Metro or Nonmetro areas

counties_selected %>%
  group_by(state, metro) %>%
  summarize(total_pop = sum(population)) %>%
  top_n(1, total_pop) %>%
  ungroup() %>%
  count(metro)
## # A tibble: 1 x 2
##   metro     n
##   <fct> <int>
## 1 ""       50

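The expected result is that 44 states have more people living in Metro areas, and 6 states have more people living in Nonmetro areas. The output above instead shows a single blank category covering all 50 states, which indicates that the metro column was empty in the data used for this run.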
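
To see the intended shape of the result, here is the same pipeline run on a hypothetical toy table in which metro is populated. It uses ungroup() with no arguments, which removes all grouping so that count() tallies across states rather than within each state.

```r
library(dplyr)

# Hypothetical toy data: three states, each with a Metro and a Nonmetro row
toy <- tibble(
  state      = c("X", "X", "Y", "Y", "Z", "Z"),
  metro      = c("Metro", "Nonmetro", "Metro", "Nonmetro", "Metro", "Nonmetro"),
  population = c(900, 100, 200, 300, 50, 10)
)

result <- toy %>%
  group_by(state, metro) %>%
  # Total population per state and metro category (still grouped by state)
  summarize(total_pop = sum(population)) %>%
  # Keep the larger category within each state
  top_n(1, total_pop) %>%
  # Drop the state grouping before counting
  ungroup() %>%
  # Count how many states ended up in each category
  count(metro)

result
```

In this toy table two states are majority Metro and one is majority Nonmetro.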


3: Selecting and Transforming Data

Learn advanced methods to select and transform columns. Also learn about select helpers, which are functions that specify criteria for columns you want to choose, as well as the rename and transmute verbs.

Video: Selecting

Selecting columns

Using the select verb, we can answer interesting questions about our dataset by focusing in on related groups of columns. The colon (:) is useful for getting many columns at a time.

Glimpse the counties table

glimpse(counties)
## Observations: 3,141
## Variables: 40
## $ census_id          <int> 1001, 1003, 1005, 1007, 1009, 1011, 1013, 1015, ...
## $ state              <fct> Alabama, Alabama, Alabama, Alabama, Alabama, Ala...
## $ county             <fct> Autauga, Baldwin, Barbour, Bibb, Blount, Bullock...
## $ region             <fct> South, South, South, South, South, South, South,...
## $ metro              <fct> , , , , , , , , , , , , , , , , , , , , , , , , , 
## $ population         <int> 55221, 195121, 26932, 22604, 57710, 10678, 20354...
## $ men                <int> 26745, 95314, 14497, 12073, 28512, 5660, 9502, 5...
## $ women              <int> 28476, 99807, 12435, 10531, 29198, 5018, 10852, ...
## $ hispanic           <dbl> 2.6, 4.5, 4.6, 2.2, 8.6, 4.4, 1.2, 3.5, 0.4, 1.5...
## $ white              <dbl> 75.8, 83.1, 46.2, 74.5, 87.9, 22.2, 53.3, 73.0, ...
## $ black              <dbl> 18.5, 9.5, 46.7, 21.4, 1.5, 70.7, 43.8, 20.3, 40...
## $ native             <dbl> 0.4, 0.6, 0.2, 0.4, 0.3, 1.2, 0.1, 0.2, 0.2, 0.6...
## $ asian              <dbl> 1.0, 0.7, 0.4, 0.1, 0.1, 0.2, 0.4, 0.9, 0.8, 0.3...
## $ pacific            <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0...
## $ citizens           <int> 40725, 147695, 20714, 17495, 42345, 8057, 15581,...
## $ income             <int> 51281, 50254, 32964, 38678, 45813, 31938, 32229,...
## $ income_err         <int> 2391, 1263, 2973, 3995, 3141, 5884, 1793, 925, 2...
## $ income_per_cap     <int> 24974, 27317, 16824, 18431, 20532, 17580, 18390,...
## $ income_per_cap_err <int> 1080, 711, 798, 1618, 708, 2055, 714, 489, 1366,...
## $ poverty            <dbl> 12.9, 13.4, 26.7, 16.8, 16.7, 24.6, 25.4, 20.5, ...
## $ child_poverty      <dbl> 18.6, 19.2, 45.3, 27.9, 27.2, 38.4, 39.2, 31.6, ...
## $ professional       <dbl> 33.2, 33.1, 26.8, 21.5, 28.5, 18.8, 27.5, 27.3, ...
## $ service            <dbl> 17.0, 17.7, 16.1, 17.9, 14.1, 15.0, 16.6, 17.7, ...
## $ office             <dbl> 24.2, 27.1, 23.1, 17.8, 23.9, 19.7, 21.9, 24.2, ...
## $ construction       <dbl> 8.6, 10.8, 10.8, 19.0, 13.5, 20.1, 10.3, 10.5, 1...
## $ production         <dbl> 17.1, 11.2, 23.1, 23.7, 19.9, 26.4, 23.7, 20.4, ...
## $ drive              <dbl> 87.5, 84.7, 83.8, 83.2, 84.9, 74.9, 84.5, 85.3, ...
## $ carpool            <dbl> 8.8, 8.8, 10.9, 13.5, 11.2, 14.9, 12.4, 9.4, 11....
## $ transit            <dbl> 0.1, 0.1, 0.4, 0.5, 0.4, 0.7, 0.0, 0.2, 0.2, 0.2...
## $ walk               <dbl> 0.5, 1.0, 1.8, 0.6, 0.9, 5.0, 0.8, 1.2, 0.3, 0.6...
## $ other_transp       <dbl> 1.3, 1.4, 1.5, 1.5, 0.4, 1.7, 0.6, 1.2, 0.4, 0.7...
## $ work_at_home       <dbl> 1.8, 3.9, 1.6, 0.7, 2.3, 2.8, 1.7, 2.7, 2.1, 2.5...
## $ mean_commute       <dbl> 26.5, 26.4, 24.1, 28.8, 34.9, 27.5, 24.6, 24.1, ...
## $ employed           <int> 23986, 85953, 8597, 8294, 22189, 3865, 7813, 474...
## $ private_work       <dbl> 73.6, 81.5, 71.8, 76.8, 82.0, 79.5, 77.4, 74.1, ...
## $ public_work        <dbl> 20.9, 12.3, 20.8, 16.1, 13.5, 15.1, 16.2, 20.8, ...
## $ self_employed      <dbl> 5.5, 5.8, 7.3, 6.7, 4.2, 5.4, 6.2, 5.0, 2.8, 7.9...
## $ family_work        <dbl> 0.0, 0.4, 0.1, 0.4, 0.4, 0.0, 0.2, 0.1, 0.0, 0.5...
## $ unemployment       <dbl> 7.6, 7.5, 17.6, 8.3, 7.7, 18.0, 10.9, 12.3, 8.9,...
## $ land_area          <dbl> 594.44, 1589.78, 884.88, 622.58, 644.78, 622.81,...
counties %>%
  # Select state, county, population, and industry-related columns
  select(state, county, population, professional:production) %>%
  # Arrange service in descending order 
  arrange(desc(service))

Notice that when you select a group of related variables, it’s easy to find the insights you’re looking for.
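
The colon selects every column positioned between its two endpoints, inclusive. A small sketch on a one-row table (values taken from the first row of the glimpse output above, with only a subset of columns) makes the range visible:

```r
library(dplyr)

# One-row toy table with a subset of the counties columns, in the same order
toy <- tibble(
  state = "Alabama", county = "Autauga", population = 55221,
  professional = 33.2, service = 17.0, office = 24.2,
  construction = 8.6, production = 17.1, drive = 87.5
)

# professional:production covers the five industry columns; drive is left out
selected <- toy %>%
  select(state, county, population, professional:production)

names(selected)
```

Because the range depends on column position, reordering the columns of the table changes what the colon selects.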

Select helpers

In the video you learned about the select helper starts_with(). Another select helper is ends_with(), which finds the columns that end with a particular string.

counties %>%
  # Select the state, county, population, and those ending with "work"
  select(state, county, population, ends_with("work")) %>%
  # Filter for counties that have at least 50% of people engaged in public work
  filter(public_work >= 50)
##          state                     county population private_work public_work
## 1       Alaska       Kusilvak Census Area       7914         45.4        53.8
## 2       Alaska Lake and Peninsula Borough       1474         42.2        51.6
## 3       Alaska  Yukon-Koyukuk Census Area       5644         33.3        61.7
## 4   California                     Lassen      32645         42.6        50.5
## 5       Hawaii                    Kalawao         85         25.0        64.1
## 6 North Dakota                      Sioux       4380         32.9        56.8
## 7 South Dakota              Oglala Lakota      14153         29.5        66.2
## 8 South Dakota                       Todd       9942         34.4        55.0
## 9    Wisconsin                  Menominee       4451         36.8        59.1
##   family_work
## 1         0.3
## 2         0.2
## 3         0.0
## 4         0.1
## 5         0.0
## 6         0.1
## 7         0.0
## 8         0.8
## 9         0.4

It looks like only a few counties have more than half the population working for the government.
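
The select helpers match on column names only, so they can be tried on an empty-valued toy table. The sketch below uses a hypothetical one-row table whose column names mirror some of those in counties, and adds contains() as a third helper alongside the two mentioned above.

```r
library(dplyr)

# Hypothetical one-row table; only the column names matter here
toy <- tibble(
  state = "X", income = 1, income_err = 1, income_per_cap = 1,
  private_work = 1, public_work = 1, family_work = 1
)

starts <- names(select(toy, starts_with("income")))
ends   <- names(select(toy, ends_with("work")))
has    <- names(select(toy, contains("per")))

starts
ends
has
```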

Video: The rename verb

Renaming a column after count

The rename() verb is often useful for changing the name of a column that comes out of another verb, such as count(). In this exercise, you’ll rename the n column from count() (which you learned about in Chapter 2) to something more descriptive.

# Rename the n column to num_counties
counties %>%
  count(state) %>%
  rename(num_counties = n)
## # A tibble: 50 x 2
##    state       num_counties
##    <fct>              <int>
##  1 Alabama               67
##  2 Alaska                29
##  3 Arizona               15
##  4 Arkansas              75
##  5 California            58
##  6 Colorado              64
##  7 Connecticut            8
##  8 Delaware               3
##  9 Florida               67
## 10 Georgia              159
## # ... with 40 more rows

Notice the difference between column names in the output from the first step to the second step. Don’t forget, using rename() isn’t the only way to choose a new name for a column!
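
One such alternative is the name argument of count(), available in recent dplyr versions, which names the count column directly. A sketch on hypothetical toy data:

```r
library(dplyr)

# Hypothetical toy data: one row per county
toy <- tibble(state = c("AL", "AL", "AK"))

# Two steps: count, then rename
renamed <- toy %>%
  count(state) %>%
  rename(num_counties = n)

# One step: name the count column when counting
named <- toy %>%
  count(state, name = "num_counties")

renamed
named
```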

Renaming a column as part of a select

rename() isn’t the only way you can choose a new name for a column: you can also choose a name as part of a select().

# Select state, county, and poverty as poverty_rate
counties %>%
  select(state, county, poverty_rate = poverty)

As you can see, we were able to select the three columns of interest from our dataset, and rename one of those columns, using only the select() verb!
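
To see the two renaming routes side by side, here is a minimal sketch on a small made-up data frame. The name toy_counties and all of its values are illustrative assumptions, not part of the counties dataset. rename() keeps every column, while select() keeps only the columns you mention.

```r
library(dplyr)

# Hypothetical toy data standing in for the counties dataset
toy_counties <- data.frame(
  state = c("Alabama", "Alaska"),
  county = c("Autauga", "Anchorage"),
  poverty = c(12.9, 8.3),
  population = c(55221, 299107)
)

# rename() changes the name and keeps all four columns
renamed <- toy_counties %>%
  rename(poverty_rate = poverty)

# select() renames and drops every column that is not mentioned
selected <- toy_counties %>%
  select(state, county, poverty_rate = poverty)

names(renamed)
names(selected)
```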

Video: The transmute verb

Question: Choosing among verbs

Source: DataCamp

Recall, you can think of transmute() as a combination of select() and mutate(), since you are getting back a subset of columns, but you are transforming and changing them at the same time.

Using transmute

As you learned in the video, the transmute verb allows you to control which variables you keep, which variables you calculate, and which variables you drop.

counties %>%
  # Keep the state, county, and population columns, and add a density column
  transmute(state, county, population, density = population / land_area) %>%
  # Filter for counties with a population greater than one million 
  filter(population > 1000000) %>%
  # Sort density in ascending order 
  arrange(density)
##            state         county population    density
## 1     California San Bernardino    2094769   104.4411
## 2         Nevada          Clark    2035572   257.9472
## 3     California      Riverside    2298032   318.8841
## 4        Arizona       Maricopa    4018143   436.7480
## 5        Florida     Palm Beach    1378806   699.9868
## 6     California      San Diego    3223096   766.1943
## 7     Washington           King    2045756   966.9999
## 8          Texas         Travis    1121645  1132.7459
## 9        Florida   Hillsborough    1302884  1277.0743
## 10       Florida         Orange    1229039  1360.4142
## 11       Florida     Miami-Dade    2639042  1390.6382
## 12      Michigan        Oakland    1229503  1417.0332
## 13    California    Santa Clara    1868149  1448.0653
## 14          Utah      Salt Lake    1078958  1453.5728
## 15         Texas          Bexar    1825502  1472.3928
## 16    California     Sacramento    1465832  1519.5638
## 17       Florida        Broward    1843152  1523.5305
## 18    California   Contra Costa    1096068  1530.9495
## 19      New York        Suffolk    1501373  1646.1521
## 20  Pennsylvania      Allegheny    1231145  1686.3152
## 21 Massachusetts      Middlesex    1556116  1902.7610
## 22      Missouri      St. Louis    1001327  1971.8925
## 23      Maryland     Montgomery    1017859  2071.9776
## 24    California        Alameda    1584983  2144.7092
## 25     Minnesota       Hennepin    1197776  2163.6518
## 26         Texas        Tarrant    1914526  2216.8873
## 27          Ohio       Franklin    1215761  2284.4492
## 28    California    Los Angeles   10038388  2473.8011
## 29         Texas         Harris    4356362  2557.3309
## 30          Ohio       Cuyahoga    1263189  2762.9410
## 31         Texas         Dallas    2485003  2852.1291
## 32      Virginia        Fairfax    1128722  2886.9785
## 33      Michigan          Wayne    1778969  2906.4322
## 34    California         Orange    3116069  3941.5472
## 35      New York         Nassau    1354612  4757.6988
## 36      Illinois           Cook    5236393  5539.2223
## 37  Pennsylvania   Philadelphia    1555072 11596.3609
## 38      New York         Queens    2301139 21202.7919
## 39      New York          Bronx    1428357 33927.7197
## 40      New York          Kings    2595259 36645.8486
## 41      New York       New York    1629507 71375.6899

Looks like San Bernardino is the lowest-density county with a population above one million.

Question: Matching verbs to their definitions

We’ve learned a number of new verbs in this chapter that you can use to modify and change the variables you have.

  • rename: Leaves the columns you don’t mention alone; doesn’t allow you to calculate or change values.
  • transmute: Must mention all the columns you keep; allows you to calculate or change values.
  • mutate: Leaves the columns you don’t mention alone; allows you to calculate or change values.
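
The difference between mutate and transmute is easiest to see from the column names each one returns. The sketch below uses a hypothetical three-column table (toy and its values are made up for illustration).

```r
library(dplyr)

# Hypothetical three-column table used only for illustration
toy <- data.frame(
  county = c("A", "B"),
  women = c(40, 90),
  population = c(100, 200)
)

# mutate(): adds fraction_women and keeps every existing column
with_mutate <- toy %>%
  mutate(fraction_women = women / population)

# transmute(): keeps only the columns that are mentioned
with_transmute <- toy %>%
  transmute(county, fraction_women = women / population)

names(with_mutate)     # county, women, population, fraction_women
names(with_transmute)  # county, fraction_women
```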

Let’s continue practising using the verbs to gain a better understanding of the differences between them.

Choosing among the four verbs

In this chapter you’ve learned about the four verbs: select, mutate, transmute, and rename. Here, you’ll choose the appropriate verb for each situation. You won’t need to change anything inside the parentheses.

# Change the name of the unemployment column
counties %>%
  rename(unemployment_rate = unemployment)

# Keep the state and county columns, and the columns containing poverty
counties %>%
  select(state, county, contains("poverty"))

# Calculate the fraction_women column without dropping the other columns
counties %>%
  mutate(fraction_women = women / population)

# Keep only the state, county, and employment_rate columns
counties %>%
  transmute(state, county, employment_rate = employed / population)

Now you know which verb to choose depending on whether you want to keep, drop, rename, or change a variable in the dataset.


4: Case Study: The babynames Dataset

Work with a new dataset that represents the names of babies born in the United States each year. Learn how to use grouped mutates and window functions to ask and answer more complex questions about your data. And use a combination of dplyr and ggplot2 to make interesting graphs to further explore your data.

Video: The babynames data

library(babynames) 
## Warning: package 'babynames' was built under R version 3.5.3
library(dplyr) 
library(tidyr)
## Warning: package 'tidyr' was built under R version 3.5.3
babynames <- babynames %>% 
  select(year, name, number = n) %>% 
  group_by(year, name) %>% 
  summarise(number = sum(number)) %>% 
  # ungroup() %>% 
  arrange(year, name)
babynames
## # A tibble: 1,756,284 x 3
## # Groups:   year [138]
##     year name    number
##    <dbl> <chr>    <int>
##  1  1880 Aaron      102
##  2  1880 Ab           5
##  3  1880 Abbie       71
##  4  1880 Abbott       5
##  5  1880 Abby         6
##  6  1880 Abe         50
##  7  1880 Abel         9
##  8  1880 Abigail     12
##  9  1880 Abner       27
## 10  1880 Abraham     81
## # ... with 1,756,274 more rows

Filtering and arranging for one year

The dplyr verbs you’ve learned are useful for exploring data. For instance, you could find out the most common names in a particular year.

babynames %>%
  # Filter for the year 1990
  filter(year == 1990) %>%
  # Sort the number column in descending order 
  arrange(desc(number))
## # A tibble: 22,678 x 3
## # Groups:   year [1]
##     year name        number
##    <dbl> <chr>        <int>
##  1  1990 Michael      65560
##  2  1990 Christopher  52520
##  3  1990 Jessica      46615
##  4  1990 Ashley       45797
##  5  1990 Matthew      44925
##  6  1990 Joshua       43382
##  7  1990 Brittany     36650
##  8  1990 Amanda       34504
##  9  1990 Daniel       33963
## 10  1990 David        33862
## # ... with 22,668 more rows

It looks like the most common names for babies born in the US in 1990 were Michael, Christopher, and Jessica.

Using top_n with babynames

You saw that you could use filter() and arrange() to find the most common names in one year. However, you could also use group_by and top_n to find the most common name in every year.

# Find the most common name in each year
babynames %>%
  group_by(year) %>%
  top_n(1, number)
## # A tibble: 138 x 3
## # Groups:   year [138]
##     year name  number
##    <dbl> <chr>  <int>
##  1  1880 John    9701
##  2  1881 John    8795
##  3  1882 John    9597
##  4  1883 John    8934
##  5  1884 John    9428
##  6  1885 Mary    9166
##  7  1886 Mary    9921
##  8  1887 Mary    9935
##  9  1888 Mary   11804
## 10  1889 Mary   11689
## # ... with 128 more rows

It looks like John was the most common name in 1880, and Mary was the most common name for a while after that.
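
In dplyr 1.0.0 and later, top_n() is marked as superseded in favour of slice_max() and slice_min(). A minimal sketch of the same per-year query, assuming a recent dplyr and a made-up table (toy_names is hypothetical), might look like this:

```r
library(dplyr)

# Hypothetical counts for two names over two years
toy_names <- data.frame(
  year = c(1880, 1880, 1881, 1881),
  name = c("John", "Mary", "John", "Mary"),
  number = c(10, 7, 8, 9)
)

# Keep the row with the largest number within each year
most_common <- toy_names %>%
  group_by(year) %>%
  slice_max(number, n = 1) %>%
  ungroup()

most_common
```

Like top_n(), slice_max() keeps ties by default, so a year with two equally common names would return two rows.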

Visualizing names with ggplot2

The dplyr package is very useful for exploring data, but it’s especially useful when combined with other tidyverse packages like ggplot2. (As of tidyverse 1.3.0, the following packages are included in the core tidyverse: dplyr, ggplot2, tidyr, readr, purrr, tibble, stringr, forcats. To make sure you are able to access the package, install all the packages in the tidyverse by running install.packages("tidyverse"), then run library(tidyverse) to load the core tidyverse and make it available in your current R session.)

# Filter for the names Steven, Thomas, and Matthew 
selected_names <- babynames %>%
  filter(name %in% c("Steven", "Thomas", "Matthew"))

# Plot the names using a different color for each name
ggplot(selected_names, aes(x = year, y = number, color = name)) +
  geom_line()

It looks like names like Steven and Thomas were common in the 1950s, but Matthew became common more recently.

Video: Grouped mutates

Finding the year each name is most common

In an earlier video, you learned how to filter for a particular name to determine the frequency of that name over time. Now, you’re going to explore which year each name was the most common.

To do this, you’ll be combining the grouped mutate approach with a top_n.

# Calculate the fraction of people born each year with the same name
babynames %>%
  group_by(year) %>%
  mutate(year_total = sum(number)) %>%
  ungroup() %>%
  mutate(fraction = number / year_total) %>%
# Find the year each name is most common
  group_by(name) %>%
  top_n(1, fraction)
## # A tibble: 97,310 x 5
## # Groups:   name [97,310]
##     year name     number year_total  fraction
##    <dbl> <chr>     <int>      <int>     <dbl>
##  1  1880 Abbott        5     201484 0.0000248
##  2  1880 Abe          50     201484 0.000248 
##  3  1880 Adelbert     28     201484 0.000139 
##  4  1880 Adella       26     201484 0.000129 
##  5  1880 Agustus       5     201484 0.0000248
##  6  1880 Albert     1493     201484 0.00741  
##  7  1880 Albertus      5     201484 0.0000248
##  8  1880 Alcide        7     201484 0.0000347
##  9  1880 Alonzo      122     201484 0.000606 
## 10  1880 Amos        128     201484 0.000635 
## # ... with 97,300 more rows

Notice that the results are sorted by year, then name, so the first few entries are names beginning with the letter A whose most popular year was 1880.

Adding the total and maximum for each name

In the video, you learned how you could group by the year and use mutate() to add a total for that year.

In these exercises, you’ll learn to normalize by a different, but also interesting metric: you’ll divide each name by the maximum for that name. This means that every name will peak at 1.

Once you add new columns, the result will still be grouped by name. This splits it into 97,310 groups, which actually makes later steps like mutate slower.

babynames %>%
  group_by(name) %>%
  mutate(name_total = sum(number),
         name_max = max(number)) %>%
  # Ungroup the table 
  ungroup() %>%
  # Add the fraction_max column containing the number divided by the name maximum
  mutate(fraction_max = number / name_max)
## # A tibble: 1,756,284 x 6
##     year name    number name_total name_max fraction_max
##    <dbl> <chr>    <int>      <int>    <int>        <dbl>
##  1  1880 Aaron      102     579589    15411     0.00662 
##  2  1880 Ab           5        362       41     0.122   
##  3  1880 Abbie       71      21716      536     0.132   
##  4  1880 Abbott       5       1020       59     0.0847  
##  5  1880 Abby         6      57756     2048     0.00293 
##  6  1880 Abe         50       9158      280     0.179   
##  7  1880 Abel         9      50236     3245     0.00277 
##  8  1880 Abigail     12     357031    15948     0.000752
##  9  1880 Abner       27       7641      202     0.134   
## 10  1880 Abraham     81      88852     2575     0.0315  
## # ... with 1,756,274 more rows

This tells you, for example, that the name Abe was at 17.9% of its peak in the year 1880.
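
The same normalization can be checked by hand on a tiny table. The sketch below uses invented counts (toy_names is hypothetical) and confirms that every name reaches exactly 1 in its peak year.

```r
library(dplyr)

# Hypothetical counts for two names over three years
toy_names <- data.frame(
  year = c(1880, 1881, 1882, 1880, 1881, 1882),
  name = c("Abe", "Abe", "Abe", "Ann", "Ann", "Ann"),
  number = c(50, 100, 25, 10, 5, 20)
)

# Grouped mutate: name_max is computed separately for each name
normalized <- toy_names %>%
  group_by(name) %>%
  mutate(name_max = max(number)) %>%
  ungroup() %>%
  mutate(fraction_max = number / name_max)

# The largest fraction_max within each name
peaks <- normalized %>%
  group_by(name) %>%
  summarise(peak = max(fraction_max))

peaks
```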

Visualizing the normalized change in popularity

You picked a few names and calculated each of them as a fraction of their peak. This is a type of “normalizing” a name, where you’re focused on the relative change within each name rather than the overall popularity of the name.

In this exercise, you’ll visualize the normalized popularity of each name. Your work from the previous exercise, names_normalized, has been provided for you.

names_normalized <- babynames %>%
  group_by(name) %>%
  mutate(name_total = sum(number),
         name_max = max(number)) %>%
  ungroup() %>%
  mutate(fraction_max = number / name_max)

# Filter for the names Steven, Thomas, and Matthew
names_filtered <- names_normalized %>%
  filter(name %in% c("Steven", "Thomas", "Matthew"))

# Visualize these names over time
ggplot(names_filtered, aes(x = year, y = fraction_max, color = name)) +
  geom_line()

As you can see, the line for each name hits a peak at 1, although the peak year differs for each name.

Video: Window functions

The code shown at the end of the video (finding the variable difference for all names) contains a mistake. Can you spot it? Below is the corrected code.

# As we did before, but now naming it babynames_fraction
babynames_fraction <- babynames %>%
  group_by(year) %>% 
  mutate(year_total = sum(number)) %>% 
  ungroup() %>% 
  mutate(fraction = number / year_total)
# Just for Matthew
babynames_fraction %>% 
  filter(name == "Matthew") %>% 
  arrange(year) %>% 
  # Display change in prevalence from year to year
  mutate(difference = fraction - lag(fraction)) %>% 
  # Arrange in descending order of difference
  arrange(desc(difference))
## # A tibble: 138 x 6
##     year name    number year_total fraction difference
##    <dbl> <chr>    <int>      <int>    <dbl>      <dbl>
##  1  1981 Matthew  43531    3459182  0.0126    0.00154 
##  2  1983 Matthew  50531    3462826  0.0146    0.00138 
##  3  1971 Matthew  22653    3432585  0.00660   0.000982
##  4  1967 Matthew  13629    3395130  0.00401   0.000894
##  5  1973 Matthew  24658    3017412  0.00817   0.000835
##  6  1974 Matthew  27332    3040409  0.00899   0.000818
##  7  1978 Matthew  34468    3174268  0.0109    0.000741
##  8  1972 Matthew  23066    3143627  0.00734   0.000738
##  9  1968 Matthew  15915    3378876  0.00471   0.000696
## 10  1982 Matthew  46333    3507664  0.0132    0.000625
## # ... with 128 more rows
# For all names
babynames_fraction %>% 
  group_by(name) %>% 
  # Display change in prevalence from year to year
  mutate(difference = fraction - lag(fraction)) %>% 
  # Arrange by name, then year
  arrange(name, year)
## # A tibble: 1,756,284 x 6
## # Groups:   name [97,310]
##     year name  number year_total   fraction difference
##    <dbl> <chr>  <int>      <int>      <dbl>      <dbl>
##  1  2007 Aaban      5    3994007 0.00000125   NA      
##  2  2009 Aaban      6    3815638 0.00000157    3.21e-7
##  3  2010 Aaban      9    3690700 0.00000244    8.66e-7
##  4  2011 Aaban     11    3651914 0.00000301    5.74e-7
##  5  2012 Aaban     11    3650462 0.00000301    1.20e-9
##  6  2013 Aaban     14    3637310 0.00000385    8.36e-7
##  7  2014 Aaban     16    3696311 0.00000433    4.80e-7
##  8  2015 Aaban     15    3688687 0.00000407   -2.62e-7
##  9  2016 Aaban      9    3652968 0.00000246   -1.60e-6
## 10  2017 Aaban     11    3546301 0.00000310    6.38e-7
## # ... with 1,756,274 more rows

Using ratios to describe the frequency of a name

In the video, you learned how to find the difference in the frequency of a baby name between consecutive years. What if instead of finding the difference, you wanted to find the ratio?

You’ll start with the babynames_fraction data already, so that you can consider the popularity of each name within each year.

babynames_ratio <- babynames_fraction %>%
  # Arrange the data in order of name, then year 
  arrange(name, year) %>%
  # Group the data by name
  group_by(name) %>%
  # Add a ratio column that contains the ratio between each year 
  mutate(ratio = fraction / lag(fraction))

Notice that the first observation for each name is missing a ratio, since there is no previous year.
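
A small worked example makes the missing first ratio visible. This is a sketch on invented fractions (toy_fraction is hypothetical), showing that lag() restarts within each group.

```r
library(dplyr)

# Hypothetical yearly fractions for two names
toy_fraction <- data.frame(
  name = c("Ann", "Ann", "Ann", "Bob", "Bob"),
  year = c(2000, 2001, 2002, 2000, 2001),
  fraction = c(0.010, 0.020, 0.010, 0.004, 0.002)
)

toy_ratio <- toy_fraction %>%
  arrange(name, year) %>%
  group_by(name) %>%
  # lag() looks one row back within the current name only
  mutate(ratio = fraction / lag(fraction)) %>%
  ungroup()

toy_ratio$ratio
```

The first row of each name has no previous row to divide by, so its ratio is NA; the remaining values are 2 and 0.5 for Ann and 0.5 for Bob.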

Biggest jumps in a name

Previously, you added a ratio column describing how the frequency of a baby name changed between consecutive years. Now, you’ll look at a subset of that data, called babynames_ratios_filtered, to look further into the names that experienced the biggest jumps in popularity in consecutive years.

babynames_ratios_filtered <- babynames_ratio %>%
                     filter(fraction >= 0.00001)
babynames_ratios_filtered %>%
  # Extract the largest ratio from each name 
  top_n(1, ratio) %>%
  # Sort the ratio column in descending order 
  arrange(desc(ratio)) %>%
  # Filter for fractions greater than or equal to 0.001
  filter(fraction >= 0.001)
## # A tibble: 155 x 6
## # Groups:   name [155]
##     year name    number year_total fraction ratio
##    <dbl> <chr>    <int>      <int>    <dbl> <dbl>
##  1  1957 Tammy     4398    4200007  0.00105 16.3 
##  2  1912 Woodrow   1854     988064  0.00188  9.99
##  3  1931 Marlene   2599    2104071  0.00124  8.97
##  4  1898 Dewey     1219     381458  0.00320  6.48
##  5  2010 Bentley   4001    3690700  0.00108  6.21
##  6  1884 Grover     809     243462  0.00332  5.68
##  7  1984 Jenna     5898    3487820  0.00169  5.01
##  8  1991 Mariah    5200    3894329  0.00134  4.78
##  9  1943 Cheryl    2894    2822127  0.00103  4.75
## 10  1989 Ethan     4067    3843559  0.00106  4.37
## # ... with 145 more rows

Some of these can be interpreted: for example, Grover Cleveland was a president elected in 1884.

Video: Congratulations!

You’ll find all of these skills valuable in these other DataCamp courses:

  • Exploratory Data Analysis in R: Case Study
  • Working with Data in the Tidyverse
  • Machine Learning in the Tidyverse
  • Categorical Data in the Tidyverse

Module 5: Cleaning Data in R

It’s commonly said that data scientists spend 80% of their time cleaning and manipulating data and only 20% of their time analyzing it. The time spent cleaning is vital since analyzing dirty data can lead you to draw inaccurate conclusions.

In this course, you’ll learn how to clean dirty data. Using R, you’ll learn how to identify values that don’t look right and fix dirty data by converting data types, filling in missing values, and using fuzzy string matching. As you learn, you’ll brush up on your skills by working with real-world datasets, including bike-share trips, customer asset portfolios, and restaurant reviews—developing the skills you need to go from raw data to awesome insights as quickly and accurately as possible!

1: Common Data Problems

In this chapter, you’ll learn how to overcome some of the most common dirty data problems. You’ll convert data types, apply range constraints to remove future data points, and remove duplicated data points to avoid double-counting.

Video: Data type constraints

Errors can be introduced by typos and misspellings. Dirty data can enter the data science workflow before we even access the data, and if we don’t address these errors early on, they can follow us all the way through the workflow.

Errors appearing throughout the data science workflow. Source: DataCamp

A quick reminder of data type constraints:

Data types vs. data types in R. Source: DataCamp

Question: Common data types

Solution. Source: DataCamp

Correctly identifying the type of your data is one of the easiest ways to avoid hampering your analysis with data type problems in the long run.

Converting data types

Throughout this chapter, you’ll be working with San Francisco bike share ride data called bike_share_rides. It contains information on start and end stations of each trip, the trip duration, and some user information.

Before beginning to analyze any dataset, it’s important to take a look at the different types of columns you’ll be working with, which you can do using glimpse().

In this exercise, you’ll take a look at the data types contained in bike_share_rides and see how an incorrect data type can flaw your analysis.

dplyr was loaded earlier, so only assertive needs to be loaded here. bike_share_rides is not available, so code that requires it is shown but not run.

# Load assertive package
library(assertive)
# Glimpse at bike_share_rides
glimpse(bike_share_rides)

# Summary of user_birth_year
summary(bike_share_rides$user_birth_year)

Question

The summary statistics of user_birth_year don’t seem to offer much useful information about the different birth years in our dataset. Why do you think that is?

  • The user_birth_year column is not of the correct type and should be converted to a character.

  • The user_birth_year column has an infinite set of possible values and should be converted to a factor.

  • The user_birth_year column represents groupings of data and should be converted to a factor. [Correct]

# Convert user_birth_year to factor: user_birth_year_fct
bike_share_rides <- bike_share_rides %>%
  mutate(user_birth_year_fct = as.factor(user_birth_year))

# Assert user_birth_year_fct is a factor
assert_is_factor(bike_share_rides$user_birth_year_fct)

# Summary of user_birth_year_fct
summary(bike_share_rides$user_birth_year_fct)

Looking at the new summary statistics, more riders were born in 1988 than any other year.

Trimming strings

In the previous exercise, you were able to identify the correct data type and convert user_birth_year to the correct type, allowing you to extract counts that gave you a bit more insight into the dataset.

Another common dirty data problem is having extra bits like percent signs or periods in numbers, causing them to be read in as characters. In order to be able to crunch these numbers, the extra bits need to be removed and the numbers need to be converted from character to numeric. In this exercise, you’ll need to convert the duration column from character to numeric, but before this can happen, the word "minutes" needs to be removed from each value.

# Load stringr package
library(stringr)
## Warning: package 'stringr' was built under R version 3.5.3
bike_share_rides <- bike_share_rides %>%
  # Remove 'minutes' from duration: duration_trimmed
  mutate(duration_trimmed = str_remove(duration, "minutes"),
         # Convert duration_trimmed to numeric: duration_mins
         duration_mins = as.numeric(duration_trimmed))

# Glimpse at bike_share_rides
glimpse(bike_share_rides)

# Assert duration_mins is numeric
assert_is_numeric(bike_share_rides$duration_mins)

# Calculate mean duration
mean(bike_share_rides$duration_mins)

By removing characters and converting to a numeric type, you were able to figure out that the average ride duration is about 13 minutes - not bad for a city like San Francisco!
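
Since bike_share_rides is not available here, the same cleaning steps can be tried on a short made-up vector (the duration values below are hypothetical). Note that str_remove() only deletes the matched text, so "12 minutes" becomes "12 " with a trailing space; the sketch trims it explicitly with str_trim() before converting.

```r
library(stringr)

# Hypothetical raw values in the same shape as the duration column
duration <- c("12 minutes", "8 minutes", "25 minutes")

# Remove the unit, then strip the leftover whitespace
duration_trimmed <- str_trim(str_remove(duration, "minutes"))

# Convert the cleaned strings to numbers
duration_mins <- as.numeric(duration_trimmed)

mean(duration_mins)
```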

Video: Range constraints

What to do about values outside of a variable’s allowed range?

  • We could remove these rows from the data frame, but this should only be done when a small proportion of the values are out of range; otherwise we would significantly increase the amount of bias in our dataset.
  • Treat these values as missing (NA).
  • Replace them with the range limit, e.g. replace a rating of 6 on a 1 to 5 rating scale with 5.
  • Replace with other value based on domain knowledge and/or knowledge of dataset, e.g. replace with average rating.
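
The four options above can be sketched in base R on a made-up vector of ratings. The values are hypothetical, and using the mean of the valid ratings for the last option is just one possible choice.

```r
# Hypothetical ratings on a 1 to 5 scale; 6 and 9 are out of range
rating <- c(4, 6, 2, 9, 5)
out_of_range <- rating < 1 | rating > 5

# Option 1: drop the offending values
dropped <- rating[!out_of_range]

# Option 2: treat them as missing
as_missing <- replace(rating, out_of_range, NA)

# Option 3: clamp to the range limits
clamped <- pmin(pmax(rating, 1), 5)

# Option 4: replace with the mean of the valid values
with_mean <- replace(rating, out_of_range, mean(rating[!out_of_range]))
```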

Ride duration constraints

Values that are out of range can throw off an analysis, so it’s important to catch them early on. In this exercise, you’ll be examining the duration_min column more closely. Bikes are not allowed to be kept out for more than 24 hours, or 1440 minutes at a time, but issues with some of the bikes caused inaccurate recording of the time they were returned.

In this exercise, you’ll replace erroneous data with the range limit (1440 minutes); however, you could just as easily replace these values with NAs.

# Create breaks
breaks <- c(min(bike_share_rides$duration_min), 0, 1440, max(bike_share_rides$duration_min))

# Create a histogram of duration_min
ggplot(bike_share_rides, aes(duration_min)) +
  geom_histogram(breaks = breaks)

# duration_min_const: replace vals of duration_min > 1440 with 1440
bike_share_rides <- bike_share_rides %>%
  mutate(duration_min_const = replace(duration_min, duration_min > 1440, 1440))

# Make sure all values of duration_min_const are between 0 and 1440
assert_all_are_in_closed_range(bike_share_rides$duration_min_const, lower = 0, upper = 1440)

The method of replacing erroneous data with the range limit works well, but you could just as easily replace these values with NAs or something else instead.

Back to the future

Something has gone wrong and it looks like you have data with dates from the future, which is way outside of the date range you expected to be working with. To fix this, you’ll need to remove any rides from the dataset that have a date in the future. Before you can do this, the date column needs to be converted from a character to a Date. Having these as Date objects will make it much easier to figure out which rides are from the future, since R makes it easy to check if one Date object is before (<) or after (>) another.

# Load lubridate, which provides today()
library(lubridate)

# Convert date to Date type
bike_share_rides <- bike_share_rides %>%
  mutate(date = as.Date(date))

# Make sure all dates are in the past
assert_all_are_in_past(bike_share_rides$date)

# Filter for rides that occurred before or on today's date
bike_share_rides_past <- bike_share_rides %>%
  filter(date <= today())

# Make sure all dates from bike_share_rides_past are in the past
assert_all_are_in_past(bike_share_rides_past$date)

Handling data from the future like this is much easier than trying to verify the data’s correctness by time traveling.
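
The date comparison can also be tried without the bike share data. The sketch below uses three invented date strings and base R only, with Sys.Date() standing in for lubridate’s today().

```r
# Hypothetical ride dates stored as text; the last one lies in the future
ride_dates <- c("2017-03-01", "2018-11-15", "2999-01-01")

# Convert from character to Date so that <= compares calendar dates
dates <- as.Date(ride_dates)

# Keep only dates on or before the current date
past_dates <- dates[dates <= Sys.Date()]
past_dates
```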

Video: Uniqueness constraints

Duplicate entries. Source: DataCamp

Source of duplicates. Source: DataCamp

  • Full duplicates: Two rows have the exact same entries in every column.
  • Partial duplicates: Two rows have the exact same entries in some columns.
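
A tiny made-up table shows how the two kinds of duplicate differ in practice. The rides data frame and its values are hypothetical.

```r
library(dplyr)

# Hypothetical rides: rows 1 and 2 are full duplicates;
# rows 3 and 4 share a ride_id but differ in duration (partial duplicates)
rides <- data.frame(
  ride_id = c(1, 1, 2, 2, 3),
  duration_min = c(10, 10, 15, 18, 7)
)

# Full duplicates: rows that repeat an earlier row in every column
n_full <- sum(duplicated(rides))

# Partial duplicates: ride_ids that still repeat after full duplicates are gone
partial_ids <- rides %>%
  distinct() %>%
  count(ride_id) %>%
  filter(n > 1)

n_full
partial_ids
```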

Full duplicates

You’ve been notified that an update has been made to the bike sharing data pipeline to make it more efficient, but that duplicates are more likely to be generated as a result. To make sure that you can continue using the same scripts to run your weekly analyses about ride statistics, you’ll need to ensure that any duplicates in the dataset are removed first.

When multiple rows of a data frame share the same values for all columns, they’re full duplicates of each other. Removing duplicates like this is important, since having the same value repeated multiple times can alter summary statistics like the mean and median. Each ride, including its ride_id, should be unique.

# Count the number of full duplicates
sum(duplicated(bike_share_rides))

# Remove duplicates
bike_share_rides_unique <- distinct(bike_share_rides)

# Count the full duplicates in bike_share_rides_unique
sum(duplicated(bike_share_rides_unique))

Removing full duplicates will ensure that summary statistics aren’t altered by repeated data points.

Removing partial duplicates

Now that you’ve identified and removed the full duplicates, it’s time to check for partial duplicates. Partial duplicates are a bit trickier to deal with than full duplicates. In this exercise, you’ll first identify any partial duplicates and then practice the most common technique to deal with them, which involves dropping all partial duplicates, keeping only the first.

# Find duplicated ride_ids
bike_share_rides %>% 
  # Count the number of occurrences of each ride_id
  count(ride_id) %>% 
  # Filter for rows with a count > 1
  filter(n > 1)

# Remove full and partial duplicates
bike_share_rides_unique <- bike_share_rides %>%
  # Only based on ride_id instead of all cols
  distinct(ride_id, .keep_all = TRUE)

# Find duplicated ride_ids in bike_share_rides_unique
bike_share_rides_unique %>%
  # Count the number of occurrences of each ride_id
  count(ride_id) %>%
  # Filter for rows with a count > 1
  filter(n > 1)

It’s important to consider the data you’re working with before removing partial duplicates, since sometimes it’s expected that there will be partial duplicates in a dataset, such as if the same customer makes multiple purchases.

Aggregating partial duplicates

Another way of handling partial duplicates is to compute a summary statistic of the values that differ between partial duplicates, such as mean, median, maximum, or minimum. This can come in handy when you’re not sure how your data was collected and want an average, or if based on domain knowledge, you’d rather have too high of an estimate than too low of an estimate (or vice versa).

bike_share_rides %>%
  # Group by ride_id and date
  group_by(ride_id, date) %>%
  # Add duration_min_avg column
  mutate(duration_min_avg = mean(duration_min)) %>%
  # Remove duplicates based on ride_id and date, keep all cols
  distinct(ride_id, date, .keep_all = TRUE) %>%
  # Remove duration_min column
  select(-duration_min)

Aggregation of partial duplicates allows you to keep some information about all data points instead of keeping information about just one data point.
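
An alternative sketch of the same idea uses summarise() instead of mutate() followed by distinct(). This is a swapped-in variant, not the exercise’s approach: it returns only the grouping columns and the summary, so any other columns are dropped. The rides table below is made up, and the .groups argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Hypothetical rides; ride 2 was recorded twice with different durations
rides <- data.frame(
  ride_id = c(1, 2, 2, 3),
  date = as.Date(c("2017-03-01", "2017-03-02", "2017-03-02", "2017-03-03")),
  duration_min = c(10, 15, 19, 7)
)

# One row per ride_id and date, with the durations averaged
rides_avg <- rides %>%
  group_by(ride_id, date) %>%
  summarise(duration_min_avg = mean(duration_min), .groups = "drop")

rides_avg
```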


2: Categorical and Text Data

Categorical and text data can often be some of the messiest parts of a dataset due to their unstructured nature. In this chapter, you’ll learn how to fix whitespace and capitalization inconsistencies in category labels, collapse multiple categories into one, and reformat strings for consistency.

Video: Checking membership

In R, categories are stored as factors. Under the hood, factors are stored as integers, and each integer has a corresponding label.

Factors. Source: DataCamp

  • How many levels do the categorical variables above have?

Values that don’t belong. Source: DataCamp

How do we end up with values outside those allowed by our factors?

Source: DataCamp

Filtering joins are a type of join that keeps or removes observations from the first table, but doesn’t add any new columns.

Semi-joins vs. anti-joins. Source: DataCamp

Values that aren’t factors. Source: DataCamp

Question: Members only

So far in the course, you’ve learned about a number of different problems you can run into when you have dirty data, including

  • data type constraints,
  • range constraints,
  • uniqueness constraints, and
  • membership constraints.

It’s important to be able to correctly identify the type of problem you’re dealing with so that you can treat it correctly. In this exercise, you’ll practice identifying these problems by mapping dirty data scenarios to their constraint type.

Solution. Source: DataCamp

Being able to identify what kinds of errors are in your data is important so that you know how to go about fixing them.

Not a member

Now that you’ve practiced identifying membership constraint problems, it’s time to fix these problems in a new dataset. Throughout this chapter, you’ll be working with a dataset called sfo_survey, containing survey responses from passengers taking flights from San Francisco International Airport (SFO). Participants were asked questions about the airport’s cleanliness, wait times, safety, and their overall satisfaction.

There were a few issues during data collection that resulted in some inconsistencies in the dataset. In this exercise, you’ll be working with the dest_size column, which categorizes the size of the destination airport that the passengers were flying to. A data frame called dest_sizes is available that contains all the possible destination sizes. Your mission is to find rows with invalid dest_sizes and remove them from the data frame.

# Count the number of occurrences of dest_size
sfo_survey %>%
  count(dest_size)

# Find bad dest_size rows
sfo_survey %>% 
  # Join with dest_sizes data frame to get bad dest_size rows
  anti_join(dest_sizes, by = "dest_size") %>%
  # Select id, airline, destination, and dest_size cols
  select(id, airline, destination, dest_size)

# Remove bad dest_size rows
sfo_survey %>% 
  # Join with dest_sizes
  semi_join(dest_sizes, by = "dest_size") %>%
  # Count the number of each dest_size
  count(dest_size)

Anti-joins can help you identify the rows that are causing issues, and semi-joins can remove the issue-causing rows. In the next lesson, you’ll learn about other ways to deal with bad values so that you don’t have to lose rows of data.
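To see the two filtering joins side by side, here is a small sketch on made-up data. The survey_toy and allowed_sizes data frames and their values are hypothetical stand-ins for sfo_survey and dest_sizes.

```r
library(dplyr)

# Hypothetical survey rows and the allowed categories
survey_toy <- tibble(
  id = 1:4,
  dest_size = c("Small", "Huge", "Large", "Tiny")
)
allowed_sizes <- tibble(dest_size = c("Small", "Medium", "Large", "Hub"))

# anti_join() keeps rows of survey_toy with NO match in allowed_sizes
bad_rows <- survey_toy %>%
  anti_join(allowed_sizes, by = "dest_size")

# semi_join() keeps rows of survey_toy that DO have a match
good_rows <- survey_toy %>%
  semi_join(allowed_sizes, by = "dest_size")
```

Neither join adds columns from allowed_sizes; both only decide which rows of survey_toy survive.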

Video: Categorical data problems

Categorical data problems. Source: DataCamp

Identifying inconsistency

In the video exercise, you learned about different kinds of inconsistencies that can occur within categories, making it look like a variable has more categories than it should.

In this exercise, you’ll continue working with the sfo_survey dataset. You’ll examine the dest_size column again as well as the cleanliness column and determine what kind of issues, if any, these two categorical variables face.

# Count dest_size
sfo_survey %>%
  count(dest_size)

# Count cleanliness
sfo_survey %>%
  count(cleanliness)

In the next exercise, you’ll fix these inconsistencies to get more accurate counts.

Correcting inconsistency

Now that you’ve identified that dest_size has whitespace inconsistencies and cleanliness has capitalization inconsistencies, you’ll use the new tools at your disposal to fix the inconsistent values in sfo_survey instead of removing the data points entirely, which could add bias to your dataset if more than 5% of the data points need to be dropped.

# Add new columns to sfo_survey
sfo_survey <- sfo_survey %>%
  # dest_size_trimmed: dest_size without whitespace
  mutate(dest_size_trimmed = str_trim(dest_size),
         # cleanliness_lower: cleanliness converted to lowercase
         cleanliness_lower = str_to_lower(cleanliness))

# Count values of dest_size_trimmed
sfo_survey %>%
  count(dest_size_trimmed)

# Count values of cleanliness_lower
sfo_survey %>%
  count(cleanliness_lower)

You were able to convert seven-category data into four-category data, which will help your analysis go more smoothly.
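The effect of the two stringr helpers can be checked on plain character vectors. This is a small sketch; the labels below are invented for illustration and are not taken from sfo_survey.

```r
library(stringr)

# Hypothetical labels with stray whitespace and mixed capitalization
dest_size_raw <- c("Small", "  Small", "Large  ", "Large")
cleanliness_raw <- c("Clean", "clean", "DIRTY", "Dirty")

# str_trim() strips leading and trailing whitespace
dest_size_fixed <- str_trim(dest_size_raw)
# str_to_lower() converts every character to lowercase
cleanliness_fixed <- str_to_lower(cleanliness_raw)

# Compare the number of distinct labels before and after cleaning
length(unique(dest_size_raw))
length(unique(dest_size_fixed))
```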

Collapsing categories

One of the tablets that participants filled out the sfo_survey on was not properly configured, allowing the response for dest_region to be free text instead of a dropdown menu. This resulted in some inconsistencies in the dest_region variable that you’ll need to correct in this exercise to ensure that the numbers you report to your boss are as accurate as possible.

# Count categories of dest_region
sfo_survey %>%
  count(dest_region)

# Categories to map to Europe
europe_categories <- c("EU", "eur", "Europ")

# Add a new col dest_region_collapsed
sfo_survey %>%
  # Map all categories in europe_categories to Europe
  mutate(dest_region_collapsed = fct_collapse(dest_region, 
                                     Europe = europe_categories)) %>%
  # Count categories of dest_region_collapsed
  count(dest_region_collapsed)

You’ve reduced the number of categories from 12 to 9, and you can now be confident that 401 of the survey participants were heading to Europe.
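Here is a minimal sketch of fct_collapse() on a standalone factor. The regions vector is hypothetical; it only mimics the kind of free-text variants described above.

```r
library(forcats)

# Hypothetical free-text responses for a region question
regions <- factor(c("Europe", "EU", "eur", "Asia", "Europ", "Asia"))

# Fold the misspelled variants into the existing "Europe" level
regions_collapsed <- fct_collapse(regions, Europe = c("EU", "eur", "Europ"))

# Inspect the remaining levels and their counts
table(regions_collapsed)
```

Levels that are not named in the call, such as "Asia" here, are left unchanged.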

Video: Cleaning text data

Text data. Source: DataCamp

Unstructured data problems. Source: DataCamp

More complex text problems. Source: DataCamp

Detecting inconsistent text data

You’ve recently received some news that the customer support team wants to ask the SFO survey participants some follow-up questions. However, the auto-dialer that the call center uses isn’t able to parse all of the phone numbers since they’re all in different formats. After some investigation, you found that some phone numbers are written with hyphens (-) and some are written with parentheses ((, )). In this exercise, you’ll figure out which phone numbers have these issues so that you know which ones need fixing.

# Filter for rows with "-" in the phone column
sfo_survey %>%
  filter(str_detect(phone, "-"))

# Filter for rows with "(" or ")" in the phone column
sfo_survey %>%
  filter(str_detect(phone, fixed("(")) | str_detect(phone, fixed(")")))

Now that you’ve identified the inconsistencies in the phone column, it’s time to remove unnecessary characters to make the follow-up survey go as smoothly as possible.

Replacing and removing

In the last exercise, you saw that the phone column of sfo_survey is plagued with unnecessary parentheses and hyphens. The customer support team has requested that all phone numbers be in the format “123 456 7890”. In this exercise, you’ll use your new stringr skills to fulfill this request.

# Remove parentheses from phone column
phone_no_parens <- sfo_survey$phone %>%
  # Remove "("s
  str_remove_all(fixed("(")) %>%
  # Remove ")"s
  str_remove_all(fixed(")"))

# Add phone_no_parens as column
sfo_survey %>%
  mutate(phone_no_parens = phone_no_parens)

# Add phone_no_parens and phone_clean as columns
sfo_survey %>%
  mutate(phone_no_parens = phone_no_parens,
         # Replace all hyphens in phone_no_parens with spaces
         phone_clean = str_replace_all(phone_no_parens, "-", " "))

Now that your phone numbers are all in a single format, the machines in the call center will be able to auto-dial the numbers, making it easier to ask participants follow-up questions.
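The whole clean-up can be tried on a plain character vector. In this sketch the phone numbers are made up, and the regular-expression character class "[()]" is swapped in as an alternative to the two fixed() calls used above; it matches either parenthesis in one pass.

```r
library(stringr)

# Hypothetical phone numbers in three different formats
phones <- c("(415) 555-0134", "415-555-0199", "415 555 0101")

# Remove "(" and ")" in one step using a regex character class
phones_no_parens <- str_remove_all(phones, "[()]")

# Replace hyphens with spaces to reach the "123 456 7890" format
phones_clean <- str_replace_all(phones_no_parens, "-", " ")

phones_clean
# A cleaned number in this format has 12 characters
str_length(phones_clean)
```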

Invalid phone numbers

The customer support team is grateful for your work so far, but during their first day of calling participants, they ran into some phone numbers that were invalid. In this exercise, you’ll remove any rows with invalid phone numbers so that these faulty numbers don’t keep slowing the team down.

# Check out the invalid numbers
sfo_survey %>%
  filter(str_length(phone) != 12)

# Remove rows with invalid numbers
sfo_survey %>%
  filter(str_length(phone) == 12)

Thanks to your savvy string skills, the follow-up survey will be done in no time!


3: Advanced Data Problems

In this chapter, you’ll dive into more advanced data cleaning problems, such as ensuring that weights are all written in kilograms instead of pounds. You’ll also gain invaluable skills that will help you verify that values have been added correctly and that missing values don’t negatively impact your analyses.

Uniformity. Source: DataCamp

Sources of uniformity issues. Source: DataCamp

What to do about a lack of uniformity? Source: DataCamp

Date uniformity. Source: DataCamp

Ambiguous dates. Source: DataCamp

Date uniformity

In this chapter, you work at an asset management company and you’ll be working with the accounts dataset, which contains information about each customer, the amount in their account, and the date their account was opened. Your boss has asked you to calculate some summary statistics about the average value of each account and whether the age of the account is associated with a higher or lower account value. Before you can do this, you need to make sure that the accounts dataset you’ve been given doesn’t contain any uniformity problems. In this exercise, you’ll investigate the date_opened column and clean it up so that all the dates are in the same format.

# Check out the accounts data frame
head(accounts)

# Define the date formats
formats <- c("%Y-%m-%d", "%B %d, %Y")

# Convert dates to the same format
accounts %>%
  mutate(date_opened_clean = parse_date_time(date_opened, orders = formats))

Now that the date_opened dates are in the same format, you’ll be able to use them for some plotting in the next exercise.
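As a small sketch of how parse_date_time() handles mixed formats, the vector below holds two invented dates, one per format. It assumes an English locale, since %B matches full month names in the current locale.

```r
library(lubridate)

# Hypothetical dates written in two different formats
date_strings <- c("2003-10-19", "October 05, 2018")

# Each element is parsed with whichever format matches it
formats <- c("%Y-%m-%d", "%B %d, %Y")
parsed <- parse_date_time(date_strings, orders = formats)

# parse_date_time() returns date-times; convert to Date if times are not needed
as.Date(parsed)
```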

Currency uniformity

Now that your dates are in order, you’ll need to correct any unit differences. When you first plot the data, you’ll notice that there’s a group of very high values, and a group of relatively lower values. The bank has two different offices - one in New York, and one in Tokyo, so you suspect that the accounts managed by the Tokyo office are in Japanese yen instead of U.S. dollars. Luckily, you have a data frame called account_offices that indicates which office manages each customer’s account, so you can use this information to figure out which totals need to be converted from yen to dollars.

The formula to convert yen to dollars is USD = JPY / 104.

# Scatter plot of opening date and total amount
accounts %>%
  ggplot(aes(x = date_opened, y = total)) +
  geom_point()

# Left join accounts and account_offices by id
accounts %>%
  left_join(account_offices, by="id") %>%
  # Convert totals from the Tokyo office to USD
  mutate(total_usd = ifelse(office == "Tokyo", total / 104, total)) %>%
  # Scatter plot of opening date vs total_usd
  ggplot(aes(x = date_opened, y = total_usd)) +
    geom_point()

The points in your last scatter plot all fall within a much smaller range now and you’ll be able to accurately assess the differences between accounts from different countries.
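The conversion step can be sketched on a three-row toy example. The account ids and totals are hypothetical; the rate of 104 yen per dollar is the one given above.

```r
library(dplyr)

# Hypothetical accounts and the office that manages each one
accounts_toy <- tibble(
  id = c("A1", "B2", "C3"),
  total = c(1000, 208000, 500)
)
offices_toy <- tibble(
  id = c("A1", "B2", "C3"),
  office = c("New York", "Tokyo", "New York")
)

accounts_usd <- accounts_toy %>%
  left_join(offices_toy, by = "id") %>%
  # Only Tokyo totals are in yen, so only those are divided by 104
  mutate(total_usd = ifelse(office == "Tokyo", total / 104, total))

accounts_usd
```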

Video: Cross field validation

  • Cross field validation
  • Does this value make sense based on other values?

Link: https://www.buzzfeednews.com/article/katienotopoulos/graphs-that-lied-to-us

What to do with nonsense data. Source: DataCamp

(Note: to impute a value means to fill in a missing or invalid value with an estimate inferred from other available data, such as the mean of the column or a value calculated from related fields.)

Validating totals

In this lesson, you’ll continue to work with the accounts data frame, but this time, you have a bit more information about each account. There are three different funds that account holders can store their money in. In this exercise, you’ll validate whether the total amount in each account is equal to the sum of the amount in fund_A, fund_B, and fund_C. If there are any accounts that don’t match up, you can look into them further to see what went wrong in the bookkeeping that led to inconsistencies.

# Find invalid totals
accounts %>%
  # theoretical_total: sum of the three funds
  mutate(theoretical_total = fund_A + fund_B + fund_C) %>%
  # Find accounts where total doesn't match theoretical_total
  filter(theoretical_total != total)

By using cross field validation, you’ve been able to detect values that don’t make sense. How you choose to handle these values will depend on the dataset.
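One caveat worth sketching: when the fund amounts are decimals, an exact != comparison can flag rows that differ only by floating-point rounding. The toy data and the tolerance of 1e-8 below are assumptions for illustration, not part of the original exercise.

```r
library(dplyr)

# Hypothetical accounts: row 1 is correct up to rounding, row 3 is truly wrong
funds_toy <- tibble(
  id = 1:3,
  fund_A = c(0.1, 100, 50),
  fund_B = c(0.2, 200, 50),
  fund_C = c(0, 300, 50),
  total = c(0.3, 600, 175)
)

# Exact comparison: also flags row 1, because 0.1 + 0.2 is not exactly 0.3
flagged_exact <- funds_toy %>%
  filter(fund_A + fund_B + fund_C != total)

# Tolerant comparison: flags only differences larger than the tolerance
flagged_tolerant <- funds_toy %>%
  filter(abs(fund_A + fund_B + fund_C - total) > 1e-8)
```

If the amounts are whole numbers, the exact comparison used in the exercise behaves as expected.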

Validating age

Now that you found some inconsistencies in the total amounts, you’re suspicious that there may also be inconsistencies in the acct_age column, and you want to see if these inconsistencies are related. Using the skills you learned from the video exercise, you’ll need to validate the age of each account and see if rows with inconsistent acct_ages are the same ones that had inconsistent totals.

# Find invalid acct_age
accounts %>%
  # theoretical_age: age of acct based on date_opened
  mutate(theoretical_age = floor(as.numeric(date_opened %--% today(), "years"))) %>%
  # Filter for rows where acct_age is different from theoretical_age
  filter(theoretical_age != acct_age)

There are three accounts that all have ages off by one year, but none of them are the same as the accounts that had total inconsistencies, so it looks like these two bookkeeping errors may not be related.
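Because today() changes every day, the age calculation is easier to reason about with a fixed reference date. The sketch below uses invented opening dates and a hypothetical reference date in place of today().

```r
library(lubridate)

# Hypothetical opening dates and a fixed reference date
date_opened <- ymd(c("2010-06-15", "2015-01-01"))
reference_date <- ymd("2020-06-14")

# %--% builds an interval; as.numeric(..., "years") measures its length in years
age_years <- as.numeric(date_opened %--% reference_date, "years")

# floor() keeps only completed years
theoretical_age <- floor(age_years)
theoretical_age
```

The first account is one day short of its tenth anniversary on the reference date, so it still counts as nine completed years.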

Video: Completeness

What is missing data. Source: DataCamp

Types of missingness. Source: DataCamp

Dealing with missing data. Source: DataCamp

Question: Types of missingness

You just learned about the three flavors of missing data: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). In this exercise, you’ll solidify your new knowledge by mapping examples to the types of missingness.

Question types of missingness. Source: DataCamp

Visualizing missing data

Dealing with missing data is one of the most common tasks in data science. There are a variety of types of missingness, as well as a variety of types of solutions to missing data.

You just received a new version of the accounts data frame containing data on the amount held and amount invested for new and existing customers. However, there are rows with missing inv_amount values.

You know for a fact that most customers below 25 do not have investment accounts yet, and suspect it could be driving the missingness.

# Visualize the missing values by column
vis_miss(accounts)

accounts %>%
  # missing_inv: Is inv_amount missing?
  mutate(missing_inv = is.na(inv_amount)) %>%
  # Group by missing_inv
  group_by(missing_inv) %>%
  # Calculate mean age for each missing_inv group
  summarize(avg_age = mean(age, na.rm=TRUE))

Since the average age for missing_inv = TRUE is 22 and the average age for missing_inv = FALSE is 44, it is likely that the inv_amount variable is missing mostly in young customers.

# Sort by age and visualize missing vals
accounts %>%
  arrange(age) %>%
  vis_miss()

Investigating summary statistics based on missingness is a great way to determine if data is missing completely at random or missing at random.

Treating missing data

In this exercise, you’re working with another version of the accounts data that contains missing values for both the cust_id and acct_amount columns.

You want to figure out how many unique customers the bank has, as well as the average amount held by customers. You know that rows with missing cust_id don’t really help you, and that on average, the acct_amount is usually 5 times the amount of inv_amount.

In this exercise, you will drop rows of accounts with missing cust_ids, and impute missing values of acct_amount with some domain knowledge.

We’ll need to install and load the assertive package.

# Create accounts_clean
accounts_clean <- accounts %>%
  # Filter to remove rows with missing cust_id
  filter(!is.na(cust_id)) %>%
  # Add new col acct_amount_filled with replaced NAs
  mutate(acct_amount_filled = ifelse(is.na(acct_amount), inv_amount * 5, acct_amount))

# Assert that cust_id has no missing vals
assert_all_are_not_na(accounts_clean$cust_id)
# or
sum(is.na(accounts_clean$cust_id))

# Assert that acct_amount_filled has no missing vals
assert_all_are_not_na(accounts_clean$acct_amount_filled)

Since your assertions passed, there’s no missing data left, and you can definitely bank on nailing your analysis!
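If the assertive package is not available, the same checks can be sketched with base R's stopifnot() and anyNA(). The toy data frame below is hypothetical; the factor of 5 is the domain rule stated above.

```r
library(dplyr)

# Hypothetical accounts with one missing cust_id and one missing acct_amount
accounts_toy <- tibble(
  cust_id = c("a", NA, "c"),
  acct_amount = c(500, 200, NA),
  inv_amount = c(100, 40, 30)
)

accounts_toy_clean <- accounts_toy %>%
  # Drop rows that cannot be attributed to a customer
  filter(!is.na(cust_id)) %>%
  # Fill missing acct_amount with 5 times inv_amount
  mutate(acct_amount_filled = ifelse(is.na(acct_amount), inv_amount * 5, acct_amount))

# stopifnot() raises an error if any condition is not TRUE
stopifnot(
  !anyNA(accounts_toy_clean$cust_id),
  !anyNA(accounts_toy_clean$acct_amount_filled)
)
```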


4: Record Linkage

Record linkage is a powerful technique for merging multiple datasets when values have typos or different spellings and an exact join would miss matches. In this chapter, you’ll learn how to link records by calculating the similarity between strings—you’ll then use your new skills to join two restaurant review datasets into one clean master dataset.

Minimum edit distance. Source: DataCamp

Edit distance = 1. Source: DataCamp

Edit distance = 4. Source: DataCamp

Types of edit distance. Source: DataCamp

Comparing strings to clean data. Source: DataCamp

Calculating distance

In the video exercise, you saw how to use Damerau-Levenshtein distance to identify how similar two strings are. As a reminder, Damerau-Levenshtein distance is the minimum number of steps needed to get from String A to String B, using these operations:

  • Insertion of a new character.
  • Deletion of an existing character.
  • Substitution of an existing character.
  • Transposition of two existing consecutive characters.

Substituting and inserting is the best way to get from "puffin" to "muffins". In the next exercise, you’ll calculate string distances using R functions.
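The "puffin" to "muffins" example can be checked with the stringdist package. The second pair of strings is an added illustration of why transposition matters: plain Levenshtein distance (method = "lv") has no transposition operation, so it needs two substitutions where Damerau-Levenshtein needs one swap.

```r
library(stringdist)

# One substitution (p -> m) plus one insertion (s)
puffin_dl <- stringdist("puffin", "muffins", method = "dl")

# Swapping two adjacent characters counts as a single step under "dl"
swap_dl <- stringdist("teh", "the", method = "dl")
# Without transposition, the same change costs two substitutions
swap_lv <- stringdist("teh", "the", method = "lv")

c(puffin_dl, swap_dl, swap_lv)
```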

Small distance, small difference

In the video exercise, you learned that there are multiple ways to calculate how similar or different two strings are. Now you’ll practice using the stringdist package to compute string distances using various methods. It’s important to be familiar with different methods, as some methods work better on certain datasets, while others work better on other datasets.

library(stringdist)
## 
## Attaching package: 'stringdist'
## The following object is masked from 'package:tidyr':
## 
##     extract
# Calculate Damerau-Levenshtein distance
stringdist("las angelos", "los angeles", method = "dl")
## [1] 2
# Calculate LCS distance
stringdist("las angelos", "los angeles", method = "lcs")
## [1] 4

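The Jaccard method:

  1. Count the total number of distinct elements in the two strings.
  2. Count the number of those elements that appear in both strings.
  3. Divide the latter by the former to get the Jaccard similarity. The Jaccard distance returned by stringdist is 1 minus this ratio, so it ranges from 0 (same set of elements) to 1 (no elements in common).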

# Calculate Jaccard distance
stringdist("las angelos", "los angeles", method = "jaccard")
## [1] 0

As there are no elements in the first string that aren’t in the second string, and vice versa, the Jaccard method finds no difference between the two strings. In this way, the Jaccard method treats the strings more as sets of letters rather than sequences of letters.
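Treating each string as a set of characters, the Jaccard distance can be sketched by hand in base R as one minus the share of distinct characters common to both strings. The helper name jaccard_distance is hypothetical; its result should agree with stringdist's method = "jaccard" at the default q = 1.

```r
# Hypothetical helper: Jaccard distance over single characters
jaccard_distance <- function(a, b) {
  chars_a <- unique(strsplit(a, "")[[1]])
  chars_b <- unique(strsplit(b, "")[[1]])
  shared <- length(intersect(chars_a, chars_b))  # characters in both strings
  total <- length(union(chars_a, chars_b))       # distinct characters overall
  1 - shared / total
}

# Same set of characters, different order: distance is 0
jaccard_distance("las angelos", "los angeles")

# Two of four distinct characters are shared: distance is 0.5
jaccard_distance("abc", "abd")
```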

In the next exercise, you’ll use Damerau-Levenshtein distance to map typo-ridden cities to their true spellings.

Fixing typos with string distance

In this chapter, one of the datasets you’ll be working with, zagat, is a set of restaurants in New York, Los Angeles, Atlanta, San Francisco, and Las Vegas. The data is from Zagat, a company that collects restaurant reviews, and includes the restaurant names, addresses, phone numbers, as well as other restaurant information.

The city column contains the name of the city that the restaurant is located in. However, there are a number of typos throughout the column. Your task is to map each city to one of the five correctly-spelled cities contained in the cities data frame.

We’ll need to install and load the fuzzyjoin package.

library(fuzzyjoin)
# Count the number of each city variation
zagat %>%
  count(city)

# Join zagat and cities and look at results
zagat %>%
  # Left join based on Damerau-Levenshtein distance using city and city_actual cols
  stringdist_left_join(cities, by = c("city" = "city_actual"), method = "dl") %>%
  # Select the name, city, and city_actual cols
  select(name, city, city_actual)

Now that you’ve created consistent spelling for each city, it will be much easier to compute summary statistics by city.
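A self-contained sketch of the fuzzy join on invented data follows. The restaurant names and misspellings are hypothetical, and max_dist = 2 is written out explicitly; it is the threshold below which two strings are treated as a match.

```r
library(dplyr)
library(fuzzyjoin)

# Hypothetical restaurants with typos in the city column
restaurants_toy <- tibble(
  name = c("cafe a", "bistro b", "diner c"),
  city = c("los angles", "new yrok", "atlanta")
)
cities_toy <- tibble(city_actual = c("los angeles", "new york", "atlanta"))

# Match each city to any correctly spelled city within 2 edits
matched <- restaurants_toy %>%
  stringdist_left_join(cities_toy, by = c("city" = "city_actual"),
                       method = "dl", max_dist = 2)

matched
```

If a typo is within max_dist of more than one correct spelling, the join returns one row per match, so it is worth checking the row count after joining.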

Video: Generating and comparing pairs

When joins won’t work. Source: DataCamp

What is record linkage? Source: DataCamp

Pair blocking

Zagat and Fodor’s are both companies that gather restaurant reviews. The zagat and fodors datasets both contain information about various restaurants, including addresses, phone numbers, and cuisine types. Some restaurants appear in both datasets, but don’t necessarily have the same exact name or phone number written down. In this chapter, you’ll work towards figuring out which restaurants appear in both datasets.

The first step towards this goal is to generate pairs of records so that you can compare them. In this exercise, you’ll first generate all possible pairs, and then use your newly-cleaned city column as a blocking variable.

# Load reclin
library(reclin)

# Generate all possible pairs
pair_blocking(zagat, fodors)

# Generate pairs with the same city
pair_blocking(zagat, fodors, blocking_var = "city")

By using city as a blocking variable, you were able to reduce the number of pairs you’ll need to compare from 165,230 pairs to 40,532.
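The reduction comes from simple counting: without blocking, every row of one table is paired with every row of the other, while blocking only pairs rows that share the blocking value. The base R sketch below uses two tiny hypothetical tables and merge() to stand in for the pair generation; it does not use reclin.

```r
# Hypothetical tables with a shared blocking variable
x <- data.frame(id_x = 1:5, city = c("ny", "ny", "la", "la", "sf"))
y <- data.frame(id_y = 1:4, city = c("ny", "la", "la", "sf"))

# Without blocking: every row of x is paired with every row of y
n_all_pairs <- nrow(x) * nrow(y)

# With blocking: an inner join on city keeps only pairs from the same city
blocked_pairs <- merge(x, y, by = "city")
n_blocked_pairs <- nrow(blocked_pairs)

c(n_all_pairs, n_blocked_pairs)
```

Here the blocked count is the sum over cities of the per-city row counts multiplied together: 2 x 1 for "ny", 2 x 2 for "la", and 1 x 1 for "sf".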

Comparing pairs

Now that you’ve generated the pairs of restaurants, it’s time to compare them. You can easily customize how you perform your comparisons using the by and default_comparator arguments. There’s no right answer as to what each should be set to, so in this exercise, you’ll try a couple options out.

# Generate pairs
pair_blocking(zagat, fodors, blocking_var = "city") %>%
  # Compare pairs by name using lcs()
  compare_pairs(by ="name",
      default_comparator = lcs())

# Generate pairs
pair_blocking(zagat, fodors, blocking_var = "city") %>%
  # Compare pairs by name, phone, and addr using jaro_winkler()
  compare_pairs(by = c("name", "phone", "addr"),
      default_comparator = jaro_winkler())

Choosing a comparator and the columns to compare is highly dataset-dependent, so it’s best to try out different combinations to see which works best on the dataset you’re working with. Next, you’ll build on your string comparison skills and learn about record linkage!

Video: Scoring and linking

Score then select or select then score?

Record linkage requires a number of steps that can be difficult to keep straight. In this exercise, you’ll solidify your knowledge of the record linkage process so that it’s a breeze when you code it yourself!

Question order for record linkage. Source: DataCamp

Putting it together

During this chapter, you’ve cleaned up the city column of zagat using string similarity, as well as generated and compared pairs of restaurants from zagat and fodors. The end is near - all that’s left to do is score and select pairs and link the data together, and you’ll be able to begin your analysis in no time!

# Create pairs
pair_blocking(zagat, fodors, blocking_var = "city") %>%
  # Compare pairs
  compare_pairs(by = "name", default_comparator = jaro_winkler()) %>%
  # Score pairs
  score_problink() %>%
  # Select pairs
  select_n_to_m() %>%
  # Link data 
  link()

Now that your two datasets are merged, you can use the data to figure out if there are certain characteristics that make a restaurant more likely to be reviewed by Zagat or Fodor’s.

Video: Congratulations!

Module 5, chapter 1 summary. Source: DataCamp

Module 5, chapter 2 summary. Source: DataCamp

Module 5, chapter 3 summary. Source: DataCamp

Module 5, chapter 4 summary. Source: DataCamp

Module 5 - further courses. Source: DataCamp


Module 6: Introduction to Data Visualization with ggplot2

The ability to produce meaningful and beautiful data visualizations is an essential part of your skill set as a data scientist. This course, the first R data visualization tutorial in the series, introduces you to the principles of good visualizations and the grammar of graphics plotting concepts implemented in the ggplot2 package. ggplot2 has become the go-to tool for flexible and professional plots in R. Here, we’ll examine the first three essential layers for making a plot - Data, Aesthetics and Geometries. By the end of the course you will be able to make complex exploratory plots.

1: Introduction

In this chapter we’ll get you into the right frame of mind for developing meaningful visualizations with R. You’ll understand that as a communications tool, visualizations require you to think about your audience first. You’ll also be introduced to the basics of ggplot2 - the 7 different grammatical elements (layers) and aesthetic mappings.

Video: Introduction

View slides.

Data viz is rooted in statistics and graphical data analysis, but it’s also a creative process that involves some amount of trial and error.

Question: Explore and explain

In this video we made the distinction between plots for exploring and plots for explaining data. Exploratory plots are typically meant for a specialist audience, data-heavy, rough first drafts and part of our data science toolkit as graphical data analysis. They are not typically pretty!

You’re not concerned with beautiful at this point. However, the plots should be meaningful and conform to best practices so that you do not mislead yourself!

Drawing your first plot

To get a first feel for ggplot2, let’s try to run some basic ggplot2 commands. The mtcars dataset contains information on 32 cars from the 1974 Motor Trend magazine. This dataset is small, intuitive, and contains a variety of continuous and categorical variables.

# ggplot2 package loaded earlier
# Explore the mtcars data frame with str()
str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
# Execute the following command
ggplot(mtcars, aes(cyl, mpg)) +
  geom_point()

Notice that ggplot2 treats cyl as a continuous variable. You get a plot, but it’s not quite right, because it gives the impression that there is such a thing as a 5 or 7-cylinder car, which there is not.

Data columns types affect plot types

The plot from the previous exercise wasn’t really satisfying. Although cyl (the number of cylinders) is categorical, you probably noticed that it is classified as numeric in mtcars. This is really misleading because the representation in the plot doesn’t match the actual data type. You’ll have to explicitly tell ggplot2 that cyl is a categorical variable.

# Change the command below so that cyl is treated as factor
ggplot(mtcars, aes(factor(cyl), mpg)) +
  geom_point()

Notice that ggplot2 treats cyl as a factor. This time the x-axis does not contain values like 5 or 7, only the values that are present in the dataset.
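The difference between the two plots comes down to the column's type, which can be inspected without plotting. This short base R sketch uses the built-in mtcars data.

```r
# cyl is stored as a plain numeric column
class(mtcars$cyl)

# factor() turns the distinct values into discrete levels
cyl_factor <- factor(mtcars$cyl)
levels(cyl_factor)

# Number of cars at each level
table(cyl_factor)
```

Only the values that occur in the data become levels, which is why the axis no longer suggests in-between cylinder counts.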

Video: The grammar of graphics

View slides.

Mapping data columns to aesthetics

Let’s dive a little deeper into the three main topics in this course: The data, aesthetics, and geom layers. We’ll get to making pretty plots in the last chapter with the themes layer.

We’ll continue working on the 32 cars in the mtcars data frame.

Consider how the examples and concepts we discuss throughout these courses apply to your own datasets!

# Edit to add a color aesthetic mapped to disp
ggplot(mtcars, aes(wt, mpg, color=disp)) +
  geom_point()

# Change the color aesthetic to a size aesthetic
ggplot(mtcars, aes(wt, mpg, size = disp)) +
  geom_point()

Notice that a legend for the color and size scales was automatically generated.

Question: Understanding variables

In the previous exercise you saw that disp can be mapped onto a color gradient or onto a continuous size scale.

Another argument of aes() is the shape of the points. There are a finite number of shapes which ggplot() can automatically assign to the points. However, if you try this command:

ggplot(mtcars, aes(wt, mpg, shape = disp)) +
  geom_point()
## Error: A continuous variable can not be mapped to shape

it gives an error.

The error message ‘A continuous variable can not be mapped to shape’, means that shape doesn’t exist on a continuous scale here.
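By contrast, shape works once the variable is categorical. As a minimal sketch, the plot below maps shape to factor(cyl), which has only three levels; the choice of cyl is an illustration and not part of the original question.

```r
library(ggplot2)

# shape needs a discrete variable, so cyl is converted to a factor first
p_shape <- ggplot(mtcars, aes(wt, mpg, shape = factor(cyl))) +
  geom_point()

p_shape
```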

Video: ggplot2 layers

View slides.

Adding geometries

The diamonds dataset contains details of 53,940 diamonds. Among the variables included are carat (a measurement of the diamond’s size) and price.

You’ll use two common geom layer functions:

  • geom_point() adds points (as in a scatter plot).
  • geom_smooth() adds a smooth trend curve.

As you saw previously, these are added using the + operator.

ggplot(data, aes(x, y)) +
  geom_*()

Where * is the specific geometry needed.

# Explore the diamonds data frame with str()
str(diamonds)
## Classes 'tbl_df', 'tbl' and 'data.frame':    53940 obs. of  10 variables:
##  $ carat  : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num  55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int  326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
# Add geom_smooth() with +
ggplot(diamonds, aes(carat, price)) +
  geom_point() +
  geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

If you had executed the command without adding a +, it would produce an error message ‘No layers in plot’ because you are missing the third essential layer - the geom layer.

Changing one geom or every geom

If you have multiple geoms, then mapping an aesthetic to a data variable inside the call to ggplot() will change all the geoms. It is also possible to make changes to individual geoms by passing arguments to the individual geom_*() functions.

For example, geom_point() has an argument alpha that controls the opacity of the points. A value of 1 (the default) means that the points are totally opaque; a value of 0 means the points are totally transparent (and therefore invisible). Values in between specify transparency.

We’ll amend our previous plot.

# Make the points 40% opaque
ggplot(diamonds, aes(carat, price, color = clarity)) +
  geom_point(alpha=0.4) +
  geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

geom_point() + geom_smooth() is a common combination.
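To make the "one geom or every geom" distinction concrete, here is a sketch on the built-in mtcars data. The use of factor(cyl) for color and of method = "lm" for the trend line are choices made for this illustration.

```r
library(ggplot2)

# Color mapped in ggplot(): points AND trend lines are split by cylinder
p_all_geoms <- ggplot(mtcars, aes(wt, mpg, color = factor(cyl))) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", se = FALSE)

# Color mapped in geom_point() only: colored points, one overall trend line
p_points_only <- ggplot(mtcars, aes(wt, mpg)) +
  geom_point(aes(color = factor(cyl)), alpha = 0.4) +
  geom_smooth(method = "lm", se = FALSE)

p_all_geoms
p_points_only
```

In both plots alpha is a fixed setting passed to geom_point(), so it affects the points but not the smooth.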

Saving plots as variables

Plots can be saved as variables, which can be added to later on using the + operator. This is really useful if you want to make multiple related plots from a common base.

# Draw a ggplot
plt_price_vs_carat <- ggplot(
  # Use the diamonds dataset
  diamonds,
  # For the aesthetics, map x to carat and y to price
  aes(x=carat, y=price)
)

# Add a point layer to plt_price_vs_carat
plt_price_vs_carat + geom_point()

# Edit this to make points 20% opaque: plt_price_vs_carat_transparent
plt_price_vs_carat_transparent <- plt_price_vs_carat + geom_point(alpha=0.2)

# See the plot
plt_price_vs_carat_transparent

# Edit this to map color to clarity,
# Assign the updated plot to a new object
plt_price_vs_carat_by_clarity <- plt_price_vs_carat + geom_point(aes(color=clarity))

# See the plot
plt_price_vs_carat_by_clarity

Assigning the common parts of a plot to a variable and then reusing that variable in other plots makes it really clear how much those plots have in common.


2: Aesthetics

Aesthetic mappings are the cornerstone of the grammar of graphics plotting concept. This is where the magic happens - converting continuous and categorical data into visual scales that provide access to a large amount of information in a very short time. In this chapter you’ll understand how to choose the best aesthetic mappings for your data.

Video: Visible aesthetics

View slides.

Species, a dataframe column, is mapped onto color, a visible aesthetic. Each mapped variable is its own column in the dataframe.

In general, try to keep your data and aesthetics layers in the same ggplot definition.

Exercise: All about aesthetics: color, shape and size

In the video you saw 9 visible aesthetics. Let’s apply them to a categorical variable — the cylinders in mtcars, cyl.

These are the aesthetics you can consider within aes() in this chapter: x, y, color, fill, size, alpha, label and shape.

One common convention is that you don’t name the x and y arguments to aes(), since they almost always come first, but you do name other arguments.

In the following exercise the fcyl column is categorical. It is cyl transformed into a factor.

# Create fcyl, whilst preserving row names of mtcars (dplyr doesn't preserve row names with functions like mutate and filter)
library(tibble)
## Warning: package 'tibble' was built under R version 3.5.3
## 
## Attaching package: 'tibble'
## The following object is masked from 'package:assertive':
## 
##     has_rownames
mtcars <- mtcars %>% 
  rownames_to_column('carnames') %>%
  mutate(fcyl = as.factor(cyl)) %>%
  column_to_rownames('carnames')
# Map x to mpg and y to fcyl
ggplot(mtcars, aes(mpg, fcyl)) +
  geom_point()

# Swap mpg and fcyl
ggplot(mtcars, aes(fcyl, mpg)) +
  geom_point()

# Map x to wt, y to mpg and color to fcyl
ggplot(mtcars, aes(wt, mpg, color=fcyl)) +
  geom_point()

ggplot(mtcars, aes(wt, mpg, color = fcyl)) +
  # Set the shape and size of the points
  geom_point(shape=1, size=4)

All about aesthetics: color vs. fill

Typically, the color aesthetic changes the outline of a geom and the fill aesthetic changes the inside. geom_point() is an exception: you use color (not fill) for the point color. However, some shapes have special behavior.

By default, geom_point() uses shape = 19: a solid circle. An alternative is shape = 21: a circle that allows you to use both fill for the inside and color for the outline. This lets you map two aesthetics to each point.

All shape values are described on the points() help page.

fcyl and fam are the cyl and am columns converted to factors, respectively.

# Create fam, whilst preserving row names of mtcars
mtcars <- mtcars %>% 
  rownames_to_column('carnames') %>%
  mutate(fam = as.factor(am)) %>%
  column_to_rownames('carnames')
# Map fcyl to fill
ggplot(mtcars, aes(wt, mpg, fill = fcyl)) +
  geom_point(shape = 1, size = 4)

ggplot(mtcars, aes(wt, mpg, fill = fcyl)) +
  # Change point shape; set alpha
  geom_point(shape = 21, size = 4, alpha = 0.6)

# Map color to fam
ggplot(mtcars, aes(wt, mpg, fill = fcyl, color = fam)) +
  geom_point(shape = 21, size = 4, alpha = 0.6)

Notice that mapping a categorical variable onto fill doesn’t change the colors, although a legend is generated! This is because the default shape for points only has a color attribute and not a fill attribute! Use fill when you have another shape (such as a bar), or when using a point that does have a fill and a color attribute, such as shape = 21, which is a circle with an outline. Any time you use a solid color, make sure to use alpha blending to account for overplotting.

All about aesthetics: comparing aesthetics

Now that you’ve got some practice with using attributes, be careful of a major pitfall: these attributes can overwrite the aesthetics of your plot!

# Establish the base layer
plt_mpg_vs_wt <- ggplot(mtcars, aes(wt, mpg))
# Map fcyl to size
plt_mpg_vs_wt +
  geom_point(aes(size = fcyl)) 
## Warning: Using size for a discrete variable is not advised.

# Map fcyl to alpha, not size
plt_mpg_vs_wt +
  geom_point(aes(alpha = fcyl))
## Warning: Using alpha for a discrete variable is not advised.

# Map fcyl to shape, not alpha
plt_mpg_vs_wt +
  geom_point(aes(shape = fcyl))

# Use text layer and map fcyl to label
plt_mpg_vs_wt +
  geom_text(aes(label = fcyl))

Which aesthetic do you think is the clearest for categorical data?

Exercise: Aesthetics for categorical & continuous variables

Many of the aesthetics can accept either continuous or categorical variables, but some are restricted to categorical data. For example, label and shape are only applicable to categorical data.
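
To see the restriction in practice, here is a small sketch that tries to map the continuous column disp onto shape. It assumes ggplot2 rejects the mapping when the plot is built (which is why the build is wrapped in tryCatch()); the exact wording of the message may differ between versions.

```r
library(ggplot2)

# disp is continuous, so mapping it to shape should be rejected at build time
p <- ggplot(mtcars, aes(wt, mpg, shape = disp)) +
  geom_point()

result <- tryCatch(
  {
    ggplot_build(p)
    "built without error"
  },
  error = function(e) conditionMessage(e)
)
print(result)

# Converting to a factor first makes the mapping valid (cyl has 3 levels)
p_ok <- ggplot(mtcars, aes(wt, mpg, shape = factor(cyl))) +
  geom_point()
```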

Video: Using attributes

View slides.

We usually use the word “aesthetics” to describe how something looks, but in ggplot2 we use the word to refer to aesthetic mappings. We use the word “attributes” to describe how something looks. Note that, confusingly, all of our visible aesthetics also exist as attributes. Attributes are always called in the geom layer (see next chapter). For example, geom_*(color = "red") sets the color attribute to “red”. The same applies to size and shape.

All about attributes: color, shape, size and alpha

This time you’ll use these arguments to set attributes of the plot, not map variables onto aesthetics.

You can specify colors in R using hex codes: a hash followed by two hexadecimal numbers each for red, green, and blue ("#RRGGBB"). Hexadecimal is base-16 counting. You have 0 to 9, and A representing 10 up to F representing 15. Pairs of hexadecimal numbers give you a range from 0 to 255. "#000000" is “black” (no color), "#FFFFFF" means “white”, and "#00FFFF" is cyan (mixed green and blue).
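
The arithmetic behind hex codes can be checked with base R alone. This is a short sketch using strtoi(), col2rgb() and rgb(), with the cyan example from above.

```r
# Two hex digits cover 0 to 255: "FF" is 15 * 16 + 15
strtoi("FF", base = 16L)

# Split a hex color into its red, green, and blue components
col2rgb("#00FFFF")

# Build the same hex code from decimal components
rgb(0, 255, 255, maxColorValue = 255)
```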

We’ll define a hexadecimal color variable my_blue.

# A hexadecimal color
my_blue <- "#4ABEFF"
ggplot(mtcars, aes(wt, mpg)) +
  # Set the point color and alpha
  geom_point(color = my_blue, alpha = 0.6)

# Change the color mapping to a fill mapping
ggplot(mtcars, aes(wt, mpg, color = fcyl)) +
  # Set point size and shape
  geom_point(size = 10, shape = 1)

becomes…

# Change the color mapping to a fill mapping
ggplot(mtcars, aes(wt, mpg, fill = fcyl)) +
  # Set point size and shape
  geom_point(color = my_blue, size = 10, shape = 1)

ggplot2 lets you control these attributes in many ways to customize your plots.

All about attributes: conflicts with aesthetics

In the videos you saw that you can use all the aesthetics as attributes. Let’s see how this works with the aesthetics you used in the previous exercises: x, y, color, fill, size, alpha, label and shape.

In this exercise you will set all kinds of attributes of the points!

You will continue to work with mtcars.

ggplot(mtcars, aes(wt, mpg, color = fcyl)) +
  # Add point layer with alpha 0.5
  geom_point(alpha = 0.5)

ggplot(mtcars, aes(wt, mpg, color = fcyl)) +
  # Add text layer with label rownames(mtcars) and color red
  geom_text(aes(label = rownames(mtcars)), color = "red")

although DataCamp has the answer as

ggplot(mtcars, aes(wt, mpg, color = fcyl)) +
  # Add text layer with label rownames(mtcars) and color red
  geom_text(label = rownames(mtcars), color = "red")

ggplot(mtcars, aes(wt, mpg, color = fcyl)) +
  # Add points layer with shape 24 and color yellow
  geom_point(shape = 24, color = "yellow")

Going all out

In this exercise, you will gradually add more aesthetics layers to the plot. You’re still working with the mtcars dataset, but this time you’re using more features of the cars. Each of the columns is described on the mtcars help page.

Notice that adding more aesthetic mappings to your plot is not always a good idea! You may just increase complexity and decrease readability.

# 3 aesthetics: qsec vs. mpg, colored by fcyl
ggplot(mtcars, aes(mpg, qsec, color = fcyl)) +
  geom_point()

# 4 aesthetics: add a mapping of shape to fam
ggplot(mtcars, aes(mpg, qsec, color = fcyl, shape = fam)) +
  geom_point()

# 5 aesthetics: add a mapping of size to hp / wt
ggplot(mtcars, aes(mpg, qsec, color = fcyl, shape = fam, size = hp/wt)) +
  geom_point()

Between the x and y dimensions, the color, shape, and size of the points, your plot displays five dimensions of the dataset!

Video: Modifying aesthetics

View slides.

Positions: How to adjust for overlapping on a single layer.

position =

  • identity - the default position for our scatterplot. In essence, don’t do anything, and put the information where the data says to put it (explicitly, position = "identity"). However, if your data is measured to the nearest, say, mm, many points might overlap. We might want to add some random noise on both axes to account for this, using…
  • jitter. You can create the position with position_jitter() and assign it to a variable before calling the plot (rather than passing "jitter" as an argument within the plot). This lets you set specific arguments such as width (e.g. 0.1), which says how much random noise should be added, and seed, which sets the starting number used to generate the sequence of random numbers.
  • dodge
  • stack
  • fill
  • jitterdodge
  • nudge
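
As a sketch of the jitter bullet above, the position can be stored in a variable and then handed to a geom. The name posn_j and the seed value are arbitrary choices for illustration.

```r
library(ggplot2)

# Store the position so the same settings can be reused across layers
posn_j <- position_jitter(width = 0.1, seed = 136)

ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  geom_point(position = posn_j, alpha = 0.5)
```

Fixing the seed means the points land in the same places every time the plot is drawn, which helps when you compare versions of a plot.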

Each of these aesthetics is a “scale” that we map data onto. So color is just a scale, like x and y are scales. We can access all these scales using the scale_ functions:

  • scale_x_*()
  • scale_y_*()
  • scale_color_*()
    • Also scale_colour_*()
  • scale_fill_*()
  • scale_shape_*()
  • scale_linetype_*()
  • scale_size_*()

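What does the * mean? It stands for the kind of scale, which usually matches the type of data being mapped, for example: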

  • scale_x_continuous()
  • scale_color_discrete()
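
A minimal sketch of the two scale functions just named, applied to mtcars. The axis title, limits and legend title are made-up values for illustration.

```r
library(ggplot2)

plt <- ggplot(mtcars, aes(wt, mpg, color = factor(cyl))) +
  geom_point() +
  # x is continuous, so the matching function is scale_x_continuous()
  scale_x_continuous("Weight (1000 lbs)", limits = c(1, 6)) +
  # color is mapped to a factor, so the matching function is scale_color_discrete()
  scale_color_discrete("Cylinders")

plt
```

In both calls the first argument is the scale's name, which is what appears as the axis or legend title.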

Updating aesthetic labels

In this exercise, you’ll modify some aesthetics to make a bar plot of the number of cylinders for cars with different types of transmission.

You’ll also make use of some functions for improving the appearance of the plot.

  • labs() to set the x- and y-axis labels. It takes strings for each argument.
  • scale_color_manual() defines properties of the color scale (i.e. axis); scale_fill_manual(), used below, does the same for the fill scale. The first argument sets the legend title. values is a vector of colors to use, which can optionally be named.

ggplot(mtcars, aes(fcyl, fill = fam)) +
  geom_bar() +
  # Set the axis labels
  labs(x = "Number of Cylinders",
  y = "Count")

palette <- c("#377EB8", "#E41A1C")
famlabs <- c("Automatic", "Manual")

ggplot(mtcars, aes(fcyl, fill = fam)) +
  geom_bar() +
  labs(x = "Number of Cylinders", y = "Count") +
  # Set the fill color scale
  scale_fill_manual("Transmission", values = palette, labels = famlabs)

# Set the position
ggplot(mtcars, aes(fcyl, fill = fam)) +
  geom_bar(position = "dodge") +
  labs(x = "Number of Cylinders", y = "Count") +
  # Set the fill color scale
  scale_fill_manual("Transmission", values = palette, labels = famlabs)

Choosing the right position argument is an important part of making a good plot.

Setting a dummy aesthetic

In the last chapter you saw that all the visible aesthetics can serve as attributes and aesthetics, but I very conveniently left out x and y. That’s because although you can make univariate plots (such as histograms, which you’ll get to in the next chapter), a y-axis will always be provided, even if you didn’t ask for it.

You can make univariate plots in ggplot2, but you will need to add a fake y axis by mapping y to zero.

When setting y-axis limits, you can specify the limits as separate arguments, or as a single numeric vector. That is, ylim(lo, hi) or ylim(c(lo, hi)).

# Plot 0 vs. mpg
ggplot(mtcars, aes(mpg, 0)) +
  # Add jitter 
  geom_point(position = "jitter")

ggplot(mtcars, aes(mpg, 0)) +
  geom_jitter() +
  # Set the y-axis limits
  ylim(-2, 2)

The best way to make your plot depends on a lot of different factors and sometimes ggplot2 might not be the best choice.

Video: Aesthetics best practices

View slides.

Question: Appropriate mappings

Incorrect aesthetic mapping causes confusion or misleads the audience.

Typically, the dependent variable is mapped onto the y-axis and the independent variable is mapped onto the x-axis.

In the ToothGrowth data set, we have three variables:

Variable Description
len Tooth length
supp Supplement type (VC or OJ)
dose Dose in milligrams/day

From the six possible ways to map three variables, one solution is shown in the viewer. However, x = supp, y = len, color = dose would be better.


3: Geometries

A plot’s geometry dictates what visual elements will be used. In this chapter, we’ll familiarize you with the geometries used in the three most common plot types you’ll encounter - scatter plots, bar charts and line plots. We’ll look at a variety of different ways to construct these plots.

Video: Scatter plots

View slides.

There are nearly fifty different geometries to choose from, though there are some redundancies. Source: DataCamp

Each geom is associated with specific aesthetic mappings. Some of these are essential, and some are optional.

For example, for geom_point(), x and y are essential, whereas alpha, color, fill, shape, size, and stroke are optional attribute settings.
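
If you want to check which aesthetics a geom treats as essential, one option is to inspect the geom's underlying object. This sketch assumes the GeomPoint object exported by ggplot2 keeps its required_aes and default_aes fields; note that ggplot2 spells the color aesthetic "colour" internally.

```r
library(ggplot2)

# Aesthetics that geom_point() cannot draw without
GeomPoint$required_aes

# Optional aesthetics, listed with the defaults used when they are not set
names(GeomPoint$default_aes)
```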

Shape attribute values. Source: DataCamp

iris %>% 
  group_by(Species) %>% 
  summarise_all(mean) -> iris.summary

Overplotting 1: large datasets

Scatter plots (using geom_point()) are intuitive, easily understood, and very common, but we must always consider overplotting, particularly in the following four situations:

  1. Large datasets
  2. Aligned values on a single axis
  3. Low-precision data
  4. Integer data

Typically, alpha blending (i.e. adding transparency) is recommended when using solid shapes. Alternatively, you can use opaque but hollow shapes.

Small points are suitable for large datasets with regions of high density (lots of overlapping).

Let’s use the diamonds dataset to practice dealing with the large dataset case.

Recall earlier we created the base plot plt_price_vs_carat_by_clarity. We will redefine it slightly, by moving the color aesthetic mapping (of clarity) inside ggplot().

# Plot price vs. carat, colored by clarity
plt_price_vs_carat_by_clarity <- ggplot(diamonds, aes(carat, price, color = clarity))

# Add a point layer with tiny points
plt_price_vs_carat_by_clarity + geom_point(alpha = 0.5, shape = ".")

# Change shape
plt_price_vs_carat_by_clarity + geom_point(alpha = 0.5, shape = 16)

Overplotting 2: Aligned values

Let’s take a look at another case where we should be aware of overplotting: Aligning values on a single axis.

This occurs when one axis is continuous and the other is categorical, which can be overcome with some form of jittering.

# Plot base
plt_mpg_vs_fcyl_by_fam <- ggplot(mtcars, aes(fcyl, mpg, color = fam))

# Without jittering
plt_mpg_vs_fcyl_by_fam + geom_point()

# Alter the point positions by jittering, width 0.3
plt_mpg_vs_fcyl_by_fam + geom_point(position = position_jitter(0.3))

# Now jitter and dodge the point positions
plt_mpg_vs_fcyl_by_fam + geom_point(position = position_jitterdodge(jitter.width = 0.3, dodge.width = 0.3))

Note: Dodging preserves the vertical position of a geom while adjusting the horizontal position.

These are some simple ways of dealing with overplotting, but you’ll encounter more ideas throughout the ggplot2 courses when we encounter atypical geoms.

Overplotting 3: Low-precision data

You already saw how to deal with overplotting when using geom_point() in two cases:

  1. Large datasets
  2. Aligned values on a single axis

We used position = 'jitter' inside geom_point() or geom_jitter().

Let’s take a look at another case:

  3. Low-precision data

This results from low-resolution measurements like in the iris dataset, which is measured to 1mm precision (see viewer). It’s similar to case 2, but in this case we can jitter on both the x and y axes.

ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  # Swap for jitter layer with width 0.1
  geom_jitter(alpha = 0.5, width = 0.1)

ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  # Set the position to jitter
  geom_point(alpha = 0.5, position = "jitter")

ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  # Use a jitter position function with width 0.1
  geom_point(alpha = 0.5, position = position_jitter(width = 0.1))

Notice that jitter can be a geom itself (i.e. geom_jitter()), an argument in geom_point() (i.e. position = "jitter"), or a position function, (i.e. position_jitter()).

Overplotting 4: Integer data

Let’s take a look at the last case of dealing with overplotting:

  4. Integer data

This can be type integer (i.e. 1, 2, 3…) or categorical (i.e. class factor) variables. factor is just a special class of type integer.
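
The link between factors and integers can be checked directly in base R. In this short sketch, f is just an illustrative vector.

```r
# A small factor with three levels
f <- factor(c("4", "6", "8", "4"))

class(f)       # the class is "factor"
typeof(f)      # but the values are stored as integers
as.integer(f)  # the integer codes index into levels(f)
levels(f)
```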

You’ll typically have a small, defined number of intersections between two variables, which is similar to case 3, but you may miss it if you don’t realize that integer and factor data are the same as low precision data.

The Vocab dataset provided contains the years of education and vocabulary test scores from respondents to US General Social Surveys from 1972-2004.

# Read in Vocab, found here: https://raw.githubusercontent.com/anilak1978/r-bridge-week-2-assignment/master/Vocab.csv
Vocab <- read.table("Vocab.csv", sep=",", header = TRUE)
Vocab <- Vocab %>% 
  mutate(year = as.numeric(year), education = as.factor(education), vocabulary = as.factor(vocabulary)) %>% 
  column_to_rownames('X')
# Examine the structure of Vocab
str(Vocab)
## 'data.frame':    30351 obs. of  4 variables:
##  $ year      : num  1974 1974 1974 1974 1974 ...
##  $ sex       : Factor w/ 2 levels "Female","Male": 2 2 1 1 1 2 2 2 1 1 ...
##  $ education : Factor w/ 21 levels "0","1","2","3",..: 15 17 11 11 13 17 18 11 13 12 ...
##  $ vocabulary: Factor w/ 11 levels "0","1","2","3",..: 10 10 10 6 9 9 10 6 4 6 ...
# Plot vocabulary vs. education
ggplot(Vocab, aes(education, vocabulary)) +
  # Add a point layer
  geom_point()

ggplot(Vocab, aes(education, vocabulary)) +
  # Change to a jitter layer
  geom_jitter()

ggplot(Vocab, aes(education, vocabulary)) +
  # Set the transparency to 0.2
  geom_jitter(alpha = 0.2)

ggplot(Vocab, aes(education, vocabulary)) +
  # Set the shape to 1
  geom_jitter(alpha = 0.2, shape = 1)

Notice how jittering and alpha blending serve as a great solution to the overplotting problem here. Setting the shape to 1 didn’t really help, but it was useful in the previous exercises when you had less data. You need to consider each plot individually. You’ll encounter this dataset again when you look at bar plots.

Video: Histogram

View slides.

Common plot types. Source: DataCamp

Drawing histograms

Recall that histograms cut up a continuous variable into discrete bins and, by default, map the internally calculated count variable (the number of observations in each bin) onto the y aesthetic. An internal variable called density can be accessed by using the .. notation, i.e. ..density... Plotting this variable will show the relative frequency, which is the height times the width of each bin.
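
The claim that relative frequency is height times width can be verified numerically. The sketch below assumes a ggplot2 version that provides after_stat() (as far as I know, 3.3.0 onward), which newer releases offer as the replacement for the .. notation; bins and rel_freq are illustrative names.

```r
library(ggplot2)

# after_stat(density) refers to the same computed variable as ..density..
p <- ggplot(mtcars, aes(mpg, after_stat(density))) +
  geom_histogram(binwidth = 1)

# layer_data() returns the values computed for the first layer
bins <- layer_data(p)

# Each bar's area (height times width) is that bin's relative frequency
rel_freq <- bins$density * (bins$xmax - bins$xmin)
sum(rel_freq)
```

Since every observation falls into exactly one bin, the relative frequencies should add up to 1.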

# Plot mpg
ggplot(mtcars, aes(mpg)) +
  # Add a histogram layer
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(mtcars, aes(mpg)) +
  # Set the binwidth to 1
  geom_histogram(binwidth = 1)

# Map y to ..density..
ggplot(mtcars, aes(mpg, ..density..)) +
  geom_histogram(binwidth = 1)

datacamp_light_blue <- "#51A8C9"

ggplot(mtcars, aes(mpg, ..density..)) +
  # Set the fill color to datacamp_light_blue
  geom_histogram(binwidth = 1, fill = datacamp_light_blue)

Histograms are one of the most common exploratory plots for continuous data. If you want to use density on the y-axis be sure to set your binwidth to an intuitive value.

Positions in histograms

Here, we’ll examine the various ways of applying positions to histograms. geom_histogram(), a special case of geom_bar(), has a position argument that can take on the following values:

  • stack (the default): Bars for different groups are stacked on top of each other.
  • dodge: Bars for different groups are placed side by side.
  • fill: Bars for different groups are shown as proportions.
  • identity: Plot the values as they appear in the dataset.

For this example, you’ll use the mtcars dataset.

# Update the aesthetics so the fill color is by fam
ggplot(mtcars, aes(mpg, fill = fam)) +
  geom_histogram(binwidth = 1)

ggplot(mtcars, aes(mpg, fill = fam)) +
  # Change the position to dodge
  geom_histogram(binwidth = 1, position = "dodge")

ggplot(mtcars, aes(mpg, fill = fam)) +
  # Change the position to fill
  geom_histogram(binwidth = 1, position = "fill")
## Warning: Removed 16 rows containing missing values (geom_bar).

ggplot(mtcars, aes(mpg, fill = fam)) +
  # Change the position to identity, with transparency 0.4
  geom_histogram(binwidth = 1, position = "identity", alpha = 0.4)

This nicely demonstrates the difference between "stack" (the default) and "identity".

Now proceed with bar plots!

Video: Bar plots

View slides.

Position in bar and col plots

Let’s see how the position argument changes geom_bar().

We have three position options:

  • stack: The default
  • dodge: Preferred
  • fill: To show proportions

While we will be using geom_bar() here, note that the function geom_col() is just geom_bar() where the stat argument is set to "identity" (its position still defaults to "stack"). It is used when we want the heights of the bars to represent the exact values in the data.
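
To make the difference concrete, here is a sketch that draws the same bar heights both ways. The cyl_counts data frame is built here for illustration and is not part of the course material.

```r
library(ggplot2)

# Pre-compute the counts that geom_bar() would otherwise calculate itself
cyl_counts <- as.data.frame(table(cyl = mtcars$cyl))

# geom_col() draws the Freq values directly as bar heights
ggplot(cyl_counts, aes(cyl, Freq)) +
  geom_col()

# Equivalent plot: geom_bar() counts the rows for each value of cyl
ggplot(mtcars, aes(factor(cyl))) +
  geom_bar()
```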

In this exercise, you’ll draw the total count of cars having a given number of cylinders (fcyl), according to manual or automatic transmission type (fam).

# Plot fcyl, filled by fam
ggplot(mtcars, aes(fcyl, fill=fam)) +
  # Add a bar layer
  geom_bar()

ggplot(mtcars, aes(fcyl, fill = fam)) +
  # Set the position to "fill"
  geom_bar(position = "fill")

ggplot(mtcars, aes(fcyl, fill = fam)) +
  # Change the position to "dodge"
  geom_bar(position = "dodge")

Different kinds of plots need different position arguments, so it’s important to be familiar with this attribute.

Overlapping bar plots

You can customize bar plots further by adjusting the dodging so that your bars partially overlap each other. Instead of using position = "dodge", you’re going to use position_dodge(), like you did with position_jitter() in the previous exercises. Here, you’ll save this as an object, posn_d, so that you can easily reuse it.

Remember, the reason you want to use position_dodge() (and position_jitter()) is to specify how much dodging (or jittering) you want.
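
A minimal sketch of saving the dodge position as posn_d and reusing it. It converts cyl and am to factors inline so that it runs on an unmodified mtcars.

```r
library(ggplot2)

# Save the dodge settings once so several layers or plots can share them
posn_d <- position_dodge(width = 0.2)

ggplot(mtcars, aes(factor(cyl), fill = factor(am))) +
  geom_bar(position = posn_d, alpha = 0.6)
```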

We start with (from before)…

ggplot(mtcars, aes(cyl, fill = fam)) +
  # Change position to use the functional form, with width 0.2
  geom_bar(position = "dodge")

…then move to…

ggplot(mtcars, aes(cyl, fill = fam)) +
  # Change position to use the functional form, with width 0.2
  geom_bar(position = position_dodge(width = 0.2))

finishing with…

ggplot(mtcars, aes(cyl, fill = fam)) +
  # Set the transparency to 0.6
  geom_bar(position = position_dodge(width = 0.2), alpha = 0.6)

By using these position functions, you can customize your plot to suit your needs.

Bar plots: sequential color palette

In this bar plot, we’ll fill each segment according to an ordinal variable. The best way to do that is with a sequential color palette.

Here’s an example of applying a ColorBrewer palette with the mtcars dataset (note that "Set1" is a qualitative palette; sequential ones include "Blues"):

ggplot(mtcars, aes(fcyl, fill = fam)) +
  geom_bar() +
  scale_fill_brewer(palette = "Set1")

In the exercise, you’ll use similar code on the Vocab dataset. Both datasets are ordinal.

# Plot education, filled by vocabulary
ggplot(Vocab, aes(education, fill = vocabulary))

# Plot education, filled by vocabulary
ggplot(Vocab, aes(education, fill = vocabulary)) +
  # Add a bar layer with position "fill"
  geom_bar(position = "fill")

# Plot education, filled by vocabulary
ggplot(Vocab, aes(education, fill = vocabulary)) +
  # Add a bar layer with position "fill"
  geom_bar(position = "fill") +
  # Add a brewer fill scale with the "Set1" palette
  scale_fill_brewer(palette = "Set1")
## Warning in RColorBrewer::brewer.pal(n, pal): n too large, allowed maximum for palette Set1 is 9
## Returning the palette you asked for with that many colors

The plot is not complete! Let’s fix this in the next exercise.
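
The warning appears because vocabulary has 11 levels while the brewer palette provides at most 9 colors. One common workaround, shown here as a sketch and not necessarily the course's own solution, is to interpolate a sequential palette up to the number of levels you need. The toy data frame is a hypothetical stand-in for Vocab.

```r
library(ggplot2)
library(RColorBrewer)

# Interpolate the 9-color sequential "Blues" palette up to 11 colors
blue_range <- colorRampPalette(brewer.pal(9, "Blues"))
blues_11 <- blue_range(11)

# Hypothetical stand-in for an 11-level ordinal variable such as vocabulary
toy <- data.frame(
  group = rep(c("a", "b"), each = 11),
  score = factor(rep(0:10, times = 2))
)

ggplot(toy, aes(group, fill = score)) +
  geom_bar(position = "fill") +
  scale_fill_manual(values = blues_11)
```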

Video: Line plots

View slides.

Basic line plots

Here, we’ll use the economics dataset to make some line plots. The dataset contains a time series for unemployment and population statistics from the Federal Reserve Bank of St. Louis in the United States. The data is contained in the ggplot2 package.

To begin with, you can look at how the median unemployment time and the unemployment rate (the number of unemployed people as a proportion of the population) change over time.

# Print the head of economics
head(economics)
## # A tibble: 6 x 6
##   date         pce    pop psavert uempmed unemploy
##   <date>     <dbl>  <dbl>   <dbl>   <dbl>    <dbl>
## 1 1967-07-01  507. 198712    12.6     4.5     2944
## 2 1967-08-01  510. 198911    12.6     4.7     2945
## 3 1967-09-01  516. 199113    11.9     4.6     2958
## 4 1967-10-01  512. 199311    12.9     4.9     3143
## 5 1967-11-01  517. 199498    12.8     4.7     3066
## 6 1967-12-01  525. 199657    11.8     4.8     3018
# Using economics, plot unemploy vs. date
ggplot(economics, aes(date, unemploy)) +
  # Make it a line plot
  geom_line()

# Change the y-axis to the proportion of the population that is unemployed
ggplot(economics, aes(date, unemploy/pop)) +
  geom_line()

In the next exercise, we’ll make more complicated line plots.

Multiple time series

We already saw how the form of your data affects how you can plot it. Let’s explore that further with multiple time series. Here, it’s important that all lines are on the same scale, and if possible, on the same plot.

fish.species contains the global capture rates of seven salmon species from 1950–2010. Each variable (column) is a salmon species and each observation (row) is one year. fish.tidy contains the same data, but in three columns: Species, Year, and Capture (i.e. one variable per column).
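
Since the fish datasets are not available, here is a sketch of how a wide table like fish.species could be reshaped into the tidy form. It assumes the tidyr package (version 1.0.0 or later for pivot_longer()); fish_wide and its numbers are invented purely for illustration.

```r
library(tidyr)

# Hypothetical wide table: one column per species, one row per year
fish_wide <- data.frame(
  Year = c(1950, 1951),
  Pink = c(100, 110),
  Rainbow = c(5, 7)
)

# Reshape to one row per species-year combination
fish_long <- pivot_longer(
  fish_wide,
  cols = -Year,
  names_to = "Species",
  values_to = "Capture"
)
fish_long
```

With the data in this shape, Species becomes a single column that can be mapped to color or group.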

The following code will not be run as the dataset is not available.

# Plot the Rainbow Salmon time series
ggplot(fish.species, aes(x = Year, y = Rainbow)) +
  geom_line()
# Plot the Pink Salmon time series
ggplot(fish.species, aes(x = Year, y = Pink)) +
  geom_line()
# Plot multiple time-series by grouping by species
ggplot(fish.tidy, aes(Year, Capture)) +
  geom_line(aes(group = Species))
# Plot multiple time-series by coloring by species
ggplot(fish.tidy, aes(Year, Capture, color = Species)) +
  geom_line(aes(group = Species))

As you can see in the last couple of plots, a grouping aesthetic was vital here. If you don’t specify color = Species, you’ll get a mess of lines.


4: Themes

In this chapter, we’ll explore how understanding the structure of your data makes data visualization much easier. Plus, it’s time to make our plots pretty. This is the last step in the data viz process. The Themes layer will enable you to make publication quality plots directly in R. In the next course we’ll look at some extra layers to add more variables to your plots.

Video: Themes from scratch

View slides.

Moving the legend

Let’s wrap up this course by making a publication-ready plot communicating a clear message.

To change stylistic elements of a plot, call theme() and set plot properties to a new value. For example, the following changes the legend position.

p + theme(legend.position = new_value)

Here, the new value can be

  • "top", "bottom", "left", or "right": place it at that side of the plot.
  • "none": don’t draw it.
  • c(x, y), where c(0, 0) means the bottom-left and c(1, 1) means the top-right.

Let’s revisit the recession period line plot (assigned to plt_prop_unemployed_over_time).

# View the default plot
plt_prop_unemployed_over_time

# Remove legend entirely
plt_prop_unemployed_over_time +
  theme(legend.position = "none")
# Position the legend at the bottom of the plot
plt_prop_unemployed_over_time +
  theme(legend.position = "bottom")
# Position the legend inside the plot at (0.6, 0.1)
plt_prop_unemployed_over_time +
  theme(legend.position = c(0.6, 0.1))

But be careful when placing a legend inside your plotting space. You could end up obscuring data.

Modifying theme elements

Many plot elements have multiple properties that can be set. For example, line elements in the plot such as axes and gridlines have a color, a thickness (size), and a line type (solid line, dashed, or dotted). To set the style of a line, you use element_line(). For example, to make the axis lines into red, dashed lines, you would use the following.

p + theme(axis.line = element_line(color = "red", linetype = "dashed"))

Similarly, element_rect() changes rectangles and element_text() changes text. You can remove a plot element using element_blank().

plt_prop_unemployed_over_time +
  theme(
    # For all rectangles, set the fill color to grey92
    rect = element_rect(fill = "grey92"),
    # For the legend key, turn off the outline
    legend.key = element_rect(color = NA)
  )
plt_prop_unemployed_over_time +
  theme(
    rect = element_rect(fill = "grey92"),
    legend.key = element_rect(color = NA),
    # Turn off axis ticks
    axis.ticks = element_blank(),
    # Turn off the panel grid
    panel.grid = element_blank()
  )
plt_prop_unemployed_over_time +
  theme(
    rect = element_rect(fill = "grey92"),
    legend.key = element_rect(color = NA),
    axis.ticks = element_blank(),
    panel.grid = element_blank(),
    panel.grid.major.y = element_line(
      color = "white",
      size = 0.5,
      linetype = "dotted"
    ),
    # Set the axis text color to grey25
    axis.text = element_text(color = "grey25"),
    # Set the plot title font face to italic and font size to 16
    plot.title = element_text(size = 16, face = "italic")
  )

This plot is ready for prime time – it’s pretty AND informative. Make sure that all your text is legible for the context in which it will be viewed.

Modifying whitespace

Whitespace means all the non-visible margins and spacing in the plot.

To set a single whitespace value, use unit(x, unit), where x is the amount and unit is the unit of measure.

Borders require you to set 4 positions, so use margin(top, right, bottom, left, unit). To remember the margin order, think TRouBLe.

The default unit is "pt" (points), which scales well with text. Other options include “cm”, “in” (inches) and “lines” (of text).
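
A short sketch of unit() and margin() used together inside theme(), on a plot built here for illustration; the particular lengths are arbitrary.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(wt, mpg, color = factor(cyl))) +
  geom_point()

p_spaced <- p +
  theme(
    # A single length: unit(amount, unit of measure)
    axis.ticks.length = unit(0.5, "cm"),
    # Four lengths in top, right, bottom, left order
    plot.margin = margin(10, 30, 50, 70, "pt")
  )

p_spaced
```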

plt_mpg_vs_wt_by_cyl is available. The panel and legend are wrapped in blue boxes so you can see how they change.

# View the original plot
plt_mpg_vs_wt_by_cyl

plt_mpg_vs_wt_by_cyl +
  theme(
    # Set the axis tick length to 2 lines
    axis.ticks.length = unit(2, "lines")
  )
plt_mpg_vs_wt_by_cyl +
  theme(
    # Set the legend key size to 3 centimeters
    legend.key.size = unit(3, "cm")
  )
plt_mpg_vs_wt_by_cyl +
  theme(
    # Set the legend margin to (20, 30, 40, 50) points
    legend.margin = margin(20, 30, 40, 50, "pt")
  )
plt_mpg_vs_wt_by_cyl +
  theme(
    # Set the plot margin to (10, 30, 50, 70) millimeters
    plot.margin = margin(10, 30, 50, 70, "mm")
  )

Changing the whitespace can be useful if you need to make your plot more compact, or if you want to create more space to reduce “busyness”.

Video: Theme flexibility

View slides.

Built-in themes

In addition to making your own themes, there are several out-of-the-box solutions that may save you lots of time.

  • theme_gray() is the default.
  • theme_bw() is useful when you use transparency.
  • theme_classic() is more traditional.
  • theme_void() removes everything but the data.
# Add a black and white theme
plt_prop_unemployed_over_time +
  theme_bw()
# Add a classic theme
plt_prop_unemployed_over_time +
  theme_classic()
# Add a void theme
plt_prop_unemployed_over_time +
  theme_void()

The black and white theme works really well if you use transparency in your plot.

Exploring ggthemes

Outside of ggplot2, another source of built-in themes is the ggthemes package. Let’s explore some of the ready-made ggthemes themes.

# Load ggthemes, which provides the themes used below
library(ggthemes)
# Use the fivethirtyeight theme
plt_prop_unemployed_over_time +
  theme_fivethirtyeight()
# Use Tufte's theme
plt_prop_unemployed_over_time +
  theme_tufte()
# Use the Wall Street Journal theme
plt_prop_unemployed_over_time +
  theme_wsj()

ggthemes has over 20 themes for you to try.

Setting themes

Reusing a theme across many plots helps to provide a consistent style. There are two main ways to do this.

  1. Assign the theme to a variable, and add it to each plot.
  2. Set your theme as the default using theme_set().

A good strategy, which you’ll use here, is to begin with a built-in theme and then modify it.

plt_prop_unemployed_over_time is available. The theme you made earlier is shown in the sample code.

# Save the theme as theme_recession
theme_recession <- theme(
  rect = element_rect(fill = "grey92"),
  legend.key = element_rect(color = NA),
  axis.ticks = element_blank(),
  panel.grid = element_blank(),
  panel.grid.major.y = element_line(color = "white", size = 0.5, linetype = "dotted"),
  axis.text = element_text(color = "grey25"),
  plot.title = element_text(face = "italic", size = 16),
  legend.position = c(0.6, 0.1)
)

# Combine the Tufte theme with theme_recession
theme_tufte_recession <- theme_tufte() + theme_recession

# Add the Tufte recession theme to the plot
plt_prop_unemployed_over_time + theme_tufte_recession
# Set theme_tufte_recession as the default theme
theme_set(theme_tufte_recession)

# Draw the plot (without explicitly adding a theme)
plt_prop_unemployed_over_time
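
Because theme_set() changes the default for every later plot in the session, it can help to keep the previous default so that it can be restored. The sketch below assumes only ggplot2; my_theme and old_theme are illustrative names, not objects from the exercise.

```r
library(ggplot2)

# Start from a built-in theme, then override a single element.
my_theme <- theme_bw() + theme(legend.position = "bottom")

# theme_set() invisibly returns the theme that was active before the call.
old_theme <- theme_set(my_theme)

# The active default is now my_theme.
theme_get()$legend.position

# Restore the previous default once the custom theme is no longer wanted.
theme_set(old_theme)
```

Adding theme() to a complete theme such as theme_bw() only overrides the elements that are named; everything else is inherited from the built-in theme.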

Publication-quality plots

We’ve seen many examples of beautiful, publication-quality plots. Let’s take a final look and put all the pieces together.

plt_prop_unemployed_over_time +
  # Add Tufte's theme
  theme_tufte()
plt_prop_unemployed_over_time +
  theme_tufte() +
  # Add individual theme elements
  theme(
    # Turn off the legend
    legend.position = "none",
    # Turn off the axis ticks
    axis.ticks = element_blank()
  )
plt_prop_unemployed_over_time +
  theme_tufte() +
  theme(
    legend.position = "none",
    axis.ticks = element_blank(),
    # Set the axis title's text color to grey60
    axis.title = element_text(color = "grey60"),
    # Set the axis text's text color to grey60
    axis.text = element_text(color = "grey60")
  )
plt_prop_unemployed_over_time +
  theme_tufte() +
  theme(
    legend.position = "none",
    axis.ticks = element_blank(),
    axis.title = element_text(color = "grey60"),
    axis.text = element_text(color = "grey60"),
    # Set the panel gridlines major y values
    panel.grid.major.y = element_line(
      # Set the color to grey60
      color = "grey60",
      # Set the size to 0.25
      size = 0.25,
      # Set the linetype to dotted
      linetype = "dotted"
    )
  )

That will look great in a publication!

Video: Effective explanatory plots

Using geoms for explanatory plots

Let’s focus on producing beautiful and effective explanatory plots. In the next couple of exercises, you’ll create a plot that is similar to the one shown in the video using gm2007, a filtered subset of the gapminder dataset.

This type of plot will be in an info-viz style, meaning that it would be similar to something you’d see in a magazine or website for a mostly lay audience.

A scatterplot of lifeExp by country, colored by lifeExp, with points of size 4, is provided.

# Create datasets gm2007_full and gm2007
library(dplyr)      # provides %>% and filter()
library(gapminder)

gm2007_full <- gapminder %>%
  filter(year == 2007)

# Keep the 20 countries used in the plot
gm2007 <- gm2007_full %>%
  filter(country %in% c(
    "Swaziland", "Mozambique", "Zambia", "Sierra Leone", "Lesotho",
    "Angola", "Zimbabwe", "Afghanistan", "Central African Republic", "Liberia",
    "Canada", "France", "Israel", "Sweden", "Spain",
    "Australia", "Switzerland", "Iceland", "Hong Kong, China", "Japan"
  ))

# Add a geom_segment() layer
ggplot(gm2007, aes(x = lifeExp, y = country, color = lifeExp)) +
  geom_point(size = 4) +
  geom_segment(aes(xend = 30, yend = country), size = 2)

# Add a geom_text() layer
ggplot(gm2007, aes(x = lifeExp, y = country, color = lifeExp)) +
  geom_point(size = 4) +
  geom_segment(aes(xend = 30, yend = country), size = 2) +
  geom_text(aes(label = lifeExp), color = "white", size = 1.5)

# Set the color scale
library(RColorBrewer)
palette <- brewer.pal(5, "RdYlBu")[-(2:4)]

# Modify the scales
ggplot(gm2007, aes(x = lifeExp, y = country, color = lifeExp)) +
  geom_point(size = 4) +
  geom_segment(aes(xend = 30, yend = country), size = 2) +
  geom_text(aes(label = round(lifeExp,1)), color = "white", size = 1.5) +
  scale_x_continuous("", expand = c(0,0), limits = c(30, 90), position = "top") +
  scale_color_gradientn(colors = palette)

# Add a title and caption
ggplot(gm2007, aes(x = lifeExp, y = country, color = lifeExp)) +
  geom_point(size = 4) +
  geom_segment(aes(xend = 30, yend = country), size = 2) +
  geom_text(aes(label = round(lifeExp,1)), color = "white", size = 1.5) +
  scale_x_continuous("", expand = c(0,0), limits = c(30,90), position = "top") +
  scale_color_gradientn(colors = palette) +
  labs(title = "Highest and lowest life expectancies, 2007", caption = "Source: gapminder")
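
The palette line above relies on negative indexing: brewer.pal(5, "RdYlBu") returns five colors, and [-(2:4)] drops the middle three, leaving only the red and blue endpoints for scale_color_gradientn() to interpolate between. A small sketch, with all_five and endpoints as illustrative names:

```r
library(RColorBrewer)

# Five colors running from red through yellow to blue.
all_five <- brewer.pal(5, "RdYlBu")

# Negative indices remove elements 2, 3, and 4.
endpoints <- all_five[-(2:4)]

length(endpoints)                        # 2
identical(endpoints, all_five[c(1, 5)])  # TRUE
```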

Let’s continue adding to this plot in the next exercise.

Using annotate() for embellishments

In the previous exercise, we completed our basic plot. Now let’s polish it by playing with the theme and adding annotations. In this exercise, you’ll use annotate() to add text and a curve to the plot.

The following values have been calculated for you to assist with adding embellishments to the plot:

global_mean <- mean(gm2007_full$lifeExp)
x_start <- global_mean + 4
y_start <- 5.5
x_end <- global_mean
y_end <- 7.5
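
The y_start and y_end values are plain numbers even though country is categorical. ggplot2 places the levels of a discrete axis at the integer positions 1, 2, 3, and so on, so a value of 5.5 falls midway between the fifth and sixth countries. The sketch below illustrates this with a made-up three-row data frame (toy), not the gapminder data; plt_toy is likewise an illustrative name.

```r
library(ggplot2)

# Hypothetical data: three categories on the y axis.
toy <- data.frame(
  country = factor(c("A", "B", "C")),
  lifeExp = c(50, 60, 70)
)

plt_toy <- ggplot(toy, aes(x = lifeExp, y = country)) +
  geom_point() +
  annotate("text", x = 60, y = 1.5, label = "between A and B")

# Positions computed for the points: one integer per factor level.
as.numeric(layer_data(plt_toy, 1)$y)

# Position of the annotation: the numeric value is kept as given.
as.numeric(layer_data(plt_toy, 2)$y)
```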

Let’s assign our previous plot to plt_country_vs_lifeExp.

# Assign the previous plot to plt_country_vs_lifeExp
plt_country_vs_lifeExp <- ggplot(gm2007, aes(x = lifeExp, y = country, color = lifeExp)) +
  geom_point(size = 4) +
  geom_segment(aes(xend = 30, yend = country), size = 2) +
  geom_text(aes(label = round(lifeExp,1)), color = "white", size = 1.5) +
  scale_x_continuous("", expand = c(0,0), limits = c(30,90), position = "top") +
  scale_color_gradientn(colors = palette) +
  labs(title = "Highest and lowest life expectancies, 2007", caption = "Source: gapminder")
# Define the theme
plt_country_vs_lifeExp +
  theme_classic() +
  theme(axis.line.y = element_blank(),
        axis.ticks.y = element_blank(),
        axis.text = element_text(color = "black"),
        axis.title = element_blank(),
        legend.position = "none")

# Assign themes to step_1_themes
step_1_themes <- theme_classic() +
  theme(axis.line.y = element_blank(),
        axis.ticks.y = element_blank(),
        axis.text = element_text(color = "black"),
        axis.title = element_blank(),
        legend.position = "none")
# Add a vertical line
plt_country_vs_lifeExp +
  step_1_themes +
  geom_vline(xintercept = global_mean, color = "grey40", linetype = 3)

# Add text
plt_country_vs_lifeExp +
  step_1_themes +
  geom_vline(xintercept = global_mean, color = "grey40", linetype = 3) +
  annotate(
    "text",
    x = x_start, y = y_start,
    label = "The\nglobal\naverage",
    vjust = 1, size = 3, color = "grey40"
  )

# Assign annotation to step_3_annotation
step_3_annotation <- annotate(
  "text",
  x = x_start, y = y_start,
  label = "The\nglobal\naverage",
  vjust = 1, size = 3, color = "grey40"
)

# Add a curve
plt_country_vs_lifeExp +
  step_1_themes +
  geom_vline(xintercept = global_mean, color = "grey40", linetype = 3) +
  step_3_annotation +
  annotate(
    "curve",
    x = x_start, y = y_start,
    xend = x_end, yend = y_end,
    arrow = arrow(length = unit(0.2, "cm"), type = "closed"),
    color = "grey40"
  )


Module 7: Intermediate Data Visualization with ggplot2

This ggplot2 course builds on your knowledge from the introductory course to produce meaningful explanatory plots. Statistics will be calculated on the fly and you’ll see how Coordinates and Facets aid in communication. You’ll also explore details of data visualization best practices with ggplot2 to help make sure you have a sound understanding of what works and why. By the end of the course, you’ll have all the tools needed to make a custom plotting function to explore a large data set, combining statistics and excellent visuals.

1: Statistics

A picture paints a thousand words, which is why ggplot2 is such a powerful tool for graphical data analysis. In this chapter, you’ll progress from simply plotting data to applying a variety of statistical methods. These include linear models, descriptive and inferential statistics (mean, standard deviation, and confidence intervals), and custom functions.

Video: Stats with geoms

Smoothing