In Introduction to R, you will master the basics of this widely used open source language, including factors, lists, and data frames. With the knowledge gained in this course, you will be ready to undertake your first data analysis of your own. Oracle estimated over 2 million R users worldwide in 2012, cementing R as a leading programming language in statistics and data science. Every year, the number of R users grows by about 40%, and an increasing number of organizations are using it in their day-to-day activities. Begin your journey to learn R with us today!
Take your first steps with R. In this chapter, you will learn how to use the console as a calculator and how to assign variables. You will also get to know the basic data types in R. Let’s get started.
In the editor on the right, you should type R code to solve the exercises. When you hit the ‘Submit Answer’ button, every line of code is interpreted and executed by R, and you get a message telling you whether or not your code was correct. The output of your R code is shown in the console in the lower right corner.
R makes use of the # sign to add comments, so that you and others can understand what the R code is about. Just like Twitter! Comments are not run as R code, so they will not influence your result. For example, # Calculate 3 + 4 in the editor on the right is a comment.
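For instance, here is a minimal sketch of what that looks like in a script; the comment line is ignored and only the calculation produces output:

# Calculate 3 + 4
3 + 4
## [1] 7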
You can also execute R commands straight in the console. This is a good way to experiment with R code, as your submission is not checked for correctness.
3 + 4
## [1] 7
6 + 12
## [1] 18
See how the console shows the result of the R code you submitted? Now that you’re familiar with the interface, let’s get down to R business!
In its most basic form, R can be used as a simple calculator. Consider the following arithmetic operators:
Addition: +
Subtraction: -
Multiplication: *
Division: /
Exponentiation: ^
Modulo: %%
The last two might need some explaining:
The ^ operator raises the number to its left to the power of the number to its right: for example, 3^2 is 9. The modulo operator %% returns the remainder of the division of the number to its left by the number to its right: for example, 5 modulo 3, or 5 %% 3, is 2. With this knowledge, follow the instructions below to complete the exercise.
5 + 5
## [1] 10
5 - 5
## [1] 0
3 * 5
## [1] 15
(5 + 5) / 2
## [1] 5
2^5
## [1] 32
28 %% 6
## [1] 4
A basic concept in (statistical) programming is called a variable.
A variable allows you to store a value (e.g. 4) or an object (e.g. a function description) in R. You can then later use this variable’s name to easily access the value or the object that is stored within this variable.
You can assign the value 4 to a variable my_var with the command
my_var <- 4
x <- 42
x
## [1] 42
Have you noticed that R does not print the value of a variable to the console when you do the assignment? x <- 42 did not generate any output, because R assumes that you will be needing this variable in the future. Otherwise you wouldn’t have stored the value in a variable in the first place, right? Proceed to the next exercise!
Suppose you have a fruit basket with five apples. As a data analyst in training, you want to store the number of apples in a variable with the name my_apples.
my_apples <- 5
my_apples
## [1] 5
Every tasty fruit basket needs oranges, so you decide to add six oranges. As a data analyst, your reflex is to immediately create the variable my_oranges and assign the value 6 to it. Next, you want to calculate how many pieces of fruit you have in total. Since you have given meaningful names to these values, you can now code this in a clear way:
my_oranges <- 6
my_apples + my_oranges
## [1] 11
my_fruit <- my_apples + my_oranges
Nice one! The great advantage of doing calculations with variables is reusability. If you just change my_apples to equal 12 instead of 5 and rerun the script, my_fruit will automatically update as well. Continue to the next exercise.
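If you want to see this reusability in action, here is a quick sketch reusing the fruit variables from this exercise, with my_apples changed to 12:

my_apples <- 12
my_oranges <- 6
my_fruit <- my_apples + my_oranges
my_fruit
## [1] 18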
Common knowledge tells you not to add apples and oranges. But hey, that is exactly what you just did, no? :-) The my_apples and my_oranges variables both contained a number in the previous exercise. The + operator works with numeric variables in R. If you had really tried to add “apples” and “oranges”, say by assigning a text value to the variable my_oranges, you would have been trying to add a numeric and a character variable and assign the result to my_fruit. This is not possible.
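If you are curious what that failure looks like, here is a small experiment you could run in the console (the failing line is shown as a comment so the sketch runs cleanly; the exact error text may vary slightly between R versions):

my_apples <- 5
my_oranges <- "six"
# my_apples + my_oranges
# Error in my_apples + my_oranges : non-numeric argument to binary operator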
R works with numerous data types. Some of the most basic types to get started are:
Decimal values like 4.5 are examples of the data type numeric. Whole numbers like 3, 4, 0 and -4 are called integers. Integers are also examples of the numeric data type. Boolean values (TRUE or FALSE) are called logical. Text (or string) values are examples of the data type character. Note how the quotation marks indicate that “the text inside quotation marks” is a character.
my_numeric <- 42
my_character <- "universe"
my_logical <- FALSE
If you set the variable my_apples to have the value 5 and the variable my_oranges to have the value “six”, attempting to add them will give an error. This is due to a mismatch in data types: you can add two (or more) numerics together, but you can’t add a numeric and a character together. Avoid such embarrassing situations by checking the data type of a variable beforehand. You can do this with the class() function, as the code below shows.
class(my_numeric)
## [1] "numeric"
class(my_character)
## [1] "character"
class(my_logical)
## [1] "logical"
This was the last exercise for this chapter. Head over to the next chapter to get immersed in the world of vectors!
We take you on a trip to Vegas, where you will learn how to analyze your gambling results using vectors in R. After completing this chapter, you will be able to create vectors in R, name them, select elements from them, and compare different vectors.
Feeling lucky? You’d better be, because this chapter takes you on a trip to the City of Sins, also known as Statisticians’ Paradise!
Thanks to R and your new data-analytical skills, you will learn how to uplift your performance at the tables and fire off your career as a professional gambler. This chapter will show how you can easily keep track of your betting progress and how you can do some simple analyses on past actions. Next stop, Vegas Baby… VEGAS!!
vegas <- "Go!"
Let us focus first!
On your way from rags to riches, you will make extensive use of vectors. Vectors are one-dimensional arrays that can hold numeric data, character data, or logical data. In other words, a vector is a simple tool to store data. For example, you can store your daily gains and losses in the casinos.
In R, you create a vector with the combine function c(). You place the vector elements separated by a comma between the parentheses. For example:
numeric_vector <- c(1, 2, 3)
character_vector <- c("a", "b", "c")
Once you have created these vectors in R, you can use them to do calculations.
numeric_vector <- c(1, 10, 49)
character_vector <- c("a", "b", "c")
boolean_vector <- c(TRUE, FALSE, TRUE)
Notice that adding a space after the commas in the c() function improves the readability of your code, but changes nothing else. Let’s practice some more with vector creation in the next exercise.
poker_vector <- c(140, -50, 20, -120, 240)
roulette_vector <- c(-24, -50, 100, -350, 10)
To check out the contents of your vectors, remember that you can always simply type the variable in the console and hit Enter.
As a data analyst, it is important to have a clear view on the data that you are using. Understanding what each element refers to is therefore essential.
In the previous exercise, we created a vector with your winnings over the week. Each vector element refers to a day of the week but it is hard to tell which element belongs to which day. It would be nice if you could show that in the vector itself.
You can give a name to the elements of a vector with the names() function. Have a look at this example:
some_vector <- c("John Doe", "poker player") names(some_vector) <- c("Name", "Profession")
This code first creates a vector some_vector and then gives the two elements a name. The first element is assigned the name Name, while the second element is labeled Profession. Printing the contents to the console yields the following output:
Name Profession
"John Doe" "poker player"
names(poker_vector) <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
names(roulette_vector) <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
If you want to become a good statistician, you have to become lazy. (If you are already lazy, chances are high you are one of those exceptional, natural-born statistical talents.)
In the previous exercises you probably experienced that it is boring and frustrating to type and retype information such as the days of the week. However, when you look at it from a higher perspective, there is a more efficient way to do this, namely, to assign the days of the week vector to a variable!
Just like you did with your poker and roulette returns, you can also create a variable that contains the days of the week. This way you can use and re-use it.
days_vector <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
names(roulette_vector) <- days_vector
names(poker_vector) <- days_vector
A word of advice: try to avoid code duplication at all times. Continue to the next exercise and learn how to do arithmetic with vectors!
Now that you have the poker and roulette winnings nicely as named vectors, you can start doing some data analytical magic.
You want to find out the following type of information:
How much has been your overall profit or loss per day of the week?
Have you lost money over the week in total?
Are you winning/losing money on poker or on roulette?
To get the answers, you have to do arithmetic calculations on vectors.
It is important to know that if you sum two vectors in R, it takes the element-wise sum. For example, the following three statements are completely equivalent:
c(1, 2, 3) + c(4, 5, 6)
c(1 + 4, 2 + 5, 3 + 6)
c(5, 7, 9)
You can also do the calculations with variables that represent vectors:
a <- c(1, 2, 3)
b <- c(4, 5, 6)
c <- a + b
A_vector <- c(1, 2, 3)
B_vector <- c(4, 5, 6)
total_vector <- A_vector + B_vector
total_vector
## [1] 5 7 9
Now that you understand how R does arithmetic with vectors, it is time to get those Ferraris in your garage! First, you need to understand what the overall profit or loss per day of the week was. The total daily profit is the sum of the profit/loss you made on poker per day, and the profit/loss you made on roulette per day.
In R, this is just the sum of roulette_vector and poker_vector.
total_daily <- poker_vector + roulette_vector
total_daily
## Monday Tuesday Wednesday Thursday Friday
## 116 -100 120 -470 250
Based on the previous analysis, it looks like you had a mix of good and bad days. This is not what your ego expected, and you wonder whether there may be a very tiny chance you have lost money over the week in total.
A function that helps you to answer this question is sum(). It calculates the sum of all elements of a vector. For example, to calculate the total amount of money you have lost/won with poker you do:
total_poker <- sum(poker_vector)
total_poker <- sum(poker_vector)
total_poker
## [1] 230
total_roulette <- sum(roulette_vector)
total_roulette
## [1] -314
total_week <- total_poker + total_roulette
total_week
## [1] -84
Oops, it seems like you are losing money. Time to rethink and adapt your strategy! This will require some deeper analysis…
After a short brainstorm in your hotel’s jacuzzi, you realize that a possible explanation might be that your skills in roulette are not as well developed as your skills in poker. So maybe your total gains in poker are higher ( > ) than your total gains in roulette.
total_poker > total_roulette
## [1] TRUE
Your hunch seemed to be right. It appears that the poker game is more your cup of tea than roulette.
Another possible route for investigation is your performance at the beginning of the working week compared to the end of it. You did have a couple of Margarita cocktails at the end of the week…
To answer that question, you only want to focus on a selection of the poker_vector. In other words, your goal is to select specific elements of the vector. To select elements of a vector (and later matrices, data frames, …), you can use square brackets. Between the square brackets, you indicate what elements to select. For example, to select the first element of the poker vector, you type poker_vector[1]. To select the second element of the vector, you type poker_vector[2], etc. Notice that the first element in a vector in R has index 1, not 0 as in many other programming languages.
poker_wednesday <- poker_vector[3]
R also makes it possible to select multiple elements from a vector at once. Learn how in the next exercise!
How about analyzing your midweek results?
To select multiple elements from a vector, you can add square brackets at the end of it. You can indicate between the brackets what elements should be selected. For example: suppose you want to select the first and the fifth day of the week: use the vector c(1, 5) between the square brackets. For example, the code below selects the first and fifth element of poker_vector:
poker_vector[c(1, 5)]
poker_midweek <- poker_vector[c(2, 3, 4)]
Continue to the next exercise to specialize in vector selection some more!
Selecting multiple elements of poker_vector with c(2, 3, 4) is not very convenient. Many statisticians are lazy people by nature, so they created an easier way to do this: c(2, 3, 4) can be abbreviated to 2:4, which generates a vector with all natural numbers from 2 up to and including 4.
So, another way to find the mid-week results is poker_vector[2:4].
roulette_selection_vector <- roulette_vector[2:5]
The colon operator is extremely useful and very often used in R programming, so remember it well.
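A quick check in the console confirms that the colon operator really generates the full sequence, and that poker_vector[2:4] therefore picks out exactly the mid-week results:

2:4
## [1] 2 3 4
poker_vector[2:4]
##   Tuesday Wednesday  Thursday
##       -50        20      -120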
Another way to tackle the previous exercise is by using the names of the vector elements (Monday, Tuesday, …) instead of their numeric positions. For example,
poker_vector["Monday"]
will select the first element of poker_vector because “Monday” is the name of that first element.
Just like you did in the previous exercise with numerics, you can also use the element names to select multiple elements, for example:
poker_vector[c("Monday","Tuesday")]
poker_start <- poker_vector[c("Monday", "Tuesday","Wednesday")]
mean(poker_start)
## [1] 36.66667
Good job! Apart from subsetting vectors by index or by name, you can also subset vectors by comparison. The next exercises will show you how!
By making use of comparison operators, we can approach the previous question in a more proactive way.
The (logical) comparison operators known to R are:
< for less than
> for greater than
<= for less than or equal to
>= for greater than or equal to
== for equal to each other
!= for not equal to each other
As seen in the previous chapter, stating 6 > 5 returns TRUE. The nice thing about R is that you can use these comparison operators on vectors as well. For example:
c(4, 5, 6) > 5
## [1] FALSE FALSE TRUE
This command tests for every element of the vector if the condition stated by the comparison operator is TRUE or FALSE.
poker_vector > 0
## Monday Tuesday Wednesday Thursday Friday
## TRUE FALSE TRUE FALSE TRUE
selection_vector <- poker_vector > 0
selection_vector
## Monday Tuesday Wednesday Thursday Friday
## TRUE FALSE TRUE FALSE TRUE
Working with comparisons will make your data analytical life easier. Instead of selecting a subset of days to investigate yourself (like before), you can simply ask R to return only those days where you realized a positive return for poker.
In the previous exercises you used selection_vector <- poker_vector > 0 to find the days on which you had a positive poker return. Now, you would like to know not only the days on which you won, but also how much you won on those days.
You can select the desired elements, by putting selection_vector between the square brackets that follow poker_vector:
poker_vector[selection_vector]
R knows what to do when you pass a logical vector in square brackets: it will only select the elements that correspond to TRUE in selection_vector.
poker_winning_days <- poker_vector[selection_vector]
poker_winning_days
## Monday Wednesday Friday
## 140 20 240
We printed out selection_vector and poker_winning_days along the way, just to be sure we were doing it right.
Just like you did for poker, you also want to know those days where you realized a positive return for roulette.
selection_vector <- roulette_vector > 0
Notice that we’re reusing the name selection_vector here, overwriting the value it held in the poker exercise.
roulette_winning_days <- roulette_vector[selection_vector]
This exercise concludes the chapter on vectors. The next chapter will introduce you to the two-dimensional version of vectors: matrices.
In this chapter, you will learn how to work with matrices in R. By the end of the chapter, you will be able to create matrices and understand how to do basic computations with them. You will analyze the box office numbers of the Star Wars movies and learn how to use matrices in R. May the force be with you!
In R, a matrix is a collection of elements arranged into a fixed number of rows and columns. Since you are only working with rows and columns, a matrix is called two-dimensional.
You can construct a matrix in R with the matrix() function. Consider the following example:
matrix(1:9, byrow = TRUE, nrow = 3)
In the matrix() function:
1:9 is, as we’ve seen before, a shortcut for c(1, 2, 3, 4, 5, 6, 7, 8, 9).
byrow indicates that the matrix is filled by the rows. If we want the matrix to be filled by the columns, we just place byrow = FALSE.
nrow indicates that the matrix should have three rows.
matrix(1:9, byrow = TRUE, nrow = 3)
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
## [3,] 7 8 9
What happens, do you think, if you change TRUE in the above code to FALSE? Try it and see.
It is now time to get your hands dirty. In the following exercises you will analyze the box office numbers of the Star Wars franchise. May the force be with you!
In the editor, three vectors are defined. Each one represents the box office numbers for one of the first three Star Wars movies. The first element of each vector indicates the US box office revenue, the second element refers to the non-US box office (source: Wikipedia).
In this exercise, you’ll combine all these figures into a single vector. Next, you’ll build a matrix from this vector.
new_hope <- c(460.998, 314.4)
empire_strikes <- c(290.475, 247.900)
return_jedi <- c(309.306, 165.8)
In other words, three vectors, all of length 2, containing numerics.
box_office <- c(new_hope, empire_strikes, return_jedi)
In other words, one vector, called box_office, of length 6.
box_office
## [1] 460.998 314.400 290.475 247.900 309.306 165.800
star_wars_matrix <- matrix(box_office, byrow = TRUE, nrow = 3)
In other words, a matrix, called star_wars_matrix, with 3 rows, which has been filled, by row, with the 6 elements of the vector box_office. This forces there to be two elements per row.
What happens if you try to fill a matrix with 3 rows using a vector containing 5 elements, or 7 elements, or 8 elements? Try it and see.
You’ll see that to fill a matrix containing n rows, you need a multiple of n elements. Draw a picture of a matrix to help you understand why.
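If you want to peek at the answer, here is roughly what happens with 5 elements and 3 rows: R still builds a matrix, but it warns you and recycles the vector to fill the remaining cells (a sketch; the exact warning text may vary between R versions):

matrix(1:5, nrow = 3)
## Warning: data length [5] is not a sub-multiple or multiple of the number of rows [3]
##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    1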
The force is actually with you!
To help you remember what is stored in star_wars_matrix, you would like to add the names of the movies for the rows. Not only does this help you to read the data, but it is also useful to select certain elements from the matrix.
Previously we used the function names() to name the elements of a vector. As we’re dealing with two-dimensional arrays now, we need to be a little more specific. Similar to vectors, you can add names for the rows and the columns of a matrix:
rownames(my_matrix) <- row_names_vector
colnames(my_matrix) <- col_names_vector
Below we’ll create two new vectors for use in this chapter: region and titles. You will use these vectors to name the columns and rows of star_wars_matrix, respectively.
region <- c("US", "non-US")
titles <- c("A New Hope", "The Empire Strikes Back", "Return of the Jedi")
colnames(star_wars_matrix) <- region
rownames(star_wars_matrix) <- titles
star_wars_matrix
## US non-US
## A New Hope 460.998 314.4
## The Empire Strikes Back 290.475 247.9
## Return of the Jedi 309.306 165.8
How large is this matrix? Is it still 3 rows and 2 columns? Yes, despite it looking larger now that we’ve added names to it, this matrix is still a 3 x 2 matrix. Check this using the dimension function dim().
dim(star_wars_matrix)
## [1] 3 2
Notice it gives you the number of rows first, then the number of columns.
How, using the vector box_office, would you create a matrix with 2 columns and 3 rows? We might look at this in one of our challenges.
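One possible approach, if you want something to compare your own attempt against later (just a sketch, and star_wars_matrix_alt is purely an illustrative name): describe the matrix by its number of columns instead of its number of rows.

star_wars_matrix_alt <- matrix(box_office, ncol = 2, byrow = TRUE)
star_wars_matrix_alt
##         [,1]  [,2]
## [1,] 460.998 314.4
## [2,] 290.475 247.9
## [3,] 309.306 165.8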
The single most important thing for a movie in order to become an instant legend in Tinseltown is its worldwide box office figures.
To calculate the total box office revenue for the three Star Wars movies, you have to take the sum of the US revenue column and the non-US revenue column.
In R, when dealing with matrices, the function rowSums() conveniently calculates the totals for each row of a matrix. This function creates a new vector:
rowSums(my_matrix)
which we can assign to a variable.
worldwide_vector <- rowSums(star_wars_matrix)
worldwide_vector
## A New Hope The Empire Strikes Back Return of the Jedi
## 775.398 538.375 475.106
What does this vector tell you exactly?
What about the vector colSums(star_wars_matrix)? What will this vector tell us? Have a thought then give this vector a name.
us_and_abroad_vector <- colSums(star_wars_matrix)
Hopefully you’re starting to understand (if you didn’t already) that when we refer to matrices, we always refer to their rows first, then columns. For example, a 3 x 2 matrix is one with 3 rows and 2 columns.
This is true also when we build a matrix from scratch. Below is code which builds the same matrix we’ve been building over the past few exercises:
box_office <- c(460.998, 314.4, 290.475, 247.900, 309.306, 165.8)
star_wars_matrix <- matrix(box_office, nrow = 3, byrow = TRUE,
dimnames = list(c("A New Hope", "The Empire Strikes Back", "Return of the Jedi"),
c("US", "non-US")))
Notice how, rather than use the functions rownames() and colnames(), we used the dimnames argument to name the rows and columns. We wrote dimnames = list(c("row", "names"), c("column", "names")).
We also made use of the list function, which we’ll learn to use fully in a later chapter in this module.
In a previous exercise you calculated the vector that contained the worldwide box office receipts for each of the three Star Wars movies. However, this vector is not yet part of star_wars_matrix.
You can add a column or multiple columns to a matrix with the cbind() function, which merges matrices and/or vectors together by column. For example:
big_matrix <- cbind(matrix1, matrix2, vector1 ...)
Let’s add the worldwide totals to the matrix.
all_wars_matrix <- cbind(star_wars_matrix, worldwide_vector)
Obviously, this step will only work if the objects you combine have the same number of rows.
After adding this column to our matrix, the logical next step is to add rows. Learn how in the next exercise…
For this exercise, we’ll use our existing matrix star_wars_matrix, which covered the original trilogy of movies, and a second matrix star_wars_matrix2, also 3 x 2, which will contain the same data for the prequel trilogy.
By the way, if you ever want to check out the contents of the workspace you’re working in, you can type ls() in the console.
box_office <- c(474.5, 552.5, 310.7, 338.7, 380.3, 468.5)
star_wars_matrix2 <- matrix(box_office, nrow = 3, byrow = TRUE,
dimnames = list(c("The Phantom Menace", "Attack of the Clones", "Revenge of the Sith"),
c("US", "non-US")))
all_wars_matrix <- rbind(star_wars_matrix, star_wars_matrix2)
Again, we had to be sure (which we were) that the two matrices contained the same number of columns in order to merge them successfully.
Would you use colSums() or rowSums() on the matrix all_wars_matrix to calculate the total box office revenue for the entire saga in the US versus abroad?
That’s right! It’s colSums()!
total_revenue_vector <- colSums(all_wars_matrix)
Head over to the next exercise to learn about matrix subsetting.
Similar to vectors, you can use the square brackets [ ] to select one or multiple elements from a matrix. Whereas vectors have one dimension, matrices have two dimensions. You should therefore use a comma to separate the rows you want to select from the columns. Notice again that rows go first, then columns. For example:
my_matrix[1,2] selects the element at the first row and second column.
my_matrix[1:3,2:4] results in a matrix with the data on the rows 1, 2, 3 and columns 2, 3, 4.
If you want to select all elements of a row or a column, no number is needed before or after the comma, respectively:
my_matrix[,1] selects all elements of the first column.
my_matrix[1,] selects all elements of the first row.
What will my_matrix[1] select? Does it even select anything?
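If you can’t wait to try it, here is a hint of what happens: with a single index and no comma, R treats the matrix as one long vector stored column by column, so you get back a single element (using all_wars_matrix from the previous exercises):

all_wars_matrix[1]
## [1] 460.998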
Back to Star Wars with this newly acquired knowledge!
non_us_all <- all_wars_matrix[,2]
mean(non_us_all)
## [1] 347.9667
non_us_some <- all_wars_matrix[1:2,2]
mean(non_us_some)
## [1] 281.15
Similar to what you have learned with vectors, the standard operators like +, -, /, *, etc. work in an element-wise way on matrices in R.
For example, 2 * my_matrix multiplies each element of my_matrix by two.
As a newly-hired data analyst for Lucasfilm, it is your job to find out how many visitors went to each movie for each geographical area. You already have the total revenue figures (in the matrix all_wars_matrix). Assume that the price of a ticket was 5 dollars. Simply dividing the box office numbers by this ticket price gives you the number of visitors.
visitors <- all_wars_matrix / 5
visitors
## US non-US
## A New Hope 92.1996 62.88
## The Empire Strikes Back 58.0950 49.58
## Return of the Jedi 61.8612 33.16
## The Phantom Menace 94.9000 110.50
## Attack of the Clones 62.1400 67.74
## Revenge of the Sith 76.0600 93.70
What do these results tell you? A staggering 92 million people went to see A New Hope in US theaters!
After looking at the result of the previous exercise, big boss Lucas points out that the ticket prices went up over time. He asks you to redo the analysis based on the prices you can find in ticket_prices_matrix (source: imagination).
Those who are familiar with matrices should note that this is not standard matrix multiplication, for which you should use %*% in R.
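To make the distinction concrete, here is a tiny side-by-side on a small example matrix (purely illustrative, unrelated to the Star Wars data):

m <- matrix(1:4, nrow = 2)
m * m    # element-wise: each entry multiplied by itself
##      [,1] [,2]
## [1,]    1    9
## [2,]    4   16
m %*% m  # true matrix multiplication: rows times columns
##      [,1] [,2]
## [1,]    7   15
## [2,]   10   22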
ticket_prices <- c(5.0, 5.0, 6.0, 6.0, 7.0, 7.0, 4.0, 4.0, 4.5, 4.5, 4.9, 4.9)
ticket_prices_matrix <- matrix(ticket_prices, nrow = 6, byrow = TRUE,
dimnames = list(c("A New Hope", "The Empire Strikes Back", "Return of the Jedi","The Phantom Menace", "Attack of the Clones", "Revenge of the Sith"),
c("US", "non-US")))
visitors <- all_wars_matrix / ticket_prices_matrix
us_visitors <- visitors[,1]
mean(us_visitors)
## [1] 75.01339
This exercise concludes the chapter on matrices. Next stop on your journey through the R language: factors.
Data often falls into a limited number of categories. For example, human hair color can be categorized as black, brown, blond, red, grey, or white—and perhaps a few more options for people who color their hair. In R, categorical data is stored in factors. Factors are very important in data analysis, so start learning how to create, subset, and compare them now.
In this chapter you dive into the wonderful world of factors.
The term factor refers to a statistical data type used to store categorical variables. The difference between a categorical variable and a continuous variable is that a categorical variable can belong to a limited number of categories. A continuous variable, on the other hand, can correspond to an infinite number of values.
It is important that R knows whether it is dealing with a continuous or a categorical variable, as the statistical models you will develop in the future treat both types differently. (You will see later why this is the case.)
A good example of a categorical variable is sex. In many circumstances you can limit the sex categories to “Male” or “Female”. (Sometimes you may need different categories. For example, you may need to consider chromosomal variation, hermaphroditic animals, or different cultural norms, but you will always have a finite number of categories.)
theory <- "factors used for categorical variables"
To create factors in R, you make use of the function factor(). The first thing you have to do is create a vector that contains all the observations that belong to a limited number of categories. For example, sex_vector contains the sex of 5 different individuals:
sex_vector <- c("Male","Female","Female","Male","Male")
It is clear that there are two categories, or in R-terms factor levels, at work here: “Male” and “Female”.
The function factor() will encode the vector as a factor:
factor_sex_vector <- factor(sex_vector)
sex_vector <- c("Male", "Female", "Female", "Male", "Male")
class(sex_vector[2])
## [1] "character"
class(sex_vector)
## [1] "character"
So currently R does not know that the values in sex_vector are (or should be treated as) categories.
factor_sex_vector <- factor(sex_vector)
factor_sex_vector
## [1] Male Female Female Male Male
## Levels: Female Male
If you want to find out more about the factor() function, do not hesitate to type ?factor in the console. This will open up a help page. Continue to the next exercise.
There are two types of categorical variables: a nominal categorical variable and an ordinal categorical variable.
A nominal variable is a categorical variable without an implied order. This means that it is impossible to say that ‘one is worth more than the other’. For example, think of the categorical variable animals_vector with the categories “Elephant”, “Giraffe”, “Donkey” and “Horse”. Here, it is impossible to say that one stands above or below the other. (Note that some of you might disagree ;-) ).
In contrast, ordinal variables do have a natural ordering. Consider for example the categorical variable temperature_vector with the categories: “Low”, “Medium” and “High”. Here it is obvious that “Medium” stands above “Low”, and “High” stands above “Medium”.
animals_vector <- c("Elephant", "Giraffe", "Donkey", "Horse")
factor_animals_vector <- factor(animals_vector)
factor_animals_vector
## [1] Elephant Giraffe Donkey Horse
## Levels: Donkey Elephant Giraffe Horse
temperature_vector <- c("High", "Low", "High","Low", "Medium")
factor_temperature_vector <- factor(temperature_vector, order = TRUE, levels = c("Low", "Medium", "High"))
factor_temperature_vector
## [1] High Low High Low Medium
## Levels: Low < Medium < High
Can you already tell what’s happening in this exercise? Awesome! Continue to the next exercise and get into the details of factor levels.
When you first get a data set, you will often notice that it contains factors with specific factor levels. However, sometimes you will want to change the names of these levels for clarity or other reasons. R allows you to do this with the function levels():
levels(factor_vector) <- c("name1", "name2",...)
A good illustration is the raw data that is provided to you by a survey. A common question for every questionnaire is the sex of the respondent. Here, for simplicity, just two categories were recorded, “M” and “F”. (You usually need more categories for survey data; either way, you use a factor to store the categorical data.)
survey_vector <- c("M", "F", "F", "M", "M")
Recording the sex with the abbreviations “M” and “F” can be convenient if you are collecting data with pen and paper, but it can introduce confusion when analyzing the data. At that point, you will often want to change the factor levels to “Male” and “Female” instead of “M” and “F” for clarity.
Watch out: the order in which you assign the levels is important. If you type levels(factor_survey_vector), you’ll see that it outputs [1] “F” “M”. If you don’t specify the levels of the factor when creating it, R will automatically assign them alphabetically. To correctly map “F” to “Female” and “M” to “Male”, the levels should be set to c("Female", "Male"), in this order.
survey_vector <- c("M", "F", "F", "M", "M")
factor_survey_vector <- factor(survey_vector)
levels(factor_survey_vector)
## [1] "F" "M"
levels(factor_survey_vector) <- c("Female", "Male")
After finishing this course, one of your favorite functions in R will be summary(). This will give you a quick overview of the contents of a variable:
summary(my_var)
Going back to our survey, you would like to know how many “Male” responses you have in your study, and how many “Female” responses. The summary() function gives you the answer to this question.
summary(survey_vector)
## Length Class Mode
## 5 character character
summary(factor_survey_vector)
## Female Male
## 2 3
Have a look at the output. The fact that you identified “Male” and “Female” as factor levels in factor_survey_vector enables R to show the number of elements for each category.
You might wonder what happens when you try to compare elements of a factor. In factor_survey_vector you have a factor with two levels: “Male” and “Female”. But how does R value these relative to each other?
male <- factor_survey_vector[1]
female <- factor_survey_vector[2]
male > female
## Warning in Ops.factor(male, female): '>' not meaningful for factors
## [1] NA
By default, R returns NA when you try to compare values in a factor, since the idea doesn’t make sense. Next you’ll learn about ordered factors, where more meaningful comparisons are possible.
Since “Male” and “Female” are unordered (or nominal) factor levels, R returns a warning message telling you that the greater-than operator is not meaningful. For such factors, R treats the levels as equivalent, with no order between them.
But this is not always the case! Sometimes you will deal with factors that do have a natural ordering between their categories. If this is the case, you have to make sure that you pass this information to R…
Let us say that you are leading a research team of five data analysts and that you want to evaluate their performance. To do this, you track their speed, evaluate each analyst as “slow”, “medium” or “fast”, and save the results in speed_vector.
speed_vector <- c("medium", "slow", "slow","medium", "fast")
speed_vector should be converted to an ordinal factor since its categories have a natural ordering. By default, the function factor() transforms speed_vector into an unordered factor. To create an ordered factor, you have to add two additional arguments: ordered and levels.
factor(some_vector,
       ordered = TRUE,
       levels = c("lev1", "lev2", ...))
By setting the argument ordered to TRUE in the function factor(), you indicate that the factor is ordered. With the argument levels you give the values of the factor in the correct order.
factor_speed_vector <- factor(speed_vector,
ordered = TRUE,
levels = c("slow", "medium", "fast"))
factor_speed_vector
## [1] medium slow slow medium fast
## Levels: slow < medium < fast
Have a look at the console. The output now indicates, with the < sign, that the levels indeed have an order associated with them. Continue to the next exercise.
Having a bad day at work, ‘data analyst number two’ enters your office and starts complaining that ‘data analyst number five’ is slowing down the entire project. Since you know that ‘data analyst number two’ has the reputation of being a smarty-pants, you first decide to check if his statement is true.
The fact that factor_speed_vector is now ordered enables us to compare different elements (the data analysts in this case). You can simply do this by using the well-known operators.
da2 <- factor_speed_vector[2]
da5 <- factor_speed_vector[5]
da2 > da5
## [1] FALSE
What do the results tell you? Data analyst two is complaining about data analyst five, while in fact they are the one slowing everything down! This concludes the chapter on factors. With a solid basis in vectors, matrices and factors, you’re ready to dive into the wonderful world of data frames, a very important data structure in R!
Most datasets you will be working with will be stored as data frames. By the end of this chapter, you will be able to create a data frame, select interesting parts of a data frame, and order a data frame according to certain variables.
You may remember from the chapter about matrices that all the elements that you put in a matrix should be of the same type. Back then, your data set on Star Wars only contained numeric elements.
When doing a market research survey, however, you often have questions such as:
‘Are you married?’ or ‘yes/no’ questions (logical)
‘How old are you?’ (numeric)
‘What is your opinion on this product?’ or other ‘open-ended’ questions (character)
…
The output, namely the respondents’ answers to the questions formulated above, is a data set of different data types. You will often find yourself working with data sets that contain different data types instead of only one.
A data frame has the variables of a data set as columns and the observations as rows. This will be a familiar concept for those coming from different statistical software packages such as SAS or SPSS.
mtcars
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
Wow, that is a lot of cars!
Working with large data sets is not uncommon in data analysis. When you work with (extremely) large data sets and data frames, your first task as a data analyst is to develop a clear understanding of their structure and main elements. Therefore, it is often useful to show only a small part of the entire data set.
So how to do this in R? Well, the function head() enables you to show the first observations of a data frame. Similarly, the function tail() prints out the last observations in your data set.
Both head() and tail() print a top line called the ‘header’, which contains the names of the different variables in your data set.
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
tail(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.7 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.5 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.5 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.6 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.6 1 1 4 2
So, what do we have in this data set? For example, hp represents the car’s horsepower; the Datsun has the lowest horsepower of the 6 cars that are displayed. For a full overview of the variables’ meaning, type ?mtcars in the console and read the help page.
Having used ?mtcars to view an explanation of the variables represented in this dataframe, what does the variable am refer to? And hp?
Another method that is often used to get a rapid overview of your data is the function str(). The function str() shows you the structure of your data set. For a data frame it tells you:
The total number of observations (e.g. 32 car types)
The total number of variables (e.g. 11 car features)
A full list of the variables’ names
The data type of each variable
The first observations
Applying the str() function will often be the first thing that you do when receiving a new data set or data frame. It is a great way to get more insight into your data set before diving into the real analysis.
mtcars
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
Since using built-in data sets is not even half the fun of creating your own data sets, the rest of this chapter is based on your personally developed data set. Put your jet pack on because it is time for some space exploration!
As a first goal, you want to construct a data frame that describes the main characteristics of eight planets in our solar system. According to your good friend Buzz, the main features of a planet are:
The name of the planet
The type of planet (terrestrial planet or gas giant)
The planet’s diameter relative to the diameter of Earth
The planet’s rotation period relative to that of Earth
Whether or not the planet has rings (TRUE or FALSE)
After doing some high-quality research on Wikipedia, you feel confident enough to create the necessary vectors: name, type, diameter, rotation and rings.
name <- c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune")
type <- c("Terrestrial planet", "Terrestrial planet", "Terrestrial planet", "Terrestrial planet", "Gas giant", "Gas giant", "Gas giant", "Gas giant")
diameter <- c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883)
rotation <- c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67)
rings <- c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)
The first element in each of these vectors corresponds to the first observation.
Next we construct a data frame with the data.frame() function. As arguments, you pass the vectors from before: they will become the different columns of your data frame. Because every column has the same length, the vectors you pass should also have the same length. But don’t forget that it is possible (and likely) that they contain different types of data.
planets_df <- data.frame(name, type, diameter, rotation, rings)
The logical next step, as you know by now, is inspecting the data frame you just created.
The planets_df data frame we just created should have 8 observations and 5 variables. Let’s double check!
str(planets_df)
## 'data.frame': 8 obs. of 5 variables:
## $ name : Factor w/ 8 levels "Earth","Jupiter",..: 4 8 1 3 2 6 7 5
## $ type : Factor w/ 2 levels "Gas giant","Terrestrial planet": 2 2 2 2 1 1 1 1
## $ diameter: num 0.382 0.949 1 0.532 11.209 ...
## $ rotation: num 58.64 -243.02 1 1.03 0.41 ...
## $ rings : logi FALSE FALSE FALSE FALSE TRUE TRUE ...
Now that you have a clear understanding of the planets_df data set, it’s time to see how you can select elements from it.
Similar to vectors and matrices, you select elements from a data frame with the help of square brackets [ ]. By using a comma, you can indicate what to select from the rows and the columns respectively. For example:
my_df[1,2] selects the value at the first row and second column in my_df.
my_df[1:3,2:4] selects the values that appear in columns 2, 3, 4 of rows 1, 2, 3 in my_df.
my_df[1, ] selects all elements of the first row.
my_df[, 4] selects all elements of the fourth column.
planets_df[1,3]
## [1] 0.382
planets_df[4,]
## name type diameter rotation rings
## 4 Mars Terrestrial planet 0.532 1.03 FALSE
Apart from selecting elements from your data frame by index, you can also use the column names.
Instead of using numerics to select elements of a data frame, you can also use the variable names to select columns of a data frame.
Suppose you want to select the first three elements of the type column. One way to do this is
planets_df[1:3,2]
A possible disadvantage of this approach is that you have to know (or look up) the column number of type, which gets hard if you have a lot of variables. It is often easier to just make use of the variable name:
planets_df[1:3,"type"]
planets_df[1:5, "diameter"]
## [1] 0.382 0.949 1.000 0.532 11.209
planets_df[, "diameter"]
## [1] 0.382 0.949 1.000 0.532 11.209 9.449 4.007 3.883
You will often want to select an entire column, namely one specific variable from a data frame. If you want to select all elements of the variable diameter, for example, both of these will do the trick:
planets_df[,3]
planets_df[,"diameter"]
However, there is a short-cut. If your columns have names, you can use the $ sign:
planets_df$diameter
planets_df$diameter
## [1] 0.382 0.949 1.000 0.532 11.209 9.449 4.007 3.883
rings_vector <- planets_df$rings
rings_vector
## [1] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE
Of course this vector is identical to the rings vector we built in the first place, when constructing the planets_df data frame.
Continue to the next exercise and discover yet another way of subsetting!
You probably remember from high school that some planets in our solar system have rings and others do not. Unfortunately, you cannot recall their names. Could R help you out?
If you type rings_vector in the console, you get:
[1] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE
This means that the first four observations (or planets) do not have a ring (FALSE), but the other four do (TRUE). However, you do not get a nice overview of the names of these planets, their diameter, etc. Let’s try to use rings_vector to select the data for the four planets with rings.
planets_df[rings_vector,]
## name type diameter rotation rings
## 5 Jupiter Gas giant 11.209 0.41 TRUE
## 6 Saturn Gas giant 9.449 0.43 TRUE
## 7 Uranus Gas giant 4.007 -0.72 TRUE
## 8 Neptune Gas giant 3.883 0.67 TRUE
Notice that we’ve put this logical vector before the comma. This tells R to display those observations associated with a TRUE value, and not those with a FALSE value.
What would happen if you put rings_vector after the comma? And why? Try it and see!
Using planets_df[rings_vector,] is a rather tedious solution. The next exercise will teach you how to do it in a more concise way.
So what exactly did you learn in the previous exercises? You selected a subset from a data frame (planets_df) based on whether or not a certain condition was true (rings or no rings), and you managed to pull out all relevant data. Pretty awesome! By now, NASA is probably already flirting with your CV ;-).
Now, let us move up one level and use the function subset(). You should see the subset() function as a short-cut to do exactly the same as what you did in the previous exercises.
subset(my_df, subset = some_condition)
The first argument of subset() specifies the data set for which you want a subset. By adding the second argument, you give R the necessary information and conditions to select the correct subset.
The code below will give the exact same result as you got in the previous exercise, but this time, you didn’t need to create the rings_vector!
subset(planets_df, subset = rings)
subset(planets_df, diameter < 1)
## name type diameter rotation rings
## 1 Mercury Terrestrial planet 0.382 58.64 FALSE
## 2 Venus Terrestrial planet 0.949 -243.02 FALSE
## 4 Mars Terrestrial planet 0.532 1.03 FALSE
subset(planets_df, rings == TRUE)
## name type diameter rotation rings
## 5 Jupiter Gas giant 11.209 0.41 TRUE
## 6 Saturn Gas giant 9.449 0.43 TRUE
## 7 Uranus Gas giant 4.007 -0.72 TRUE
## 8 Neptune Gas giant 3.883 0.67 TRUE
Not only is the subset() function more concise, it is probably also more understandable for people who read your code.
Making rankings is one of mankind’s favorite pastimes. These rankings can be useful (best universities in the world), entertaining (most influential movie stars) or pointless (best 007 look-a-like).
In data analysis you can sort your data according to a certain variable in the data set. In R, this is done with the help of the function order().
order() is a function that gives you the ranked position of each element when it is applied on a variable, such as a vector for example:
a <- c(1000, 10, 100)
order(a)
## [1] 2 3 1
a
## [1] 1000 10 100
10, which is the second element in a, is the smallest element, so 2 comes first in the output of order(a). 100, which is the third element in a, is the second smallest element, so 3 comes second in the output of order(a).
Note that order(a) has not altered a itself.
We can use the output of order(a) to reshuffle a:
a[order(a)]
## [1] 10 100 1000
Once more, be aware that we haven’t altered a at all. We could do so by assigning a[order(a)] back to a (or to a new variable b if we wanted to keep a intact).
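As an aside, for a plain vector you could get the same sorted result directly with the sort() function; order() remains the tool of choice when you want to reorder something else, such as the rows of a data frame, by that ranking:

sort(a)
## [1]   10  100 1000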
Now let’s use the order() function to sort your data frame!
Alright, now that you understand the order() function, let us do something useful with it. You would like to rearrange your data frame such that it starts with the smallest planet and ends with the largest one. This will be a sort on the diameter column.
positions <- order(planets_df$diameter)
planets_df[positions, ]
## name type diameter rotation rings
## 1 Mercury Terrestrial planet 0.382 58.64 FALSE
## 4 Mars Terrestrial planet 0.532 1.03 FALSE
## 2 Venus Terrestrial planet 0.949 -243.02 FALSE
## 3 Earth Terrestrial planet 1.000 1.00 FALSE
## 8 Neptune Gas giant 3.883 0.67 TRUE
## 7 Uranus Gas giant 4.007 -0.72 TRUE
## 6 Saturn Gas giant 9.449 0.43 TRUE
## 5 Jupiter Gas giant 11.209 0.41 TRUE
This exercise concludes the chapter on data frames. Remember that data frames are extremely important in R; you will need them all the time. Another very often used data structure is the list. This will be the subject of the next chapter!
As opposed to vectors, lists can hold components of different types, just as your to-do lists can contain different categories of tasks. This chapter will teach you how to create, name, and subset these lists.
Congratulations! At this point in the course you are already familiar with:
Vectors (one-dimensional arrays): can hold numeric, character or logical values. The elements in a vector all have the same data type.
Matrices (two-dimensional arrays): can hold numeric, character or logical values. The elements in a matrix all have the same data type.
Data frames (two-dimensional objects): can hold numeric, character or logical values. Within a column all elements have the same data type, but different columns can be of different data types.
Pretty sweet for an R newbie, right? ;-)
A list in R is similar to your to-do list at work or school: the different items on that list most likely differ in length, characteristic, and type of activity that has to be done.
A list in R allows you to gather a variety of objects under one name (that is, the name of the list) in an ordered way. These objects can be matrices, vectors, data frames, even other lists, etc. It is not even required that these objects are related to each other in any way.
You could say that a list is some kind of super data type: you can store practically any piece of information in it!
Cool. Let’s get our hands dirty!
Let us create our first list! To construct a list you use the function list():
my_list <- list(comp1, comp2, ...)
The arguments to the list function are the list components. Remember, these components can be matrices, vectors, other lists, …
my_vector <- 1:10
my_matrix <- matrix(1:9, ncol = 3)
my_df <- mtcars[1:10,]
my_list <- list(my_vector, my_matrix, my_df)
my_list
## [[1]]
## [1] 1 2 3 4 5 6 7 8 9 10
##
## [[2]]
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
##
## [[3]]
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Well done, you’re on a roll!
Just like on your to-do list, you want to avoid not knowing or remembering what the components of your list stand for. That is why you should give names to them:
my_list <- list(name1 = your_comp1, name2 = your_comp2)
This creates a list with components that are named name1, name2, and so on. If you want to name your lists after you’ve created them, you can use the names() function as you did with vectors. The following commands are fully equivalent to the assignment above:
my_list <- list(your_comp1, your_comp2)
names(my_list) <- c("name1", "name2")
my_list <- list(vec = my_vector, mat = my_matrix, df = my_df)
my_list
## $vec
## [1] 1 2 3 4 5 6 7 8 9 10
##
## $mat
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
##
## $df
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Great! Not only do you know how to construct lists now, you can also name them; a skill that will prove most useful in practice. Continue to the next exercise.
Being a huge movie fan (remember your job at LucasFilms), you decide to start storing information on good movies with the help of lists.
Start by creating a list for the movie “The Shining”.
mov <- "The Shining"
act <- c("Jack Nicholson", "Shelley Duvall", "Danny Lloyd","Scatman Crothers", "Barry Nelson")
scores <- c(4.5, 4, 5)
sources <- c("IMDb1", "IMDb2", "IMDb3")
comments <- c("Best Horror Film I Have Ever Seen", "A truly brilliant and scary film from Stanley Kubrick", "A masterpiece of psychological horror")
rev <- data.frame(scores, sources, comments)
shining_list <- list(moviename = mov, actors = act, reviews = rev)
Wonderful! You now know how to construct and name lists. As in the previous chapters, let’s look at how to select elements from lists. Head over to the next exercise.
Your list will often be built out of numerous elements and components. Therefore, getting a single element, multiple elements, or a component out of it is not always straightforward.
One way to select a component is using the numbered position of that component. For example, to “grab” the first component of shining_list you type
shining_list[[1]]
A quick way to check this out is typing it in the console. Important to remember: to select elements from vectors, you use single square brackets: [ ]. Don’t mix them up!
You can also refer to the names of the components, with [[ ]] or with the $ sign. Two ways of selecting the data frame representing the reviews:
shining_list[["reviews"]]
shining_list$reviews
Besides selecting components, you often need to select specific elements out of these components. For example, with shining_list[[2]][1] you select from the second component, actors (shining_list[[2]]), the first element ([1]). When you type this in the console, you will see the answer is Jack Nicholson.
shining_list$actors
## [1] "Jack Nicholson" "Shelley Duvall" "Danny Lloyd" "Scatman Crothers"
## [5] "Barry Nelson"
shining_list$actors[2]
## [1] "Shelley Duvall"
Great! Selecting elements from lists is rather easy isn’t it? Continue to the next exercise.
You found reviews of another, more recent, Jack Nicholson movie: The Departed!
Scores Comments
4.6 I would watch it again
5 Amazing!
4.8 I liked it
5 One of the best movies
4.2 Fascinating plot
It would be useful to collect together all the pieces of information about the movie, like the title, actors, and reviews into a single variable. Since these pieces of data are different shapes, it is natural to combine them in a list variable.
movie_title <- "The Departed"
movie_actors <- c("Leonardo DiCaprio", "Matt Damon", "Jack Nicholson", "Mark Wahlberg", "Vera Farmiga", "Martin Sheen")
scores <- c(4.6, 5, 4.8, 5, 4.2)
comments <- c("I would watch it again", "Amazing!", "I liked it", "One of the best movies", "Fascinating plot")
avg_review <- mean(scores)
reviews_df <- data.frame(scores, comments)
departed_list <- list(movie_title, movie_actors, reviews_df, avg_review)
departed_list
## [[1]]
## [1] "The Departed"
##
## [[2]]
## [1] "Leonardo DiCaprio" "Matt Damon" "Jack Nicholson"
## [4] "Mark Wahlberg" "Vera Farmiga" "Martin Sheen"
##
## [[3]]
## scores comments
## 1 4.6 I would watch it again
## 2 5.0 Amazing!
## 3 4.8 I liked it
## 4 5.0 One of the best movies
## 5 4.2 Fascinating plot
##
## [[4]]
## [1] 4.72
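If you prefer named components, as with shining_list earlier, the same list can be built with names; the names below are just one possible choice, not part of the original exercise:
departed_list <- list(title = movie_title, actors = movie_actors, reviews = reviews_df, avg_review = avg_review)  # names are illustrative only
departed_list$avg_review
## [1] 4.72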
Good work! You successfully created another list of movie information, and combined different components into a single list. Congratulations on finishing the course!
Intermediate R is the next stop on your journey in mastering the R programming language. In this R training, you will learn about conditional statements, loops, and functions to power your own R scripts. Next, make your R code more efficient and readable using the apply functions. Finally, the utilities chapter gets you up to speed with regular expressions in R, data structure manipulations, and times and dates. This course will allow you to take the next step in advancing your overall knowledge and capabilities while programming in R.
In this chapter, you’ll learn about relational operators for comparing R objects, and logical operators like “and” and “or” for combining TRUE and FALSE values. Then, you’ll use this knowledge to build conditional statements.
The most basic form of comparison is equality. Let’s briefly recap its syntax. The following statements all evaluate to TRUE (feel free to try them out in the console).
3 == (2 + 1)
## [1] TRUE
"intermediate" != "r"
## [1] TRUE
TRUE != FALSE
## [1] TRUE
"Rchitect" != "rchitect"
## [1] TRUE
Notice from the last expression that R is case sensitive: “R” is not equal to “r”. Keep this in mind when solving the exercises in this chapter!
TRUE == FALSE
## [1] FALSE
-6 * 14 != 17 - 101
## [1] FALSE
"useR" == "user"
## [1] FALSE
TRUE == 1
## [1] TRUE
Awesome! Since TRUE coerces to 1 under the hood, TRUE == 1 evaluates to TRUE. Make sure not to mix up == (comparison) and = (used to set arguments in function calls); == is what you need to check the equality of R objects.
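Because TRUE and FALSE coerce to 1 and 0, you can even do arithmetic on logical values; for example, summing a logical vector counts the TRUEs and taking its mean gives the proportion of TRUEs:
sum(c(TRUE, FALSE, TRUE, TRUE))
## [1] 3
mean(c(TRUE, FALSE, TRUE, TRUE))
## [1] 0.75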
Apart from equality operators, Filip also introduced the less than and greater than operators: < and >. You can also add an equal sign to express less than or equal to or greater than or equal to, respectively. Have a look at the following R expressions, which all evaluate to FALSE:
(1 + 2) > 4
"dog" < "Cats"
TRUE <= FALSE
Remember that for string comparison, R determines the greater than relationship based on alphabetical order. Also, keep in mind that TRUE is treated as 1 for arithmetic, and FALSE is treated as 0. Therefore, FALSE < TRUE is TRUE.
-6 * 5 + 2 >= -10 + 1
## [1] FALSE
"raining" <= "raining dogs"
## [1] TRUE
TRUE > FALSE
## [1] TRUE
Make sure to have a look at the console output to see if R returns the results you expected.
You are already aware that R is very good with vectors. Without having to change anything about the syntax, R’s relational operators also work on vectors. But be careful: the comparison is element-by-element, so the two vectors must have the same number of elements (i.e. they must be the same length).
Let’s go back to the example that was started in the video. You want to figure out whether your activity on social media platforms has paid off and decide to look at your results for LinkedIn and Facebook. The sample code in the editor initializes the vectors linkedin and facebook. Each of the vectors contains the number of profile views your LinkedIn and Facebook profiles had over the last seven days.
linkedin <- c(16, 9, 13, 5, 2, 17, 14)
facebook <- c(17, 7, 5, 16, 8, 13, 14)
linkedin > 15
## [1] TRUE FALSE FALSE FALSE FALSE TRUE FALSE
linkedin <= 5
## [1] FALSE FALSE FALSE TRUE TRUE FALSE FALSE
linkedin > facebook
## [1] FALSE TRUE TRUE FALSE FALSE TRUE FALSE
Have a look at the console output. Your LinkedIn profile was pretty popular on the sixth day, but less so on the fourth and fifth day.
R’s ability to deal with different data structures for comparisons does not stop at vectors. Matrices and relational operators also work together seamlessly!
First we’ll store the LinkedIn and Facebook data in a matrix (rather than in vectors). We’ll call this matrix views. The first row contains the LinkedIn information; the second row the Facebook information. The original vectors facebook and linkedin are still available as well.
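The code that actually builds views isn’t shown in this excerpt; a minimal sketch that matches the description above (LinkedIn on the first row, Facebook on the second) could look like this:
views <- matrix(c(linkedin, facebook), nrow = 2, byrow = TRUE)  # sketch: combine the two vectors row-wise (not necessarily the course's original code)
views
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,]   16    9   13    5    2   17   14
## [2,]   17    7    5   16    8   13   14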
views == 13
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] FALSE FALSE TRUE FALSE FALSE FALSE FALSE
## [2,] FALSE FALSE FALSE FALSE FALSE TRUE FALSE
views <= 14
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] FALSE TRUE TRUE TRUE TRUE FALSE TRUE
## [2,] FALSE TRUE TRUE FALSE TRUE TRUE TRUE
This exercise concludes the part on comparators. Now that you know how to query the relation between R objects, the next step will be to use the results to alter the behavior of your programs. Find out all about that in the next video!
Before you work your way through the next exercises, have a look at the following R expressions. All of them will evaluate to TRUE:
TRUE & TRUE
FALSE | TRUE
5 <= 5 & 2 < 3
3 < 4 | 7 < 6
Watch out: 3 < x < 7 to check if x is between 3 and 7 will not work; you’ll need 3 < x & x < 7 for that.
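For example, to check whether some value x lies strictly between 3 and 7 (the value 5 below is just an illustration):
x <- 5
3 < x & x < 7
## [1] TRUE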
In this exercise, you’ll be working with the last variable. We’ll make this variable equal the last value of the linkedin vector that you’ve worked with previously. The linkedin vector represents the number of LinkedIn views your profile had in the last seven days, remember?
last <- tail(linkedin, 1)
last < 5 | last > 10
## [1] TRUE
last > 15 & last <= 20
## [1] FALSE
Have one last look at the console before proceeding; do the results of the different expressions make sense?
Like relational operators, logical operators work perfectly fine with vectors and matrices.
Ready for some advanced queries to gain more insights into your social outreach?
linkedin > 10 & facebook < 10
## [1] FALSE FALSE TRUE FALSE FALSE FALSE FALSE
linkedin >= 12 | facebook >= 12
## [1] TRUE FALSE TRUE TRUE FALSE TRUE TRUE
views > 11 & views <= 14
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] FALSE FALSE TRUE FALSE FALSE FALSE TRUE
## [2,] FALSE FALSE FALSE FALSE FALSE TRUE TRUE
You’ll have noticed how easy it is to apply logical operators to vectors and matrices. What do these results tell us? The third day of the recordings was the only day where your LinkedIn profile was visited more than 10 times, while your Facebook profile wasn’t. Can you draw similar conclusions for the other results?
With the things you’ve learned by now, you’re able to solve pretty cool problems.
Instead of recording the number of views for your own LinkedIn profile, suppose you conducted a survey inside the company you’re working for. You’ve asked every employee with a LinkedIn profile how many visits their profile has had over the past seven days. The data will be stored in the matrix employees_views; a data frame li_df is then created from this matrix, with appropriate names for the rows. Finally, names will be added to the columns.
employees_views <- matrix(c(2, 3, 3, 6, 4, 2, 0, 19, 23, 18, 22, 23, 29, 25, 24, 18, 15, 19, 18, 22, 17, 22, 18, 27, 26, 19, 21, 25, 25, 25, 26, 31, 24, 36, 37, 22, 20, 29, 26, 23, 22, 29, 0, 4, 2, 2, 3, 4, 2, 12, 3, 15, 7, 1, 15, 11, 19, 22, 22, 19, 25, 24, 23, 23, 12, 19, 25, 18, 22, 22, 29, 27, 23, 25, 29, 30, 17, 13, 13, 20, 17, 12, 22, 20, 7, 17, 9, 5, 11, 9, 9, 26, 27, 28, 36, 29, 31, 30, 7, 6, 4, 11, 5, 5, 15, 32, 35, 31, 35, 24, 25, 36, 7, 17, 9, 12, 13, 6, 12, 9, 6, 3, 12, 3, 8, 6, 0, 1, 11, 6, 0, 4, 11, 9, 12, 6, 13, 12, 13, 11, 6, 15, 15, 10, 9, 7, 18, 17, 17, 12, 4, 14, 17, 7, 1, 12, 8, 2, 4, 4, 11, 5, 8, 0, 1, 6, 3, 1, 2, 7, 5, 3, 1, 5, 5, 29, 25, 32, 28, 28, 27, 27, 17, 15, 17, 23, 23, 17, 22, 26, 32, 33, 30, 33, 28, 26, 27, 29, 24, 29, 26, 31, 28, 4, 1, 1, 2, 1, 7, 4, 22, 22, 17, 20, 14, 19, 13, 9, 11, 7, 10, 8, 15, 5, 6, 5, 12, 5, 17, 17, 4, 18, 17, 12, 22, 22, 13, 12, 2, 12, 13, 7, 10, 6, 2, 32, 26, 20, 23, 24, 25, 21, 5, 13, 12, 11, 6, 5, 10, 6, 10, 11, 6, 6, 2, 5, 30, 37, 32, 35, 37, 41, 42, 34, 33, 32, 35, 33, 27, 35, 15, 19, 21, 18, 22, 26, 22, 28, 29, 30, 19, 21, 19, 26, 6, 8, 6, 7, 17, 11, 14, 17, 22, 27, 24, 18, 28, 24, 6, 10, 17, 18, 13, 10, 7, 18, 19, 22, 17, 21, 15, 23, 21, 27, 28, 28, 26, 17, 25, 10, 18, 20, 18, 12, 19, 17, 6, 15, 15, 15, 10, 14, 2, 30, 28, 29, 31, 24, 20, 25), nrow = 50, byrow = TRUE)
li_df <- data.frame(employees_views, row.names = c("employee_1", "employee_2", "employee_3", "employee_4", "employee_5", "employee_6", "employee_7", "employee_8", "employee_9", "employee_10", "employee_11", "employee_12", "employee_13", "employee_14", "employee_15", "employee_16", "employee_17", "employee_18", "employee_19", "employee_20", "employee_21", "employee_22", "employee_23", "employee_24", "employee_25", "employee_26", "employee_27", "employee_28", "employee_29", "employee_30", "employee_31", "employee_32", "employee_33", "employee_34", "employee_35", "employee_36", "employee_37", "employee_38", "employee_39", "employee_40", "employee_41", "employee_42", "employee_43", "employee_44", "employee_45", "employee_46", "employee_47", "employee_48", "employee_49", "employee_50") )
names(li_df) <- c("day1", "day2", "day3", "day4", "day5", "day6", "day7")
second <- li_df[, 2]
extremes <- second < 5 | second > 25
sum(extremes)
## [1] 16
Head over to the next video and learn how relational and logical operators can be used to alter the flow of your R scripts.
Before diving into some exercises on the if statement, have another look at its syntax:
if (condition) {
expr
}
Remember your vectors with social profile views? Let’s look at it from another angle. We create a variable called medium which gives information about the social website, and another called num_views which denotes the actual number of views that particular medium had on the last day of your recordings.
We start by defining these variables related to your last day of recordings:
medium <- "LinkedIn"
num_views <- 14
if (medium == "LinkedIn") {
print("Showing LinkedIn information")
}
## [1] "Showing LinkedIn information"
if (num_views > 15) {
print("You are popular!")
}
Try to see what happens if you change the medium and num_views variables and run your code again. Let’s further customize these if statements in the next exercise.
You can only use an else statement in combination with an if statement. The else statement does not require a condition; its corresponding code is simply run if all of the preceding conditions in the control structure are FALSE. Here’s a recipe for its usage:
if (condition) {
expr1
} else {
expr2
}
It’s important that the else keyword comes on the same line as the closing bracket of the if part!
We will now extend the if statements that we coded in the previous exercises with the appropriate else statements!
if (medium == "LinkedIn") {
print("Showing LinkedIn information")
} else {
print("Unknown medium")
}
if (num_views > 15) {
print("You're popular!")
} else {
print("Try to be more visible!")
}
You also had Facebook information available, remember? Time to add some more statements to our control structures using else if!
The else if statement allows you to further customize your control structure. You can add as many else if statements as you like. Keep in mind that R ignores the remainder of the control structure once a condition has been found that is TRUE and the corresponding expressions have been executed. Here’s an overview of the syntax to freshen your memory:
if (condition1) {
expr1
} else if (condition2) {
expr2
} else if (condition3) {
expr3
} else {
expr4
}
Again, it’s important that the else if keywords come on the same line as the closing bracket of the previous part of the control construct.
if (medium == "LinkedIn") {
print("Showing LinkedIn information")
} else if (medium == "Facebook") {
print("Showing Facebook information")
} else {
print("Unknown medium")
}
## [1] "Showing LinkedIn information"
if (num_views > 15) {
print("You're popular!")
} else if (num_views <= 15 & num_views > 10) {
print("Your number of views is average")
} else {
print("Try to be more visible!")
}
## [1] "Your number of views is average"
Have another look at the second control structure. Because R abandons the control flow as soon as it finds a condition that is met, you can simplify the condition for the else if part in the second construct to num_views > 10.
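A sketch of that simplified version (the behaviour is unchanged for the current value of num_views):
if (num_views > 15) {
print("You're popular!")
} else if (num_views > 10) {
print("Your number of views is average")
} else {
print("Try to be more visible!")
}
## [1] "Your number of views is average"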
In this exercise, you will combine everything that you’ve learned so far: relational operators, logical operators and control constructs. You’ll need it all!
li <- 15
fb <- 9
These two variables, li and fb denote the number of profile views your LinkedIn and Facebook profile had on the last day of recordings. Go through the instructions to create R code that generates a ‘social media score’, sms, based on the values of li and fb.
if (li >= 15 & fb >= 15) {
sms <- 2 * (li + fb)
} else if (li < 10 & fb < 10) {
sms <- 0.5 * (li + fb)
} else {
sms <- li + fb
}
sms
## [1] 24
Feel free to play around some more with your solution by changing the values of li and fb.
Loops can come in handy on numerous occasions. While loops are like repeated if statements, the for loop is designed to iterate over all elements in a sequence. Learn about them in this chapter.
Let’s get you started with building a while loop from the ground up. Have another look at its recipe:
while (condition) {
expr
}
Remember that the condition part of this recipe should become FALSE at some point during the execution. Otherwise, the while loop will go on indefinitely.
If your session expires when you run your code, check the body of your while loop carefully.
Have a look at the code below; it initializes the speed variables and already provides a while loop template to get you started.
speed <- 64
while (speed > 30) {
print("Slow down!")
speed <- speed - 7
}
## [1] "Slow down!"
## [1] "Slow down!"
## [1] "Slow down!"
## [1] "Slow down!"
## [1] "Slow down!"
speed
## [1] 29
In the previous exercise, you simulated the interaction between a driver and a driver’s assistant: When the speed was too high, “Slow down!” got printed out to the console, resulting in a decrease of your speed by 7 units.
There are several ways in which you could make your driver’s assistant more advanced. For example, the assistant could give you different messages based on your speed or provide you with a current speed at a given moment.
A while loop similar to the one you’ve coded in the previous exercise is already available in the editor. It prints out your current speed, but there’s no code that decreases the speed variable yet, which is pretty dangerous. Can you make the appropriate changes?
Note that we’ll need to assign the value of 64 to the variable speed, as it currently has the value 29.
speed <- 64
while (speed > 30) {
print(paste("Your speed is",speed))
if (speed > 48) {
print("Slow down big time!")
speed <- speed - 11
} else {
print("Slow down!")
speed <- speed - 6
}
}
## [1] "Your speed is 64"
## [1] "Slow down big time!"
## [1] "Your speed is 53"
## [1] "Slow down big time!"
## [1] "Your speed is 42"
## [1] "Slow down!"
## [1] "Your speed is 36"
## [1] "Slow down!"
To further improve our driver assistant model, head over to the next exercise!
There are some very rare situations in which severe speeding is necessary: what if a hurricane is approaching and you have to get away as quickly as possible? You don’t want the driver’s assistant sending you speeding notifications in that scenario, right?
This seems like a great opportunity to include the break statement in the while loop you’ve been working on. Remember that the break statement is a control statement. When R encounters it, the while loop is abandoned completely.
Once again, we begin by initialising the speed variable.
speed <- 88
while (speed > 30) {
print(paste("Your speed is", speed))
if (speed > 80) {
break
}
if (speed > 48) {
print("Slow down big time!")
speed <- speed - 11
} else {
print("Slow down!")
speed <- speed - 6
}
}
## [1] "Your speed is 88"
Now that you’ve correctly solved this exercise, feel free to play around with different values of speed to see how the while loop handles the different cases.
The previous exercises guided you through developing a pretty advanced while loop, containing a break statement and different messages and updates as determined by control flow constructs. If you manage to solve this comprehensive exercise using a while loop, you’re totally ready for the next topic: the for loop.
i <- 1
while (i <= 10) {
print(3 * i)
if (3 * i %% 8 == 0) {
break
}
i <- i + 1
}
## [1] 3
## [1] 6
## [1] 9
## [1] 12
## [1] 15
## [1] 18
## [1] 21
## [1] 24
Head over to the next video!
Loop over a vector
In the previous video, Filip told you about two different strategies for using the for loop. To refresh your memory, consider the following loops that are equivalent in R:
primes <- c(2, 3, 5, 7, 11, 13)
# loop version 1
for (p in primes) {
print(p)
}
# loop version 2
for (i in 1:length(primes)) {
print(primes[i])
}
Remember our linkedin vector? It’s a vector that contains the number of views your LinkedIn profile had in the last seven days. Let’s remind ourselves of the vector.
linkedin
## [1] 16 9 13 5 2 17 14
for(elements in linkedin) {
print (elements)
}
## [1] 16
## [1] 9
## [1] 13
## [1] 5
## [1] 2
## [1] 17
## [1] 14
for(i in 1:length(linkedin)) {
print (linkedin[i])
}
## [1] 16
## [1] 9
## [1] 13
## [1] 5
## [1] 2
## [1] 17
## [1] 14
Looping over a list is just as easy and convenient as looping over a vector. There are again two different approaches here:
primes_list <- list(2, 3, 5, 7, 11, 13)
# loop version 1
for (p in primes_list) {
print(p)
}
# loop version 2
for (i in 1:length(primes_list)) {
print(primes_list[[i]])
}
Recall from earlier that to select elements from lists, rather than single square brackets, we need double square brackets [[ ]]. You will see them again in loop version 2 above.
Suppose you have a list of all sorts of information on New York City: its population size, the names of the boroughs, and whether it is the capital of the United States. We first prepare a list nyc with all this information (source: Wikipedia).
nyc <- list(pop = 8405837,
boroughs = c("Manhattan", "Bronx", "Brooklyn", "Queens", "Staten Island"),
capital = FALSE)
for(elements in nyc) {
print(elements)
}
## [1] 8405837
## [1] "Manhattan" "Bronx" "Brooklyn" "Queens"
## [5] "Staten Island"
## [1] FALSE
for(i in 1:length(nyc)) {
print(nyc[[i]])
}
## [1] 8405837
## [1] "Manhattan" "Bronx" "Brooklyn" "Queens"
## [5] "Staten Island"
## [1] FALSE
Filip mentioned that for loops can also be used for matrices. Let’s put that to a test in the next exercise.
We’ll define a matrix ttt, that represents the status of a tic-tac-toe game. It contains the values "X", "O" and NA. We’ll print out ttt in the console once it’s been defined to get a closer look. On row 1 and column 1, there’s "O", while on row 3 and column 2 there’s NA.
ttt <- matrix(c("O", NA, "X", NA, "O", "O", "X", NA, "X"), byrow = TRUE, nrow = 3)
ttt
## [,1] [,2] [,3]
## [1,] "O" NA "X"
## [2,] NA "O" "O"
## [3,] "X" NA "X"
To solve this exercise, you’ll need a for loop inside a for loop, often called a nested loop. Doing this in R is a breeze! Simply use the following recipe:
for (var1 in seq1) {
for (var2 in seq2) {
expr
}
}
for (i in 1:nrow(ttt)) {
for (j in 1:ncol(ttt)) {
print(paste("On row ", i, " and column ", j, " the board contains ", ttt[i,j]))
}
}
## [1] "On row 1 and column 1 the board contains O"
## [1] "On row 1 and column 2 the board contains NA"
## [1] "On row 1 and column 3 the board contains X"
## [1] "On row 2 and column 1 the board contains NA"
## [1] "On row 2 and column 2 the board contains O"
## [1] "On row 2 and column 3 the board contains O"
## [1] "On row 3 and column 1 the board contains X"
## [1] "On row 3 and column 2 the board contains NA"
## [1] "On row 3 and column 3 the board contains X"
Notice that this loop went through the whole of row 1 before moving on to row 2. This makes sense, as the rows are the outer loop and the columns are the inner loop.
You’re sufficiently comfortable with basic for looping, so it’s time to step it up a notch!
Let’s return to the LinkedIn profile views data, stored in a vector linkedin. In the first exercise on for loops you already did a simple printout of each element in this vector. A little more in-depth interpretation of this data wouldn’t hurt, right? Time to throw in some conditionals! As with the while loop, you can use the if and else statements inside the for loop.
for (li in linkedin) {
if (li > 10) {
print ("You're popular!")
} else {
print ("Be more visible!")
}
print(li)
}
## [1] "You're popular!"
## [1] 16
## [1] "Be more visible!"
## [1] 9
## [1] "You're popular!"
## [1] 13
## [1] "Be more visible!"
## [1] 5
## [1] "Be more visible!"
## [1] 2
## [1] "You're popular!"
## [1] 17
## [1] "You're popular!"
## [1] 14
In the next exercise, you’ll customize this for loop even further with break and next statements.
In the editor on the right you’ll find a possible solution to the previous exercise. The code loops over the linkedin vector and prints out different messages depending on the values of li.
In this exercise, you will use the break and next statements:
The break statement abandons the active loop: the remaining code in the loop is skipped and the loop is not iterated over anymore. The next statement skips the remainder of the code in the loop, but continues the iteration.
for (li in linkedin) {
if (li > 16) {
print ("This is ridiculous, I'm outta here!")
break
}
if (li < 5) {
print ("This is too embarrassing!")
next
}
if (li > 10) {
print("You're popular!")
} else {
print("Be more visible!")
}
print(li)
}
## [1] "You're popular!"
## [1] 16
## [1] "Be more visible!"
## [1] 9
## [1] "You're popular!"
## [1] 13
## [1] "Be more visible!"
## [1] 5
## [1] "This is too embarrassing!"
## [1] "This is ridiculous, I'm outta here!"
for, break, next? You name it, you can do it!
This exercise will not introduce any new concepts on for loops.
We first define a variable rquote, then split it up into its separate characters using the strsplit() function and store the result in a vector chars.
Can you write code that counts the number of r’s that come before the first u in rquote?
rquote <- "r's internals are irrefutably intriguing"
chars <- strsplit(rquote, split = "")[[1]]
rcount <- 0
for (char in chars) {
if (char == "u") {
break
}
if (char == "r") {
rcount <- rcount + 1
}
}
rcount
## [1] 5
For-midable! This exercise concludes the chapter on while and for loops.
Functions are an extremely important concept in almost every programming language, and R is no different. Learn what functions are and how to use them—then take charge by writing your own functions.
Before even thinking of using an R function, you should clarify which arguments it expects. All the relevant details such as a description, usage, and arguments can be found in the documentation. To consult the documentation on the sample() function, for example, you can use one of following R commands:
help(sample)
?sample
If you execute these commands in the console of the DataCamp interface, you’ll be redirected to www.rdocumentation.org. If you execute these commands in the console of an IDE (integrated development environment) such as RStudio, the documentation will open in the Help panel.
A quick hack to see the arguments of the sample() function is the args() function. Try it out in the console:
args(sample)
In the next exercises, you’ll be learning how to use the mean() function with increasing complexity. The first thing you’ll have to do is get acquainted with the mean() function.
args(mean)
## function (x, ...)
## NULL
That wasn’t too hard, was it? Take a look at the documentation and head over to the next exercise.
The documentation on the mean() function gives us quite some information:
The mean() function computes the arithmetic mean.
Its arguments are x and ....
The x argument should be a vector containing numeric, logical or time-related information. (Remember what we learnt about the numeric values of TRUE and FALSE to understand how you could take an average of logical values!)
Remember that R can match arguments both by position and by name. Can you still remember the difference? You’ll find out in this exercise!
Once more, you’ll be working with the view counts of your social network profiles for the past 7 days.
avg_li <- mean(linkedin)
avg_fb <- mean(facebook)
avg_li
## [1] 10.85714
avg_fb
## [1] 11.42857
I’m sure you’ve already called more advanced R functions in your history as a programmer. Now you also know what actually happens under the hood ;-)
Check the documentation on the mean() function again:
?mean
The Usage section of the documentation includes two versions of the mean() function. The first usage,
mean(x, ...)
is the most general usage of the mean function. The ‘Default S3 method’, however, is:
mean(x, trim = 0, na.rm = FALSE, ...)
The ... is called the ellipsis. It is a way for R to pass arguments along without the function having to name them explicitly. The ellipsis will be treated in more detail in future courses.
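To get a feel for how the ellipsis works, here is a small, purely hypothetical wrapper (the name mean_wrapper is not from the course) that simply forwards any extra arguments it receives to mean():
mean_wrapper <- function(x, ...) {
mean(x, ...)  # hypothetical helper: passes along anything captured by ...
}
mean_wrapper(c(1, 5, NA), na.rm = TRUE)
## [1] 3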
For the remainder of this exercise, just work with the second usage of the mean function. Notice that both trim and na.rm have default values. This makes them optional arguments.
avg_sum <- mean(linkedin + facebook)
avg_sum_trimmed <- mean(linkedin + facebook, trim = 0.2)
avg_sum
## [1] 22.28571
avg_sum_trimmed
## [1] 22.6
When the trim argument is not zero, it chops off a fraction (equal to trim) of observations from each end of the sorted vector you pass as argument x before the mean is computed.
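To see trim in action on a small made-up vector: with four values and trim = 0.25, one observation is dropped from each end before averaging:
mean(c(1, 2, 3, 100))
## [1] 26.5
mean(c(1, 2, 3, 100), trim = 0.25)
## [1] 2.5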
In the video, Filip guided you through the example of specifying arguments of the sd() function. The sd() function has an optional argument, na.rm, that specifies whether or not missing values should be removed from the input vector before calculating the standard deviation.
If you’ve had a good look at the documentation, you’ll know by now that the mean() function also has this argument, na.rm, and it does the exact same thing. By default, it is set to FALSE, as the Usage of the default S3 method shows:
mean(x, trim = 0, na.rm = FALSE, ...)
Let’s see what happens if your vectors linkedin and facebook contain missing values (NA).
linkedin <- c(16, 9, 13, 5, NA, 17, 14)
facebook <- c(17, NA, 5, 16, 8, 13, 14)
mean(linkedin)
## [1] NA
mean(linkedin, na.rm = TRUE)
## [1] 12.33333
You already know that R functions return objects that you can then use somewhere else. This makes it easy to use functions inside functions, as you’ve seen before:
speed <- 31
print(paste("Your speed is", speed))
Notice that both the print() and paste() functions use the ellipsis - … - as an argument. Can you figure out how they’re used?
mean(abs(linkedin - facebook), na.rm = TRUE)
## [1] 4.8
Using functions that are already available in R is pretty straightforward, but how about writing your own functions to supercharge your R programs? The next video will tell you how.
Wow, things are getting serious… you’re about to write your own function! Before you have a go at it, have a look at the following function template:
my_fun <- function(arg1, arg2) {
body
}
Notice that this recipe uses the assignment operator (<-) just as if you were assigning a vector to a variable, for example. This is not a coincidence. Creating a function in R basically is the assignment of a function object to a variable! In the recipe above, you’re creating a new R variable my_fun, that becomes available in the workspace as soon as you execute the definition. From then on, you can use my_fun as a function.
pow_two <- function(x) {
x ^ 2
}
pow_two(12)
## [1] 144
sum_abs <- function(x, y) {
abs(x) + abs(y)
}
sum_abs(-2, 3)
## [1] 5
Step it up a notch in the next exercise!
There are situations in which your function does not require an input. Let’s say you want to write a function that gives us the random outcome of throwing a fair die:
throw_die <- function() {
number <- sample(1:6, size = 1)
number
}
throw_die()
Up to you to code a function that doesn’t take any arguments!
hello <- function() {
print("Hi there!")
return(TRUE)
}
hello()
## [1] "Hi there!"
## [1] TRUE
Do you still remember the difference between an argument with and without default values? Have another look at the sd() function by typing ?sd in the console. The usage section shows the following information:
sd(x, na.rm = FALSE)
This tells us that x has to be defined for the sd() function to be called correctly; na.rm, however, already has a default value, so not specifying this argument won’t cause an error.
You can define default argument values in your own R functions as well. You can use the following recipe to do so:
my_fun <- function(arg1, arg2 = val2) {
body
}
The editor on the right already includes an extended version of the pow_two() function from before. Can you finish it?
pow_two <- function(x, print_info = TRUE) {
y <- x ^ 2
if (print_info == TRUE) {
print(paste(x, "to the power two equals", y))
}
return(y)
}
pow_two(12)
## [1] "12 to the power two equals 144"
## [1] 144
pow_two(12, print_info = TRUE)
## [1] "12 to the power two equals 144"
## [1] 144
pow_two(12, print_info = FALSE)
## [1] 144
Have you tried calling this pow_two() function? Try pow_two(5), pow_two(5, TRUE) and pow_two(5, FALSE). Which ones give different results?
Normally I don’t write the question text here. However, in the case of this question, I think it’s useful. The question goes like this…
An issue that Filip did not discuss in the video is function scoping. It implies that variables that are defined inside a function are not accessible outside that function. Try running the following code and see if you understand the results:
pow_two <- function(x) {
y <- x ^ 2
return(y)
}
pow_two(4)
## [1] 16
Did you try calling y and x? Did you receive an error? y was defined inside the pow_two() function and therefore it is not accessible outside of that function. This is also true for the function’s arguments of course - x in this case.
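A quick illustration of scoping (this assumes you haven’t created an object called y in your own workspace):
pow_two(4)
## [1] 16
exists("y")  # is there an object y in the workspace? The y inside pow_two() doesn't count
## [1] FALSE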
If you’re familiar with other programming languages, you might wonder whether R passes arguments by value or by reference. Find out in the next exercise!
Once again, the text of this question is quite useful to us, so I’ll reprint it.
The title gives it away already: R passes arguments by value. What does this mean? Simply put, it means that an R function cannot change the variable that you input to that function. Let’s look at a simple example (try it in the console):
triple <- function(x) {
x <- 3*x
x
}
a <- 5
triple(a)
## [1] 15
a
## [1] 5
Inside the triple() function, the argument x gets overwritten with its value times three. Afterwards this new x is returned. If you call this function with a variable a set equal to 5, you obtain 15. But did the value of a change? If R were to pass a to triple() by reference, the override of the x inside the function would ripple through to the variable a, outside the function. However, R passes by value, so the R objects you pass to a function can never change unless you do an explicit assignment. a remains equal to 5, even after calling triple(a).
Given that R passes arguments by value and not by reference, the value of count is not changed after the first two calls of increment(). Only in the final expression, where count is re-assigned explicitly, does the value of count change.
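The increment() function from that exercise isn’t reproduced in this excerpt; a plausible version, purely for illustration, behaves like this:
increment <- function(counter, by = 1) {  # illustrative definition, not the course's exact code
counter <- counter + by
counter
}
count <- 5
increment(count)  # returns the incremented value, but count itself is untouched
## [1] 6
count
## [1] 5
count <- increment(count)  # only an explicit re-assignment changes count
count
## [1] 6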
Now that you’ve acquired some skills in defining functions with different types of arguments and return values, you should try to create more advanced functions. As you’ve noticed in the previous exercises, it’s perfectly possible to add control-flow constructs, loops and even other functions to your function body.
Remember our social media example, using the vectors linkedin and facebook? As a first step, you will be writing a function that can interpret a single value of this vector. In the next exercise, you will write another function that can handle an entire vector at once.
Note that the linkedin and facebook vectors will be returned to their original forms (without NAs).
linkedin <- c(16, 9, 13, 5, 2, 17, 14)
facebook <- c(17, 7, 5, 16, 8, 13, 14)
interpret <- function(num_views) {
if (num_views > 15) {
print("You're popular!")
return (num_views)
} else {
print("Try to be more visible!")
return(0)
}
}
interpret(linkedin[1])
## [1] "You're popular!"
## [1] 16
interpret(facebook[2])
## [1] "Try to be more visible!"
## [1] 0
The annoying thing here is that interpret() only takes one argument. Proceed to the next exercise to implement something more useful.
A possible implementation of the interpret() function is already available in the editor. In this exercise you’ll be writing another function that will use the interpret() function to interpret all the data from your daily profile views inside a vector. Furthermore, your function will return the sum of views on popular days, if asked for. A for loop is ideal for iterating over all the vector elements. The ability to return the sum of views on popular days is something you can code through a function argument with a default value.
interpret <- function(num_views) {
if (num_views > 15) {
print("You're popular!")
return(num_views)
} else {
print("Try to be more visible!")
return(0)
}
}
interpret_all <- function(views, return_sum = TRUE) {
count <- 0
for (v in views) {
count <- count + interpret(v)
}
if (return_sum == TRUE) {
return(count)
} else {
return(NULL)
}
}
interpret_all(linkedin)
## [1] "You're popular!"
## [1] "Try to be more visible!"
## [1] "Try to be more visible!"
## [1] "Try to be more visible!"
## [1] "Try to be more visible!"
## [1] "You're popular!"
## [1] "Try to be more visible!"
## [1] 33
interpret_all(facebook)
## [1] "You're popular!"
## [1] "Try to be more visible!"
## [1] "Try to be more visible!"
## [1] "You're popular!"
## [1] "Try to be more visible!"
## [1] "Try to be more visible!"
## [1] "Try to be more visible!"
## [1] 33
Have a look at the results; it appears that the sum of views on popular days is the same for Facebook and LinkedIn, what a coincidence! Your different social profiles must be fairly balanced ;-) Head over to the next video!
There are basically two extremely important functions when it comes down to R packages:
install.packages(), which as you can expect, installs a given package.
library(), which loads packages, i.e. attaches them to the search list in your R workspace.
To install packages, you need administrator privileges. This means that install.packages() will not work in the DataCamp interface. However, almost all CRAN packages are installed on our servers. You can load them with library().
In this exercise, you’ll be learning how to load the ggplot2 package, a powerful package for data visualization. You’ll use it to create a plot of two variables of the mtcars data frame. The data has already been prepared for you in the workspace.
Before starting, execute the following commands in the console:
search(), to look at the currently attached packages, and
qplot(mtcars$wt, mtcars$hp), to build a plot of two variables of the mtcars data frame.
An error should occur, because you haven’t loaded the ggplot2 package yet!
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.5.3
qplot(mtcars$wt, mtcars$hp)
search()
## [1] ".GlobalEnv" "package:ggplot2" "package:stats"
## [4] "package:graphics" "package:grDevices" "package:utils"
## [7] "package:datasets" "package:methods" "Autoloads"
## [10] "package:base"
Notice how search() and library() are closely interconnected functions. Head over to the next exercise.
The library() and require() functions are not very picky when it comes down to argument types: both library(rjson) and library("rjson") work perfectly fine for loading a package.
Only chunk 1 and chunk 2 are correct. Can you figure out why the last two aren’t valid? The warning you receive with chunk 4 makes it quite clear what’s wrong there. For chunk 3, it seems that the original author of the require() function wanted to allow people to be lazy, and not have to enclose the package name with quote marks "". To do this, they included a default setting within require(). View this using the args() function. Can you see why changing this default setting in chunk 4, combined with the lack of quotation marks, throws an error?
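The four chunks themselves aren’t reproduced in this excerpt; the default in question is most likely character.only, and the calls below are a hedged sketch of the behaviour described above:
args(require)  # note the character.only argument and its default, FALSE
# require(rjson)                          # works: the bare name is converted to a string for you
# require("rjson")                        # works: already a character string
# require(rjson, character.only = TRUE)   # errors: rjson is now evaluated as an ordinary object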
This exercise concludes the chapter on functions. Well done!
Whenever you’re using a for loop, you may want to revise your code to see whether you can use the lapply function instead. Learn all about this intuitive way of applying a function over a list or a vector, and how to use its variants, sapply and vapply.
Before you go about solving the exercises below, have a look at the documentation of the lapply() function. The Usage section shows the following expression:
lapply(X, FUN, ...)
To put it generally, lapply takes a vector or list X, and applies the function FUN to each of its members. If FUN requires additional arguments, you pass them after you’ve specified X and FUN (in the ... part). The output of lapply() is a list, the same length as X, where each element is the result of applying FUN on the corresponding element of X.
Now that you are truly brushing up on your data science skills, let’s revisit some of the most relevant figures in data science history. We’ve compiled a vector of famous mathematicians/statisticians and the year they were born. Up to you to extract some information!
pioneers <- c("GAUSS:1777", "BAYES:1702", "PASCAL:1623", "PEARSON:1857")
split_math <- strsplit(pioneers, split = ":")
split_low <- lapply(split_math, tolower)
str(split_low)
## List of 4
## $ : chr [1:2] "gauss" "1777"
## $ : chr [1:2] "bayes" "1702"
## $ : chr [1:2] "pascal" "1623"
## $ : chr [1:2] "pearson" "1857"
As Filip explained in the instructional video, you can use lapply() on your own functions as well. You just need to code a new function and make sure it is available in the workspace. After that, you can use the function inside lapply() just as you did with base R functions.
In the previous exercise you already used lapply() once to convert the information about your favorite pioneering statisticians to a list of vectors composed of two character strings. Let’s write some code to select the names and the birth years separately.
The sample code already defines select_first(), a function that takes a vector as input and returns the first element of this vector.
select_first <- function(x) {
x[1]
}
names <- lapply(split_low, select_first)
select_second <- function(x) {
x[2]
}
years <- lapply(split_low, select_second)
Head over to the next exercise to learn about anonymous functions.
Writing your own functions and then using them inside lapply() is quite an accomplishment! But defining functions to use them only once is kind of overkill, isn’t it? That’s why you can use so-called anonymous functions in R.
Previously, you learned that functions in R are objects in their own right. This means that they aren’t automatically bound to a name. When you create a function, you can use the assignment operator to give the function a name. It’s perfectly possible, however, to not give the function a name. This is called an anonymous function:
# Named function
triple <- function(x) { 3 * x }
# Anonymous function with same implementation
function(x) { 3 * x }
## function(x) { 3 * x }
# Use anonymous function inside lapply()
lapply(list(1,2,3), function(x) { 3 * x })
## [[1]]
## [1] 3
##
## [[2]]
## [1] 6
##
## [[3]]
## [1] 9
names <- lapply(split_low, function(x) { x[1] } )
years <- lapply(split_low, function(x) { x[2] })
Now, there’s another way to solve the issue of using the select_*() functions only once: you can make a more generic function that can be used in more places. Find out more about this in the next exercise.
In the video, the triple() function was transformed to the multiply() function to allow for a more generic approach. lapply() provides a way to handle functions that require more than one argument, such as the multiply() function:
multiply <- function(x, factor) {
x * factor
}
lapply(list(1,2,3), multiply, factor = 3)
## [[1]]
## [1] 3
##
## [[2]]
## [1] 6
##
## [[3]]
## [1] 9
On the right we’ve included a generic version of the select functions that you’ve coded earlier: select_el(). It takes a vector as its first argument, and an index as its second argument. It returns the vector’s element at the specified index.
select_el <- function(x, index) {
x[index]
}
names <- lapply(split_low, select_el, index = 1)
years <- lapply(split_low, select_el, index = 2)
Your lapply skills are growing by the minute!
In all of the previous exercises, it was assumed that the functions that were applied over vectors and lists actually returned a meaningful result. For example, the tolower() function simply returns the strings with the characters in lowercase. This won’t always be the case. Suppose you want to display the structure of every element of a list. You could use the str() function for this, which returns NULL:
lapply(list(1, "a", TRUE), str)
## num 1
## chr "a"
## logi TRUE
## [[1]]
## NULL
##
## [[2]]
## NULL
##
## [[3]]
## NULL
This call actually returns a list, the same size as the input list, containing all NULL values. On the other hand calling
str(TRUE)
## logi TRUE
on its own prints only the structure of the logical to the console, not NULL. That’s because str() uses invisible() behind the scenes, which returns an invisible copy of the return value, NULL in this case. This prevents it from being printed when the result of str() is not assigned.
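A tiny, hypothetical function (quiet_one is not from the course) makes the effect visible: the value is still returned and can be assigned, it just isn’t auto-printed:
quiet_one <- function() {
invisible(1)  # return 1, but don't auto-print it at the console
}
quiet_one()          # nothing is printed
res <- quiet_one()
res
## [1] 1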
Feel free to experiment some more with your code in the console. Did you notice that lapply() always returns a list, no matter the input? This can be kind of annoying. In the next video tutorial you’ll learn about sapply() to solve this.
You can use sapply() similar to how you used lapply(). The first argument of sapply() is the list or vector X over which you want to apply a function, FUN. Potential additional arguments to this function are specified afterwards (...):
sapply(X, FUN, ...)
In the next couple of exercises, you’ll be working with the variable temp, that contains temperature measurements for 7 days. temp is a list of length 7, where each element is a vector of length 5, representing 5 measurements on a given day.
temp <- list(c(3, 7, 9, 6, -1), c(6, 9, 12, 13, 5), c(4, 8, 3, -1, -3), c(1, 4, 7, 2, -2), c(5, 7, 9, 4, 2), c(-3, 5, 8, 9, 4), c(3, 6, 9, 4, 1))
str(temp)
## List of 7
## $ : num [1:5] 3 7 9 6 -1
## $ : num [1:5] 6 9 12 13 5
## $ : num [1:5] 4 8 3 -1 -3
## $ : num [1:5] 1 4 7 2 -2
## $ : num [1:5] 5 7 9 4 2
## $ : num [1:5] -3 5 8 9 4
## $ : num [1:5] 3 6 9 4 1
lapply(temp, min)
## [[1]]
## [1] -1
##
## [[2]]
## [1] 5
##
## [[3]]
## [1] -3
##
## [[4]]
## [1] -2
##
## [[5]]
## [1] 2
##
## [[6]]
## [1] -3
##
## [[7]]
## [1] 1
sapply(temp, min)
## [1] -1 5 -3 -2 2 -3 1
lapply(temp, max)
## [[1]]
## [1] 9
##
## [[2]]
## [1] 13
##
## [[3]]
## [1] 8
##
## [[4]]
## [1] 7
##
## [[5]]
## [1] 9
##
## [[6]]
## [1] 9
##
## [[7]]
## [1] 9
sapply(temp, max)
## [1] 9 13 8 7 9 9 9
Can you tell the difference between the output of lapply() and sapply()? The former returns a list, while the latter returns a vector that is a simplified version of this list. Notice that this time, unlike in the cities example of the instructional video, the vector is not named.
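If the list you apply over has named elements, sapply() carries those names over to the result. A small made-up example:
sapply(list(mon = c(3, 7), tue = c(6, 9)), min)
## mon tue 
##   3   6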
Like lapply(), sapply() allows you to use self-defined functions and apply them over a vector or a list:
sapply(X, FUN, ...)
Here, FUN can be one of R’s built-in functions, but it can also be a function you wrote. This self-written function can be defined beforehand, or can be inserted directly as an anonymous function.
extremes_avg <- function(x) {
( min(x) + max(x) ) / 2
}
sapply(temp, extremes_avg)
## [1] 4.0 9.0 2.5 2.5 5.5 3.0 5.0
lapply(temp, extremes_avg)
## [[1]]
## [1] 4
##
## [[2]]
## [1] 9
##
## [[3]]
## [1] 2.5
##
## [[4]]
## [1] 2.5
##
## [[5]]
## [1] 5.5
##
## [[6]]
## [1] 3
##
## [[7]]
## [1] 5
Of course, you could have solved this exercise using an anonymous function, but this would require you to use the code inside the definition of extremes_avg() twice. Duplicating code should be avoided as much as possible!
In the previous exercises, you’ve seen how sapply() simplifies the list that lapply() would return by turning it into a vector. But what if the function you’re applying over a list or a vector returns a vector of length greater than 1? If you don’t remember from the video, don’t waste more time in the valley of ignorance and head over to the instructions!
extremes <- function(x) {
c(min = min(x), max = max(x))
}
sapply(temp, extremes)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## min -1 5 -3 -2 2 -3 1
## max 9 13 8 7 9 9 9
lapply(temp, extremes)
## [[1]]
## min max
## -1 9
##
## [[2]]
## min max
## 5 13
##
## [[3]]
## min max
## -3 8
##
## [[4]]
## min max
## -2 7
##
## [[5]]
## min max
## 2 9
##
## [[6]]
## min max
## -3 9
##
## [[7]]
## min max
## 1 9
Have a final look at the console and see how sapply() did a great job at simplifying the rather uninformative ‘list of vectors’ that lapply() returns. It actually returned a nicely formatted matrix!
It seems like we’ve hit the jackpot with sapply(). On all of the examples so far, sapply() was able to nicely simplify the rather bulky output of lapply(). But, as with life, there are things you can’t simplify. How does sapply() react?
We already created a function, below_zero(), that takes a vector of numerical values and returns a vector that only contains the values that are strictly below zero.
below_zero <- function(x) {
return(x[x < 0])
}
freezing_s <- sapply(temp, below_zero)
freezing_l <- lapply(temp, below_zero)
identical(freezing_s, freezing_l)
## [1] TRUE
Given that the length of the output of below_zero() changes for different input vectors, sapply() is not able to nicely convert the output of lapply() to a nicely formatted matrix. Instead, the output values of sapply() and lapply() are exactly the same, as shown by the TRUE output of identical().
You already have some apply tricks under your sleeve, but you’re surely hungry for some more, aren’t you? In this exercise, you’ll see how sapply() reacts when it is used to apply a function that returns NULL over a vector or a list.
A function print_info(), that takes a vector and prints the average of this vector, has already been created for you. It uses the cat() function.
print_info <- function(x) {
cat("The average temperature is", mean(x), "\n")
}
sapply(temp, print_info)
## The average temperature is 4.8
## The average temperature is 9
## The average temperature is 2.2
## The average temperature is 2.4
## The average temperature is 5.4
## The average temperature is 4.6
## The average temperature is 4.6
## [[1]]
## NULL
##
## [[2]]
## NULL
##
## [[3]]
## NULL
##
## [[4]]
## NULL
##
## [[5]]
## NULL
##
## [[6]]
## NULL
##
## [[7]]
## NULL
lapply(temp, print_info)
## The average temperature is 4.8
## The average temperature is 9
## The average temperature is 2.2
## The average temperature is 2.4
## The average temperature is 5.4
## The average temperature is 4.6
## The average temperature is 4.6
## [[1]]
## NULL
##
## [[2]]
## NULL
##
## [[3]]
## NULL
##
## [[4]]
## NULL
##
## [[5]]
## NULL
##
## [[6]]
## NULL
##
## [[7]]
## NULL
Notice here that, quite surprisingly, sapply() does not simplify the list of NULLs. That’s because the ‘vector version’ of a list of NULLs would simply be a NULL, which is no longer a vector with the same length as the input. Proceed to the next exercise.
This concludes the exercise set on sapply(). Head over to another video to learn all about vapply()!
Before you get your hands dirty with the third and last apply function that you’ll learn about in this intermediate R course, let’s take a look at its syntax. The function is called vapply(), and it has the following syntax:
vapply(X, FUN, FUN.VALUE, ..., USE.NAMES = TRUE)
Over the elements inside X, the function FUN is applied. The FUN.VALUE argument expects a template for the return argument of this function FUN. USE.NAMES is TRUE by default; in this case vapply() tries to generate a named array, if possible.
For the next set of exercises, you’ll be working on the temp list again, that contains 7 numerical vectors of length 5. We also coded a function basics() that takes a vector, and returns a named vector of length 3, containing the minimum, mean and maximum value of the vector respectively.
basics <- function(x) {
c(min = min(x), mean = mean(x), max = max(x))
}
vapply(temp, basics, numeric(3))
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## min -1.0 5 -3.0 -2.0 2.0 -3.0 1.0
## mean 4.8 9 2.2 2.4 5.4 4.6 4.6
## max 9.0 13 8.0 7.0 9.0 9.0 9.0
Notice how, just as with sapply(), vapply() neatly transfers the names that you specify in the basics() function to the row names of the matrix that it returns.
So far you’ve seen that vapply() mimics the behavior of sapply() if everything goes according to plan. But what if it doesn’t?
In the video, Filip showed you that there are cases where the structure of the output of the function you want to apply, FUN, does not correspond to the template you specify in FUN.VALUE. In that case, vapply() will throw an error that informs you about the misalignment between expected and actual output.
basics <- function(x) {
c(min = min(x), mean = mean(x), median = median(x), max = max(x))
}
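With this four-element basics(), the old numeric(3) template no longer matches, so vapply() refuses to proceed; the call is shown commented out here, and the error text may vary slightly between R versions:
# vapply(temp, basics, numeric(3))
## Error in vapply(temp, basics, numeric(3)) : values must be length 3,
##  but FUN(X[[1]]) result is length 4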
vapply(temp, basics, numeric(4))
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## min -1.0 5 -3.0 -2.0 2.0 -3.0 1.0
## mean 4.8 9 2.2 2.4 5.4 4.6 4.6
## median 6.0 9 3.0 2.0 5.0 5.0 4.0
## max 9.0 13 8.0 7.0 9.0 9.0 9.0
As highlighted before, vapply() can be considered a more robust version of sapply(), because you explicitly restrict the output of the function you want to apply. Converting your sapply() expressions in your own R scripts to vapply() expressions is therefore a good practice (and also a breeze!).
vapply(temp, max, numeric(1))
## [1] 9 13 8 7 9 9 9
vapply(temp, function(x, y) { mean(x) > y }, y = 5, logical(1))
## [1] FALSE TRUE FALSE FALSE TRUE FALSE FALSE
You’ve got no more excuses to use sapply() in the future!
Mastering R programming is not only about understanding its programming concepts. Having a solid understanding of a wide range of R functions is also important. This chapter introduces you to many useful functions for data structure manipulation, regular expressions, and working with times and dates.
Have another look at some useful math functions that R features:
abs(): Calculate the absolute value.
sum(): Calculate the sum of all the values in a data structure.
mean(): Calculate the arithmetic mean.
round(): Round the values to 0 decimal places by default. Try out ?round in the console for variations of round() and ways to change the number of digits to round to.
As a data scientist in training, you’ve estimated a regression model on the sales data for the past six months. After evaluating your model, you see that the training error of your model is quite regular, showing both positive and negative values. The error values are already defined in the workspace on the right (errors).
errors <- c(1.9, -2.6, 4.0, -9.5, -3.4, 7.3)
sum(round(abs(errors)))
## [1] 29
We went ahead and included some code on the right, but there’s still an error. Can you trace it and fix it?
In times of despair, help with functions such as sum() and rev() is a single command away; simply use ?sum and ?rev in the console.
vec1 <- c(1.5, 2.5, 8.4, 3.7, 6.3)
vec2 <- rev(vec1)
mean(c(abs(vec1), abs(vec2)))
## [1] 4.48
If you check out the documentation of mean(), you’ll see that only the first argument, x, should be a vector. If you also specify a second argument, R will match the arguments by position and expect a specification of the trim argument. Therefore, merging the two vectors is a must!
R features a bunch of functions to juggle around with data structures:
seq(): Generate sequences, by specifying the from, to, and by arguments.
rep(): Replicate elements of vectors and lists.
sort(): Sort a vector in ascending order. Works on numerics, but also on character strings and logicals.
rev(): Reverse the elements in a data structure for which reversal is defined.
str(): Display the structure of any R object.
append(): Merge vectors or lists.
is.*(): Check for the class of an R object.
as.*(): Convert an R object from one class to another.
unlist(): Flatten (possibly embedded) lists to produce a vector.
Remember the social media profile views data? We’ll use it again now, although in list form.
linkedin <- list(16, 9, 13, 5, 2, 17, 14)
facebook <- list(17, 7, 5, 16, 8, 13, 14)
li_vec <- unlist(linkedin)
fb_vec <- unlist(facebook)
Just as before, let’s switch roles. It’s up to you to see what unforgivable mistakes we’ve made. Go fix them!
rep(seq(1, 7, by = 2), times = 7)
## [1] 1 3 5 7 1 3 5 7 1 3 5 7 1 3 5 7 1 3 5 7 1 3 5 7 1 3 5 7
Debugging code is also a big part of the daily routine of a data scientist, and you seem to be great at it!
There is a popular story about young Gauss. As a pupil, he had a lazy teacher who wanted to keep the classroom busy by having them add up the numbers 1 to 100. Gauss came up with an answer almost instantaneously, 5050. On the spot, he had developed a formula for calculating the sum of an arithmetic series. There are more general formulas for calculating the sum of an arithmetic series with different starting values and increments. Instead of deriving such a formula, why not use R to calculate the sum of a sequence?
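As a warm-up, Gauss’s classroom task is a one-liner in R:
sum(1:100)
## [1] 5050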
seq1 <- seq(1, 500, 3)
seq2 <- seq(1200, 900, -7)
sum(c(seq1, seq2))
## [1] 87029
In their most basic form, regular expressions can be used to see whether a pattern exists inside a character string or a vector of character strings. For this purpose, you can use:
grepl(), which returns TRUE when a pattern is found in the corresponding character string.
grep(), which returns a vector of indices of the character strings that contain the pattern.
Both functions need a pattern and an x argument, where pattern is the regular expression you want to match for, and the x argument is the character vector from which matches should be sought.
In this and the following exercises, you’ll be querying and manipulating a character vector of email addresses! The vector emails has already been defined in the editor on the right so you can begin with the instructions straight away!
emails <- c("john.doe@ivyleague.edu", "education@world.gov", "dalai.lama@peace.org",
"invalid.edu", "quant@bigdatacollege.edu", "cookie.monster@sesame.tv")
grepl("edu", emails)
## [1] TRUE TRUE FALSE TRUE TRUE FALSE
hits <- grep("edu", emails)
emails[hits]
## [1] "john.doe@ivyleague.edu" "education@world.gov"
## [3] "invalid.edu" "quant@bigdatacollege.edu"
You can probably guess what we’re trying to achieve here: select all the emails that end with “.edu”. However, the strings education@world.gov and invalid.edu were also matched. Let’s see in the next exercise what you can do to improve our pattern and remove these false positives.
You can use the caret, ^, and the dollar sign, $, to match content at the start and the end of a string, respectively. This could take us one step closer to a correct pattern for matching only the “.edu” email addresses from our list of emails. But there’s more that can be added to make the pattern more robust:
@, because a valid email must contain an at-sign.
.*, which matches any character (the .) zero or more times (*). Both the dot and the asterisk are metacharacters. You can use them to match any character between the at-sign and the “.edu” portion of an email address.
\\.edu$, to match the “.edu” part of the email at the end of the string. The \\ part escapes the dot: it tells R that you want to use the . as an actual character.
grepl("@.*\\.edu$", emails)
## [1] TRUE FALSE FALSE FALSE TRUE FALSE
hits <- grep("@.*\\.edu$", emails)
emails[hits]
## [1] "john.doe@ivyleague.edu" "quant@bigdatacollege.edu"
A careful construction of our regular expression leads to more meaningful matches. However, even our robust email selector will often match some incorrect email addresses (for instance kiara@@fakemail.edu). Let’s not worry about this too much and continue with sub() and gsub() to actually edit the email addresses!
While grep() and grepl() were used to simply check whether a regular expression could be matched with a character vector, sub() and gsub() take it one step further: you can specify a replacement argument. If inside the character vector x, the regular expression pattern is found, the matching element(s) will be replaced with replacement. sub() only replaces the first match, whereas gsub() replaces all matches.
Suppose that the emails vector you’ve been working with is an excerpt of DataCamp’s email database. Why not offer the owners of the .edu email addresses a new email address on the datacamp.edu domain? This could be quite a powerful marketing stunt: Online education is taking over traditional learning institutions! Convert your email and be a part of the new generation!
sub("@.*\\.edu$", "@datacamp.edu", emails)
## [1] "john.doe@datacamp.edu" "education@world.gov"
## [3] "dalai.lama@peace.org" "invalid.edu"
## [5] "quant@datacamp.edu" "cookie.monster@sesame.tv"
Notice how only the valid .edu addresses are changed while the other emails remain unchanged. To get a taste of other things you can accomplish with regex, head over to the next exercise.
Regular expressions are a typical concept that you’ll learn by doing and by seeing other examples. Before you rack your brains over the regular expression in this exercise, have a look at the new things that will be used:
.*: A usual suspect! It can be read as “any character that is matched zero or more times”.
\\s: Match a space. The “s” is normally a character, escaping it (\\) makes it a metacharacter.
[0-9]+: Match the numbers 0 to 9, at least once (+).
([0-9]+): The parentheses are used to make parts of the matching string available to define the replacement. Refer to () references using \\1, \\2, etc. in the replacement argument of sub().
The ([0-9]+) selects the entire number that comes before the word “nomination” in the string, and the entire match gets replaced by this number because of the \\1 that refers to the content inside the parentheses (see the sketch below). The next video will get you up to speed with times and dates in R!
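Here is a minimal, self-contained sketch of such a back-reference; the string and pattern below are made up purely for illustration:
note <- "A total of 12 nominations"
sub(".*\\s([0-9]+)\\snomination.*$", "\\1", note)
## [1] "12"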
In R, dates are represented by Date objects, while times are represented by POSIXct objects. Under the hood, however, these dates and times are simple numerical values. Date objects store the number of days since the 1st of January in 1970. POSIXct objects on the other hand, store the number of seconds since the 1st of January in 1970.
The 1st of January in 1970 is the common origin for representing times and dates in a wide range of programming languages. There is no particular reason for this; it is a simple convention. Of course, it’s also possible to create dates and times before 1970; the corresponding numerical values are simply negative in this case.
today <- Sys.Date()
unclass(today)
## [1] 18580
now <- Sys.time()
unclass(now)
## [1] 1605345735
Using R to get the current date and time is nice, but you should also know how to create dates and times from character strings. Find out how in the next exercises!
To create a Date object from a simple character string in R, you can use the as.Date() function. The character string has to obey a format that can be defined using a set of symbols (the examples correspond to 13 January, 1982):
%Y: 4-digit year (1982)
%y: 2-digit year (82)
%m: 2-digit month (01)
%d: 2-digit day of the month (13)
%A: weekday (Wednesday)
%a: abbreviated weekday (Wed)
%B: month (January)
%b: abbreviated month (Jan)
The following R commands will all create the same Date object for the 13th day in January of 1982:
as.Date("1982-01-13")
## [1] "1982-01-13"
as.Date("Jan-13-82", format = "%b-%d-%y")
## [1] "1982-01-13"
as.Date("13 January, 1982", format = "%d %B, %Y")
## [1] "1982-01-13"
Notice that the first line here did not need a format argument, because by default R matches your character string to the formats "%Y-%m-%d" or "%Y/%m/%d".
In addition to creating dates, you can also convert dates to character strings that use a different date notation. For this, you use the format() function. Try the following lines of code:
today <- Sys.Date()
format(Sys.Date(), format = "%d %B, %Y")
## [1] "14 November, 2020"
format(Sys.Date(), format = "Today is a %A!")
## [1] "Today is a Saturday!"
str1 <- "May 23, '96"
str2 <- "2012-03-15"
str3 <- "30/January/2006"
date1 <- as.Date(str1, format = "%b %d, '%y")
date2 <- as.Date(str2, format = "%Y-%m-%d")
date3 <- as.Date(str3, format = "%d/%B/%Y")
format(date1, "%A")
## [1] "Thursday"
format(date2, "%d")
## [1] "15"
format(date3, "%b %Y")
## [1] "Jan 2006"
You can use POSIXct objects, i.e. Time objects in R, in a similar fashion. Give it a try in the next exercise.
Similar to working with dates, you can use as.POSIXct() to convert from a character string to a POSIXct object, and format() to convert from a POSIXct object to a character string. Again, you have a wide variety of symbols:
%H: hours as a decimal number (00-23)
%I: hours as a decimal number (01-12)
%M: minutes as a decimal number
%S: seconds as a decimal number
%T: shorthand notation for the typical format %H:%M:%S
%p: AM/PM indicator
For a full list of conversion symbols, consult the strptime documentation in the console.
Again, as.POSIXct() uses a default format to match character strings. In this case, it’s %Y-%m-%d %H:%M:%S. In this exercise, time zones are left out of consideration.
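For instance, a string that already follows this default format needs no format argument:
as.POSIXct("2012-03-12 14:23:08")  # no format argument needed; the result uses your local time zone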
str1 <- "May 23, '96 hours:23 minutes:01 seconds:45"
str2 <- "2012-3-12 14:23:08"
time1 <- as.POSIXct(str1, format = "%B %d, '%y hours:%H minutes:%M seconds:%S")
time2 <- as.POSIXct(str2, format = "%Y-%m-%d %T")
format(time1, format = "%M")
## [1] "01"
format(time2, format = "%I:%M %p")
## [1] "02:23 PM"
Both Date and POSIXct R objects are represented by simple numerical values under the hood. This makes calculation with time and date objects very straightforward: R performs the calculations using the underlying numerical values, and then converts the result back to human-readable time information again.
You can increment and decrement Date objects, or do actual calculations with them (try it out in the console!):
today <- Sys.Date()
today + 1
## [1] "2020-11-15"
today - 1
## [1] "2020-11-13"
as.Date("2015-03-12") - as.Date("2015-02-27")
## Time difference of 13 days
To control your eating habits, you decided to write down the dates of the last five days that you ate pizza. In the workspace, these dates are defined as five Date objects, day1 to day5. The code on the right also contains a vector pizza with these 5 Date objects.
day1 <- as.Date("2020-09-13")
day2 <- as.Date("2020-09-15")
day3 <- as.Date("2020-09-20")
day4 <- as.Date("2020-09-26")
day5 <- as.Date("2020-10-01")
day5 - day1
## Time difference of 18 days
pizza <- c(day1, day2, day3, day4, day5)
day_diff <- diff(pizza)
mean(day_diff)
## Time difference of 4.5 days
Calculations using POSIXct objects are completely analogous to those using Date objects. Try to experiment with this code to increase or decrease POSIXct objects:
now <- Sys.time()
now + 3600 # add an hour
## [1] "2020-11-14 10:22:16 GMT"
now - 3600 * 24 # subtract a day
## [1] "2020-11-13 09:22:16 GMT"
Adding or subtracting time objects is also straightforward:
birth <- as.POSIXct("1879-03-14 14:37:23")
death <- as.POSIXct("1955-04-18 03:47:12")
einstein <- death - birth
einstein
## Time difference of 27792.51 days
You’re developing a website that requires users to log in and out. You want to know what is the total and average amount of time a particular user spends on your website. This user has logged in 5 times and logged out 5 times as well. These times are gathered in the vectors login and logout, which are already defined in the workspace.
login <- c(as.POSIXct("2020-09-17 10:18:04 UTC"), as.POSIXct("2020-09-22 09:14:18 UTC"), as.POSIXct("2020-09-22 12:21:51 UTC"), as.POSIXct("2020-09-22 12:37:24 UTC"), as.POSIXct("2020-09-24 21:37:55 UTC"))
logout <- c(as.POSIXct("2020-09-17 10:56:29 UTC"), as.POSIXct("2020-09-22 09:14:52 UTC"), as.POSIXct("2020-09-22 12:35:48 UTC"), as.POSIXct("2020-09-22 13:17:22 UTC"), as.POSIXct("2020-09-24 22:08:47 UTC"))
time_online <- logout - login
time_online
## Time differences in secs
## [1] 2305 34 837 2398 1852
sum(time_online)
## Time difference of 7426 secs
mean(time_online)
## Time difference of 1485.2 secs
The dates when a season begins and ends can vary depending on who you ask. People in Australia will tell you that spring starts on September 1st. The Irish people in the Northern hemisphere will swear that spring starts on February 1st, with the celebration of St. Brigid’s Day. Then there’s also the difference between astronomical and meteorological seasons: while astronomers are used to equinoxes and solstices, meteorologists divide the year into 4 fixed seasons that are each three months long. (source: www.timeanddate.com)
A vector astro, which contains character strings representing the dates on which the 4 astronomical seasons start, has been defined on your workspace. Similarly, a vector meteo has already been created for you, with the meteorological beginnings of a season.
astro <- c("20-Mar-2015", "25-Jun-2015", "23-Sep-2015", "22-Dec-2015")
names(astro) <- c("spring", "summer", "fall", "winter")
meteo <- c("March 1, 15", "June 1, 15", "September 1, 15", "December 1, 15")
names(meteo) <- c("spring", "summer", "fall", "winter")
astro_dates <- as.Date(astro, format = "%d-%b-%Y")
meteo_dates <- as.Date(meteo, format = "%B %d, %y")
max(abs(astro_dates - meteo_dates))
## Time difference of 24 days
Impressive! Great job on finishing this course!
R Markdown is an easy-to-use formatting language you can use to reveal insights from data and author your findings as a PDF, an HTML file, or a Shiny app. In this course, you’ll learn how to create and modify each element of a Markdown file, including the code, text, and metadata. You’ll analyze data with dplyr, create visualizations with ggplot2, and author your analyses and plots as reports. You’ll gain hands-on experience building reports as you work with real-world data from the International Finance Corporation (IFC), learning how to efficiently organize reports using code chunk options, create lists and tables, and include a table of contents. By the end of the course, you’ll have the skills you need to add your brand’s fonts and colors using parameters and Cascading Style Sheets (CSS) to make your reports stand out.
In this chapter, you’ll learn about the three components of a Markdown file: the code, the text, and the metadata. You’ll also learn to add and modify each of these elements to your own reports, as you create your first Markdown files.
Throughout the course, you’ll be working on creating an investment report using two datasets from the World Bank IFC. The first dataset, investment_annual_summary, provides the summary of the dollars in millions provided to each region for each fiscal year, from 2012 to 2018. To get started on your report, you first want to print out the dataset.
To create your report, you’ll need to edit the Markdown file shown on the right of the DataCamp console, as described in the instructions, then press the green “Knit HTML” button to knit the file and see the resulting HTML file. We’ll discuss other output types later in the course.
Because this textbook is written using R Markdown, I’ll be including the R Markdown output for this chapter here.
In each exercise, the first code chunk in the Markdown file will load the readr package and the datasets you’ll be using in the exercise. You’ll learn more about the details of this code chunk later in the course, but you won’t need to modify it for any of the exercises in this chapter.
As you can see from the output, adding the dataset name to the code chunk and knitting the document resulted in the dataset being printed in your report. You can also see from the report that the dataset has 42 rows, and that each row contains information about the fiscal_year, region, and dollars_in_millions for the investments made that year.
When creating your own reports, one of the first things you’ll want to do is add some code! In the video, we discussed how you can add your own code by adding code chunks. Previously, we looked at the investment_annual_summary dataset we’ll be using throughout the course. In this exercise, let’s take a look at the annual summary dataset as well as the other dataset we’ll be using, investment_services_projects.
From the report, you can see that the investment_services_projects dataset contains information about each individual project and includes a lot more variables than the investment_annual_summary dataset. You’ll be exploring both of these datasets in much more depth in the coming exercises as you create your investment report!
Previously, you added the names of the datasets you’ll be using to build your report to the Markdown file that will be used to create the report. Now, you’ll add some headers and text to the document to provide some additional detail about the datasets to the audience who will read the report.
As you can see from the report, the more hashes you place in front of the text, the smaller the header will be when you knit the file.
The YAML header contains the metadata for your report and includes information like the title, author, and output format. In this exercise, you’ll update your report by adding some more detail about who created the report and when it was created.
Remember, you can add the date to your report manually by entering the date as a string. However, adding the date with Sys.Date() is much more efficient and scalable, since it will ensure that the date is updated automatically each time you edit and knit your file.
Now that your report includes some more high-level detail, you’d like to include the date using a different format. Be sure to refer to the tables of date formatting options from the video below.
Source: DataCamp
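For reference, a YAML header along these lines (the title and author values are placeholders) would insert a formatted date automatically each time the file is knit:
---
title: "Investment Report"
author: "Your Name"
date: "`r format(Sys.Date(), '%d %B, %Y')`"
output: html_document
---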
Keep in mind that there are many text and numeric options for formatting the date of your report. Now that you understand each of the elements of the Markdown file, and how to modify them, you’re ready to add more detail to your Investment Report!
In this chapter, you’ll use dplyr to begin to analyze the World Bank IFC datasets and include the analyses in your report. You’ll then create visualizations of the data using ggplot2 and learn to modify how the plots display in your knit report.
Previously, you learned how to filter the data to find out more information about projects that occurred in Indonesia. Now, you’ll build a report that provides this information for another country that’s included in the investment_services_projects data. In this exercise, you’ll begin filtering the data to gather information about projects that occurred in Brazil.
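A sketch of that filtering step might look like the following; the country column name and the object name are assumptions made for illustration:
library(dplyr)  # assumed to be loaded in the report's setup chunk
brazil_investment_projects <- investment_services_projects %>%
  filter(country == "Brazil")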
From the report, you can see that there were a total of 88 projects in Brazil during the 2012 to 2018 fiscal years.
Now that you’ve filtered the data for the projects in a specific country, you can filter the results further to look at all projects that occurred in the 2018 fiscal year. Recall, the fiscal year starts on July 1st of the previous year and ends on June 30th of the year of interest.
Now your report includes information about all projects that occurred in Brazil during the 2012 to 2018 fiscal years and the projects that occurred in the 2018 fiscal year. Next, you’ll learn how to create and add plots to these reports, so that your audience is able to visualize this information when they read the final report.
In this exercise, you’ll use summarize() and brazil_investment_projects_2018 to find the total investment amount for all projects in Brazil in the 2018 fiscal year. Then, you’ll add text to the report to include the information and reference the code results in the text, so that the calculated amount is printed in the text of the report when you knit the file.
Now, if you update the analysis and the brazil_investment_projects_2018_total amount changes, you won’t need to modify the amount manually in the sentence that describes the information. Instead, since the object name is referenced in the text, the report will update automatically the next time the report is knit.
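A rough sketch of this pattern (the summarized column name is an assumption about the dataset):
brazil_investment_projects_2018_total <- brazil_investment_projects_2018 %>%
  summarize(total = sum(dollars_in_millions, na.rm = TRUE))
In the body text of the report, an inline reference such as `r brazil_investment_projects_2018_total` then prints the calculated amount, and it is recomputed every time the file is knit.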
Now that you have all of the data ready for the report you’re creating, you can start making plots that will be included in the report to help your audience visualize the data when they’re reading the report. You’ll start by creating a line plot of the investment_annual_summary data.
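Using the columns mentioned earlier (fiscal_year, region, and dollars_in_millions), a sketch of such a line plot might look like this:
library(ggplot2)
ggplot(investment_annual_summary,
       aes(x = fiscal_year, y = dollars_in_millions, color = region)) +
  geom_line() +
  labs(x = "Fiscal Year", y = "Dollars in Millions")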
Notice that for each region, the highest investments were made during the 2013 or 2014 fiscal years. This is an example of an insight you might want to include as text in the final report.
Previously, you created a line plot using the investment_annual_summary. Now, you’ll use the data you filtered to create scatterplots that summarize the information about the projects that occurred in Brazil.
Notice the warning message in the report that tells us that there were 7 rows removed that contain missing values. You’ll learn how to handle warning messages later in the course, but this may be something you’d want to include in the text of your report to specify which data points were excluded from the plot and why.
Now, you’ll create a line plot using the data that was filtered for all projects that occurred in Brazil in the 2018 fiscal year. In the previous exercises, the labels were added for you. While creating this plot, you’ll gain some experience adding your own labels that will appear when you knit the report.
Now your report has plots for the Investment Annual Summary data, as well as for all projects in Brazil during the 2012 to 2018 fiscal years and all projects in the 2018 fiscal year. Now that you’ve created these plots for your report, you’ll learn some of the options you have for formatting how the plots appear in the final report.
Now that your plots are ready to include in your report, you can modify how they appear once the file is knit. Previously, you learned the difference between setting options globally and setting them locally. In this exercise, you’ll set options for the figures globally, which means the options will apply to all figures throughout the code chunks in the report.
Recall, the options for fig.align are 'left', 'right', and 'center'. These options can be set globally or locally, depending on whether or not you want all figures to appear uniformly throughout the report, or if you want the options to vary by figure.
When creating a report, you may want to set the chunk options locally so that the figure display in the final report varies. The investment_annual_summary data provides helpful background information, but the focus of the report is on projects in Brazil. In this exercise, you’ll modify the chunk options locally so that the plots that display information about projects in Brazil appear slightly larger in the final report than the plot that provides the overview of the Investment Annual Summary data.
You can see in the final report that the plots that display information about projects in Brazil are slightly larger than the plot that provides the Investment Annual Summary overview. Notice that the fig.align = 'center' option remained in the setup code chunk at the top, so this option has been set globally and determines the alignment for all figures in the report.
Also note that you can override globally set options with locally set ones. For example, if you wanted all figures to display at 50% width except for two figures you wanted to be larger, you could include out.width = '50%' in the global options but out.width = '95%' in the local options of the two figures.
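As a sketch, global figure defaults can be set once with knitr::opts_chunk$set() in the setup chunk and then overridden per chunk:
# Global defaults, typically placed in the setup chunk
knitr::opts_chunk$set(fig.align = 'center', out.width = '50%')
# A single chunk can then override the width locally by adding
# out.width = '95%' to its own chunk header.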
Now that the figures have been modified, you’ll add some captions to label the figures and provide some information about what is displayed in each plot.
Now you can see that each figure is labeled in the final report and includes a caption that describes what each plot displays.
Now that you’ve learned how to add, label, and modify code chunks, you’ll learn about code chunk options. You can use these to determine whether the code and results appear in the knit report. You’ll also discover how to create lists and tables to include in your report.
Previously, you learned how to add text to your Markdown file to include additional information for your audience. Now, you’ll create a bulleted list to specify which regions are included in the investment_annual_summary data. Refer to the image below from the video to recall the list of regions that should be included in your table.
Source: DataCamp
Remember, you can structure the list formatting by adding indentation before an item on the list.
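For example, a bulleted list with an indented sub-item is written like this in the Markdown file (the names below are placeholders, not the actual regions):
- Region One
- Region Two
  - A sub-item under Region Two
- Region Three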
When adding a list to your report, you can use either bulleted or numbered lists. In this exercise, you’ll modify the bulleted list of regions from the previous exercise to create a numbered list.
Adding a numbered list to the report is a helpful way to quickly display the total number of regions that are included in the data.
Previously, you printed the datasets used in your report to your report so that the audience was able to look through the data themselves. Now, you’ll create a table of the investment_region_summary to display this information more clearly to your audience. The investment_region_summary provides the total of all investments for each region from the 2012 to 2018 fiscal years.
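One common way to render a data frame as a formatted table in the knit report is knitr::kable(); as a sketch (the exercise may use a different helper):
knitr::kable(investment_region_summary)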
Now your report begins with a list of all regions in the investment_annual_summary data, as well as a table that summarizes the total investments that were made to each region across the 2012 to 2018 fiscal years.
include=FALSE: Code runs but does not appear in the report; results do not appear.
echo=FALSE: Code runs but does not appear in the report; results appear.
eval=FALSE: Code appears in the report but does not run; results do not appear.
The other option you recently learned is collapse, which was the only option you’ve learned so far that has a default of FALSE.
By default, the collapse option is set to FALSE, and the code and any output appear in the knit file in separate blocks. You encountered this earlier when creating plots of the data. In this exercise, you’ll modify the Markdown file so that the code and resulting warning messages appear in the same block.
Notice that the warning messages for the plots now appear as comments in the same block as the code that creates the plot, instead of being separated into an individual block. You’ll learn more about warning messages and how to specify whether or not they are included in the report soon!
The exercises in the course have used include = FALSE to prevent the code and results of the setup and data chunks from appearing in the knit report. Although you won’t modify those options, since the code and results from those chunks should be excluded from the report, it’s important to note how they impact the final report.
In this exercise, you’ll use the echo option to modify whether or not the code appears in the report.
Even though the code no longer displays in the knit report, the results of the code still appear. This is an option you’ll want to use if your report is written for a non-technical audience that is interested in the results but not necessarily in the code itself. Notice that the warning messages that mention that data was excluded from the plots still appear in the knit report. Next, you’ll learn how to modify whether or not these warning messages are included in the report!
In the past, you haven’t encountered messages in the report because the include option has been set to FALSE in the data chunk to prevent the code or the results from the code from appearing in the report. In this exercise, you’ll use the message option to prevent messages from appearing in the report, while still including the code in the report.
Notice that the code for the data chunk is now included without any of the messages that you would otherwise see when loading the data or packages used in the report.
Previously, you used the collapse option so that the code and resulting warning messages appear in the same block in the knit report. In this exercise, you’ll use the warning option to prevent warnings from appearing in the final report.
Notice that DataCamp added a sentence before each of the code chunks that were impacted to let the audience know that: ‘Projects that do not have an associated investment amount are excluded from the plot’. If you are excluding any warning messages from the report, it’s important to include information about any data that is not included and why.
In this final chapter, you’ll learn how to customize your report by adding a table of contents and adding a CSS file to the YAML header, to personalize reports with your brand’s fonts and colors. You’ll also learn how to efficiently create new reports from your data using parameters, which will save you time from manually updating existing reports to create new ones.
Adding a table of contents to your report is a useful way to help your audience navigate through the different sections of your report. It provides an overview of what your report contains, and can help your audience navigate through the report easily. In this exercise, you’ll add a table of contents to your report to provide an overview of the topics that the report includes.
Now your audience can reference the table of contents at the beginning of your report to understand the information that will be covered in the report.
Now that you’ve added a table of contents, you’ll modify how it appears in the report and which information it includes. You’ll use toc_depth to specify the depth of headers that will be included in the table of contents and number_sections to add section numbering for the headers in the report.
The headers were modified to start with a single hash before adding section numbering because, if the largest headers in the report start with two hashes, the section numbering will start with zeros. Remember that, for toc_depth, the default depth is 3 for HTML documents and 2 for PDF documents.
When toc_float is included, the table of contents appears on the left side of the document and remains visible while the reader scrolls through the document. By default, it displays the largest header, will expand as someone is reading through the report or interacting with the table of contents to navigate to another section, and animates page scrolls when navigating the report.
In this exercise, you’ll add toc_float and modify these settings using the collapsed and smooth_scroll fields so that the full table of contents remains visible and page scrolls are not animated.
If you want to add toc_float to the report and keep the default collapsed and smooth_scroll options (both turned on, i.e. TRUE, by default), you can set the toc_float field to true in the YAML header.
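Putting these table-of-contents options together, the output section of the YAML header might look like this (the values are chosen for illustration):
output:
  html_document:
    toc: true
    toc_depth: 2
    number_sections: true
    toc_float:
      collapsed: false
      smooth_scroll: false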
In this exercise, you’ll add a parameter for country to the report and modify the existing code so that you can create new reports about the investment projects for any country included in the investment_services_projects data.
Now that you’ve reviewed the code for your report, you’ll review the text and the YAML header of the document before creating a new report using the country parameter.
Now that you’ve added a parameter to the document, you’ll create a new report for Bangladesh from the investment_services_projects data using the country parameter.
Before knitting the report, you’ll review and modify the text of the document to ensure that the knit report will reflect the country that is specified in the parameter.
Notice that by only modifying the parameter, you were able to create a new report for Bangladesh that includes all of the information that was previously provided for Brazil. When creating new documents using parameters, there may be some information you want to add that is specific to the new country and report, but parameters are an efficient and quick way to get started!
Previously, you added a parameter for country to create new reports to summarize information about the investment projects for any country included in the investment_services_projects data. Now, you’ll add parameters for the fiscal year and modify the existing code so that you can create new reports about the investment projects for any country and fiscal year from the investment_services_projects data.
Now, you’ll be able to create a report for any country and fiscal year by modifying only the parameters in the YAML header!
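As a sketch, with parameter names and filter columns that are assumptions rather than the course’s exact ones, the YAML header could declare:
params:
  country: Brazil
  fy: 2018
and a code chunk could then filter on those values:
country_year_projects <- investment_services_projects %>%
  filter(country == params$country, fiscal_year == params$fy)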
Now that you’ve added parameters to account for the fiscal year, you’ll create a new report for another country and fiscal year from the investment_services_projects data.
Now you’re able to create new reports using multiple parameters by modifying only the information in the YAML header. As before, there may be some information you want to add that is specific to the country and fiscal year, but these parameters will help you get started on a new report!
Now that you’ve learned how to customize the style of your report, you’ll begin to add specific fonts and colors to your existing report.
Notice that the document background and code chunks in the knit file are modified to reflect the various fonts and colors you listed.
In this exercise, you’ll continue to add styles by modifying the table of contents and header sections of the Markdown file.
Notice the difference in appearance of the table of contents section. If you want to customize any of these sections further, you can add the properties you’ve used so far to any of the sections listed. For example, you can add opacity to the pre section to customize the code chunks, the same way that you did with the #header section.
Rather than adding styles to each Markdown file within the file, you can create and reference a Cascading Style Sheet (CSS) file each time you create a new file that contains particular styles and fonts.
In this exercise, the styles you’ve specified have been added to a CSS file called styles.css. You’ll reference this file in the YAML header instead of specifying the styles within the Markdown file.
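Referencing the CSS file is then a small addition to the output options in the YAML header, along these lines:
output:
  html_document:
    css: styles.css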
Notice that, although the styles are no longer listed within the Markdown file, the knit report still reflects all of the styles you’ve been adding over the past few exercises.
Say you’ve found a great dataset and would like to learn more about it. How can you start to answer the questions you have about the data? You can use dplyr to answer those questions—it can also help with basic transformations of your data. You’ll also learn to aggregate your data and add, remove, or change the variables. Along the way, you’ll explore a dataset containing information about counties in the United States. You’ll finish the course by applying these tools to the babynames dataset to explore trends of baby names in the United States.
Learn verbs you can use to transform your data, including select, filter, arrange, and mutate. You’ll use these functions to modify the counties dataset to view particular observations and answer questions about the data.
# The counties data set and the dplyr package
counties <- read.csv("data_acs2015_county_data.csv")
library("dplyr")
## Warning: package 'dplyr' was built under R version 3.5.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Take a look at the counties dataset using the glimpse() function.
glimpse(counties)
## Observations: 3,141
## Variables: 40
## $ census_id <int> 1001, 1003, 1005, 1007, 1009, 1011, 1013, 1015, ...
## $ state <fct> Alabama, Alabama, Alabama, Alabama, Alabama, Ala...
## $ county <fct> Autauga, Baldwin, Barbour, Bibb, Blount, Bullock...
## $ region <fct> South, South, South, South, South, South, South,...
## $ metro <fct> , , , , , , , , , , , , , , , , , , , , , , , , ,
## $ population <int> 55221, 195121, 26932, 22604, 57710, 10678, 20354...
## $ men <int> 26745, 95314, 14497, 12073, 28512, 5660, 9502, 5...
## $ women <int> 28476, 99807, 12435, 10531, 29198, 5018, 10852, ...
## $ hispanic <dbl> 2.6, 4.5, 4.6, 2.2, 8.6, 4.4, 1.2, 3.5, 0.4, 1.5...
## $ white <dbl> 75.8, 83.1, 46.2, 74.5, 87.9, 22.2, 53.3, 73.0, ...
## $ black <dbl> 18.5, 9.5, 46.7, 21.4, 1.5, 70.7, 43.8, 20.3, 40...
## $ native <dbl> 0.4, 0.6, 0.2, 0.4, 0.3, 1.2, 0.1, 0.2, 0.2, 0.6...
## $ asian <dbl> 1.0, 0.7, 0.4, 0.1, 0.1, 0.2, 0.4, 0.9, 0.8, 0.3...
## $ pacific <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0...
## $ citizens <int> 40725, 147695, 20714, 17495, 42345, 8057, 15581,...
## $ income <int> 51281, 50254, 32964, 38678, 45813, 31938, 32229,...
## $ income_err <int> 2391, 1263, 2973, 3995, 3141, 5884, 1793, 925, 2...
## $ income_per_cap <int> 24974, 27317, 16824, 18431, 20532, 17580, 18390,...
## $ income_per_cap_err <int> 1080, 711, 798, 1618, 708, 2055, 714, 489, 1366,...
## $ poverty <dbl> 12.9, 13.4, 26.7, 16.8, 16.7, 24.6, 25.4, 20.5, ...
## $ child_poverty <dbl> 18.6, 19.2, 45.3, 27.9, 27.2, 38.4, 39.2, 31.6, ...
## $ professional <dbl> 33.2, 33.1, 26.8, 21.5, 28.5, 18.8, 27.5, 27.3, ...
## $ service <dbl> 17.0, 17.7, 16.1, 17.9, 14.1, 15.0, 16.6, 17.7, ...
## $ office <dbl> 24.2, 27.1, 23.1, 17.8, 23.9, 19.7, 21.9, 24.2, ...
## $ construction <dbl> 8.6, 10.8, 10.8, 19.0, 13.5, 20.1, 10.3, 10.5, 1...
## $ production <dbl> 17.1, 11.2, 23.1, 23.7, 19.9, 26.4, 23.7, 20.4, ...
## $ drive <dbl> 87.5, 84.7, 83.8, 83.2, 84.9, 74.9, 84.5, 85.3, ...
## $ carpool <dbl> 8.8, 8.8, 10.9, 13.5, 11.2, 14.9, 12.4, 9.4, 11....
## $ transit <dbl> 0.1, 0.1, 0.4, 0.5, 0.4, 0.7, 0.0, 0.2, 0.2, 0.2...
## $ walk <dbl> 0.5, 1.0, 1.8, 0.6, 0.9, 5.0, 0.8, 1.2, 0.3, 0.6...
## $ other_transp <dbl> 1.3, 1.4, 1.5, 1.5, 0.4, 1.7, 0.6, 1.2, 0.4, 0.7...
## $ work_at_home <dbl> 1.8, 3.9, 1.6, 0.7, 2.3, 2.8, 1.7, 2.7, 2.1, 2.5...
## $ mean_commute <dbl> 26.5, 26.4, 24.1, 28.8, 34.9, 27.5, 24.6, 24.1, ...
## $ employed <int> 23986, 85953, 8597, 8294, 22189, 3865, 7813, 474...
## $ private_work <dbl> 73.6, 81.5, 71.8, 76.8, 82.0, 79.5, 77.4, 74.1, ...
## $ public_work <dbl> 20.9, 12.3, 20.8, 16.1, 13.5, 15.1, 16.2, 20.8, ...
## $ self_employed <dbl> 5.5, 5.8, 7.3, 6.7, 4.2, 5.4, 6.2, 5.0, 2.8, 7.9...
## $ family_work <dbl> 0.0, 0.4, 0.1, 0.4, 0.4, 0.0, 0.2, 0.1, 0.0, 0.5...
## $ unemployment <dbl> 7.6, 7.5, 17.6, 8.3, 7.7, 18.0, 10.9, 12.3, 8.9,...
## $ land_area <dbl> 594.44, 1589.78, 884.88, 622.58, 644.78, 622.81,...
Select the following four columns from the counties variable:
state
county
population
poverty
You don’t need to save the result to a variable.
counties %>%
select(state, county, population, poverty)
Recall that if you want to keep the data you’ve selected, you can use assignment to create a new table.
Here you see the counties_selected dataset with a few interesting variables selected. These variables (private_work, public_work, and self_employed) describe whether people work for the government, for private companies, or for themselves.
In these exercises, you’ll sort these observations to find the most interesting cases.
counties_selected <- counties %>%
select(state, county, population, private_work, public_work, self_employed)
counties_selected %>%
arrange(desc(public_work))
We sorted the counties in descending order according to public_work. What if we were interested in looking at counties that have a large population, or counties within a specific state? Let’s take a look at that next!
You use the filter() verb to get only observations that match a particular condition, or match multiple conditions.
counties_selected <- counties %>%
select(state, county, population)
counties_selected %>%
filter(population > 1000000)
counties_selected %>%
filter(state == "California",
population > 1000000)
## state county population
## 1 California Alameda 1584983
## 2 California Contra Costa 1096068
## 3 California Los Angeles 10038388
## 4 California Orange 3116069
## 5 California Riverside 2298032
## 6 California Sacramento 1465832
## 7 California San Bernardino 2094769
## 8 California San Diego 3223096
## 9 California Santa Clara 1868149
Now you know that there are 9 counties in the state of California with a population greater than one million. In the next exercise, you’ll practice filtering and then sorting a dataset to focus on specific observations!
We’re often interested in both filtering and sorting a dataset, to focus on observations of particular interest to you. Here, you’ll find counties that are extreme examples of what fraction of the population works in the private sector.
counties_selected <- counties %>%
select(state, county, population, private_work, public_work, self_employed)
counties_selected %>%
filter(state == "Texas", population > 10000) %>%
arrange(desc(private_work))
## state county population private_work public_work self_employed
## 1 Texas Gregg 123178 84.7 9.8 5.4
## 2 Texas Collin 862215 84.1 10.0 5.8
## 3 Texas Dallas 2485003 83.9 9.5 6.4
## 4 Texas Harris 4356362 83.4 10.1 6.3
## 5 Texas Andrews 16775 83.1 9.6 6.8
## 6 Texas Tarrant 1914526 83.1 11.4 5.4
## 7 Texas Titus 32553 82.5 10.0 7.4
## 8 Texas Denton 731851 82.2 11.9 5.7
## 9 Texas Ector 149557 82.0 11.2 6.7
## 10 Texas Moore 22281 82.0 11.7 5.9
## 11 Texas Jefferson 252872 81.9 13.4 4.5
## 12 Texas Fort Bend 658331 81.7 12.5 5.7
## 13 Texas Panola 23900 81.5 11.7 6.9
## 14 Texas Midland 151290 81.2 11.4 7.1
## 15 Texas Potter 122352 81.2 12.4 6.2
## 16 Texas Frio 18168 81.1 15.4 3.4
## 17 Texas Johnson 155450 80.9 13.3 5.7
## 18 Texas Smith 217552 80.9 12.7 6.3
## 19 Texas Orange 83217 80.8 13.7 5.4
## 20 Texas Harrison 66417 80.6 13.9 5.4
## 21 Texas Brazoria 331741 80.5 14.4 5.0
## 22 Texas Calhoun 21666 80.5 12.8 6.7
## 23 Texas Austin 28886 80.4 11.8 7.7
## 24 Texas Montgomery 502586 80.4 11.7 7.8
## 25 Texas Jim Wells 41461 80.3 12.6 7.1
## 26 Texas Hutchinson 21858 80.2 13.9 5.8
## 27 Texas Ochiltree 10642 80.1 10.4 9.2
## 28 Texas Victoria 90099 80.1 12.5 7.1
## 29 Texas Hardin 55375 80.0 15.7 4.2
## 30 Texas Bexar 1825502 79.6 14.8 5.6
## 31 Texas Gray 22983 79.6 13.3 7.0
## 32 Texas McLennan 241505 79.5 14.7 5.7
## 33 Texas Newton 14231 79.4 15.8 4.8
## 34 Texas Ward 11225 79.3 15.0 5.3
## 35 Texas Grimes 26961 79.2 15.0 5.7
## 36 Texas Lamar 49566 79.2 13.9 6.7
## 37 Texas Rusk 53457 79.2 12.8 7.9
## 38 Texas Wharton 41264 78.9 12.3 8.6
## 39 Texas Williamson 473592 78.9 14.9 6.0
## 40 Texas Marion 10248 78.7 15.2 5.9
## 41 Texas Hood 53171 78.6 11.0 9.9
## 42 Texas Hunt 88052 78.6 14.4 7.0
## 43 Texas Ellis 157058 78.5 14.2 7.3
## 44 Texas Upshur 40096 78.5 13.0 8.3
## 45 Texas Grayson 122780 78.4 13.8 7.5
## 46 Texas Liberty 77486 78.4 13.6 7.8
## 47 Texas Atascosa 47050 78.1 14.3 7.5
## 48 Texas Waller 45847 78.1 16.3 5.6
## 49 Texas Deaf Smith 19245 78.0 13.6 7.9
## 50 Texas Chambers 37251 77.9 15.9 6.1
## 51 Texas Jasper 35768 77.7 15.5 6.7
## 52 Texas Scurry 17238 77.7 16.2 6.0
## 53 Texas Parmer 10004 77.6 12.5 9.3
## 54 Texas San Patricio 66070 77.6 17.9 4.4
## 55 Texas Taylor 134435 77.5 16.5 5.7
## 56 Texas Fayette 24849 77.4 13.4 9.0
## 57 Texas Kaufman 109289 77.4 16.7 5.8
## 58 Texas Matagorda 36598 77.4 15.6 7.0
## 59 Texas Comal 119632 77.2 13.2 9.1
## 60 Texas Gonzales 20172 77.2 13.2 9.1
## 61 Texas Hill 34923 77.1 15.0 7.6
## 62 Texas Hockley 23322 77.1 15.6 7.1
## 63 Texas Guadalupe 143460 77.0 17.9 4.8
## 64 Texas Cooke 38761 76.9 14.8 8.1
## 65 Texas Rockwall 85536 76.9 16.2 6.7
## 66 Texas Wise 61243 76.9 14.9 8.1
## 67 Texas Gaines 18916 76.8 12.0 11.2
## 68 Texas Tom Green 115056 76.8 16.5 6.6
## 69 Texas Erath 40039 76.7 15.3 7.9
## 70 Texas Hopkins 35645 76.7 13.2 9.8
## 71 Texas Palo Pinto 27921 76.7 14.1 9.0
## 72 Texas Nueces 352060 76.6 16.3 6.9
## 73 Texas Parker 121418 76.6 15.2 8.0
## 74 Texas Lubbock 290782 76.5 16.8 6.5
## 75 Texas Travis 1121645 76.5 16.0 7.4
## 76 Texas Eastland 18328 76.4 15.1 7.8
## 77 Texas Montague 19478 76.4 13.3 9.9
## 78 Texas Angelina 87748 76.2 17.8 5.9
## 79 Texas Jackson 14486 76.1 13.6 9.3
## 80 Texas Nolan 15061 76.1 17.1 6.5
## 81 Texas Cass 30328 76.0 15.9 8.1
## 82 Texas Duval 11577 76.0 17.2 6.4
## 83 Texas Hale 35504 75.9 16.4 7.4
## 84 Texas Galveston 308163 75.8 18.2 5.8
## 85 Texas Bosque 17971 75.7 12.8 11.2
## 86 Texas Henderson 79016 75.6 13.9 10.4
## 87 Texas Bowie 93155 75.5 19.4 5.1
## 88 Texas Aransas 24292 75.4 12.4 12.2
## 89 Texas Randall 126782 75.4 17.5 7.1
## 90 Texas Morris 12700 75.3 17.5 7.1
## 91 Texas Rains 11037 75.3 15.6 9.0
## 92 Texas Shelby 25725 75.3 16.8 7.8
## 93 Texas Brown 37833 75.2 16.5 8.2
## 94 Texas Sabine 10440 75.1 18.1 6.8
## 95 Texas Wood 42712 75.0 15.6 9.0
## 96 Texas Gillespie 25398 74.8 10.5 14.3
## 97 Texas Navarro 48118 74.8 17.3 7.8
## 98 Texas Caldwell 39347 74.7 18.6 6.7
## 99 Texas Tyler 21462 74.6 19.8 5.3
## 100 Texas Webb 263251 74.6 19.4 5.8
## 101 Texas Wichita 131957 74.6 19.2 6.0
## 102 Texas Lee 16664 74.5 16.7 8.4
## 103 Texas Burnet 44144 74.1 14.2 11.7
## 104 Texas El Paso 831095 73.9 20.0 6.0
## 105 Texas Kendall 37361 73.9 15.7 10.4
## 106 Texas Lamb 13742 73.8 16.1 9.9
## 107 Texas Camp 12516 73.7 16.0 9.9
## 108 Texas Freestone 19586 73.6 19.0 7.3
## 109 Texas Medina 47392 73.6 17.3 9.0
## 110 Texas Washington 34236 73.6 18.1 8.3
## 111 Texas Hays 177562 73.5 19.4 7.0
## 112 Texas Live Oak 11873 73.5 14.2 12.1
## 113 Texas Lavaca 19549 73.4 14.4 12.1
## 114 Texas Zapata 14308 73.3 18.1 8.7
## 115 Texas Fannin 33748 73.1 19.8 7.1
## 116 Texas Young 18329 73.1 15.7 10.8
## 117 Texas Cherokee 51167 72.9 20.4 6.5
## 118 Texas Kerr 50149 72.8 16.7 10.4
## 119 Texas Cameron 417947 72.7 18.2 8.9
## 120 Texas Colorado 20757 72.6 17.3 9.0
## 121 Texas Clay 10479 72.4 17.0 10.4
## 122 Texas DeWitt 20540 72.4 17.7 9.6
## 123 Texas Franklin 10599 72.3 13.7 13.3
## 124 Texas Comanche 13623 72.2 15.0 12.3
## 125 Texas Howard 36105 72.0 23.0 4.9
## 126 Texas Llano 19323 72.0 11.4 16.4
## 127 Texas Nacogdoches 65531 72.0 20.0 7.9
## 128 Texas Wilson 45509 71.8 20.6 7.6
## 129 Texas Van Zandt 52736 71.6 16.8 11.1
## 130 Texas Bastrop 76948 71.3 20.1 8.3
## 131 Texas Callahan 13532 71.2 17.0 9.4
## 132 Texas Falls 17410 70.8 20.5 8.4
## 133 Texas Hidalgo 819217 70.7 17.2 11.9
## 134 Texas Bee 32659 70.5 22.7 6.7
## 135 Texas Dimmit 10682 70.5 21.9 7.6
## 136 Texas Blanco 10723 70.4 17.5 11.8
## 137 Texas Zavala 12060 70.4 23.8 5.6
## 138 Texas Terry 12687 70.2 19.7 9.8
## 139 Texas Anderson 57915 69.8 23.6 6.4
## 140 Texas Robertson 16532 69.8 22.4 7.6
## 141 Texas Karnes 14879 69.6 22.0 7.8
## 142 Texas Kleberg 32029 69.5 25.9 4.4
## 143 Texas Burleson 17293 69.4 21.7 8.4
## 144 Texas Leon 16819 69.4 14.5 15.7
## 145 Texas Bandera 20796 69.2 19.0 11.6
## 146 Texas Bell 326041 69.2 25.7 5.0
## 147 Texas San Jacinto 27023 69.1 23.0 7.6
## 148 Texas Maverick 56548 68.8 25.0 6.1
## 149 Texas Trinity 14405 68.8 17.8 12.1
## 150 Texas Jones 19978 68.7 20.5 10.3
## 151 Texas Polk 46113 68.7 22.2 8.7
## 152 Texas Pecos 15807 68.6 24.5 6.0
## 153 Texas Dawson 13542 68.5 21.9 9.5
## 154 Texas Milam 24344 68.5 21.3 9.8
## 155 Texas Red River 12567 68.5 17.4 13.8
## 156 Texas Brazos 205271 68.3 25.6 5.8
## 157 Texas Starr 62648 68.2 22.5 8.6
## 158 Texas Houston 22949 67.8 22.1 9.8
## 159 Texas Reeves 14179 67.8 28.3 3.9
## 160 Texas Runnels 10445 67.7 19.9 12.1
## 161 Texas Willacy 22002 66.9 23.9 9.1
## 162 Texas Uvalde 26952 66.8 23.8 9.4
## 163 Texas Wilbarger 13158 66.0 29.6 4.4
## 164 Texas Val Verde 48980 65.9 28.9 5.1
## 165 Texas Lampasas 20219 65.5 26.5 8.0
## 166 Texas Madison 13838 65.1 22.4 11.8
## 167 Texas Limestone 23454 61.3 28.6 10.1
## 168 Texas Coryell 76128 60.3 34.2 5.3
## 169 Texas Walker 69330 59.5 36.2 4.2
You’ve learned how to filter and sort a dataset to answer questions about the data. Notice that you only need to slightly modify your code if you are interested in sorting the observations by a different column.
In the video, you used the unemployment variable, which is a percentage, to calculate the number of unemployed people in each county. In this exercise, you’ll do the same with another percentage variable: public_work.
The code provided already selects the state, county, population, and public_work columns.
counties_selected <- counties %>%
select(state, county, population, public_work)
counties_selected %>%
mutate(public_workers = public_work * population / 100) %>%
arrange(desc(public_workers))
It looks like Los Angeles is the county with the most government employees.
The dataset includes columns for the total number (not percentage) of men and women in each county. You could use this, along with the population variable, to compute the fraction of men (or women) within each county.
In this exercise, you’ll select the relevant columns yourself.
counties_selected <- counties %>%
select(state, county, population, men, women)
counties_selected %>%
mutate(proportion_women = women / population)
Notice that the proportion_women variable was added as a column to the counties_selected dataset, and the data now has 6 columns instead of 5.
In this exercise, you’ll put together everything you’ve learned in this chapter (select(), mutate(), filter() and arrange()), to find the counties with the highest proportion of men.
counties %>%
# Select the five columns
select(state, county, population, men, women) %>%
# Add the proportion_men variable
mutate(proportion_men = men / population) %>%
# Filter for population of at least 10,000
filter(population >= 10000) %>%
# Arrange proportion of men in descending order
arrange(desc(proportion_men))
## state county population men women
## 1 Virginia Sussex 11864 8130 3734
## 2 California Lassen 32645 21818 10827
## 3 Georgia Chattahoochee 11914 7940 3974
## 4 Louisiana West Feliciana 15415 10228 5187
## 5 Florida Union 15191 9830 5361
## 6 Texas Jones 19978 12652 7326
## 7 Missouri DeKalb 12782 8080 4702
## 8 Texas Madison 13838 8648 5190
## 9 Virginia Greensville 11760 7303 4457
## 10 Texas Anderson 57915 35469 22446
## 11 Arkansas Lincoln 14062 8596 5466
## 12 Texas Bee 32659 19722 12937
## 13 Florida Hamilton 14395 8671 5724
## 14 Illinois Lawrence 16665 9997 6668
## 15 Mississippi Tallahatchie 14959 8907 6052
## 16 Mississippi Greene 14129 8409 5720
## 17 Texas Karnes 14879 8799 6080
## 18 Texas Frio 18168 10723 7445
## 19 Florida Gulf 15785 9235 6550
## 20 Florida Franklin 11628 6800 4828
## 21 Texas Walker 69330 40484 28846
## 22 Georgia Telfair 16416 9502 6914
## 23 Colorado Fremont 46809 27003 19806
## 24 Louisiana Allen 25653 14784 10869
## 25 Ohio Noble 14508 8359 6149
## 26 Georgia Tattnall 25302 14529 10773
## 27 Louisiana Claiborne 16639 9524 7115
## 28 Texas Pecos 15807 9003 6804
## 29 Missouri Pulaski 53443 30390 23053
## 30 Texas Howard 36105 20496 15609
## 31 Illinois Johnson 12829 7267 5562
## 32 Texas Dawson 13542 7670 5872
## 33 Florida DeSoto 34957 19756 15201
## 34 Texas Reeves 14179 8004 6175
## 35 Florida Taylor 22685 12781 9904
## 36 Tennessee Bledsoe 13686 7677 6009
## 37 Louisiana Grant 22362 12523 9839
## 38 Florida Wakulla 31128 17366 13762
## 39 Tennessee Morgan 21794 12150 9644
## 40 Kentucky Morgan 13428 7477 5951
## 41 Florida Bradford 27223 15150 12073
## 42 California Kings 150998 83958 67040
## 43 Kentucky Martin 12631 7020 5611
## 44 Virginia Buckingham 17068 9476 7592
## 45 California Del Norte 27788 15418 12370
## 46 Missouri Pike 18517 10273 8244
## 47 Florida Jackson 48900 27124 21776
## 48 Mississippi Yazoo 27911 15460 12451
## 49 Florida Glades 13272 7319 5953
## 50 New York Franklin 51280 28253 23027
## 51 Pennsylvania Union 44958 24755 20203
## 52 Tennessee Wayne 16897 9295 7602
## 53 Colorado Summit 28940 15912 13028
## 54 Texas Grimes 26961 14823 12138
## 55 Georgia Macon 14045 7720 6325
## 56 Ohio Madison 43456 23851 19605
## 57 Arkansas St. Francis 27345 14996 12349
## 58 Michigan Chippewa 38586 21156 17430
## 59 Kentucky McCreary 18001 9863 8138
## 60 Florida Washington 24629 13478 11151
## 61 Virginia Prince George 37380 20440 16940
## 62 Florida Calhoun 14615 7991 6624
## 63 North Carolina Avery 17695 9673 8022
## 64 Indiana Sullivan 21111 11540 9571
## 65 Oregon Malheur 30551 16697 13854
## 66 Wyoming Carbon 15739 8592 7147
## 67 Maryland Somerset 25980 14150 11830
## 68 Tennessee Hardeman 26253 14297 11956
## 69 Florida Dixie 16091 8746 7345
## 70 Florida Hardee 27468 14920 12548
## 71 Texas Willacy 22002 11947 10055
## 72 North Dakota Williams 29619 16066 13553
## 73 Colorado Logan 21928 11889 10039
## 74 Michigan Houghton 36660 19876 16784
## 75 Texas Tyler 21462 11629 9833
## 76 North Carolina Onslow 183753 99526 84227
## 77 New York Wyoming 41446 22442 19004
## 78 Illinois Randolph 33069 17902 15167
## 79 California Amador 36995 20012 16983
## 80 Texas Live Oak 11873 6421 5452
## 81 Michigan Gogebic 15824 8556 7268
## 82 Oklahoma Hughes 13785 7450 6335
## 83 Texas Scurry 17238 9312 7926
## 84 Georgia Dooly 14293 7717 6576
## 85 Wisconsin Adams 20451 11033 9418
## 86 Florida Okeechobee 39255 21176 18079
## 87 North Carolina Greene 21328 11499 9829
## 88 Missouri St. Francois 66010 35586 30424
## 89 Texas Terry 12687 6835 5852
## 90 Colorado Gunnison 15651 8427 7224
## 91 Missouri Mississippi 14208 7650 6558
## 92 Alabama Barbour 26932 14497 12435
## 93 Indiana Miami 36211 19482 16729
## 94 South Carolina Edgefield 26466 14234 12232
## 95 Oklahoma Okfuskee 12248 6585 5663
## 96 Virginia Powhatan 28207 15159 13048
## 97 Louisiana East Feliciana 19855 10666 9189
## 98 Oklahoma Beckham 23300 12515 10785
## 99 Texas Polk 46113 24757 21356
## 100 Wisconsin Jackson 20543 11022 9521
## 101 Alaska Fairbanks North Star Borough 99705 53477 46228
## 102 Indiana Perry 19414 10407 9007
## 103 Arizona Graham 37407 20049 17358
## 104 Kentucky Clay 21300 11415 9885
## 105 Tennessee Johnson 18017 9643 8374
## 106 Colorado Chaffee 18309 9798 8511
## 107 Louisiana Catahoula 10247 5483 4764
## 108 Georgia Charlton 13130 7024 6106
## 109 Florida Holmes 19635 10501 9134
## 110 Alaska Kodiak Island Borough 13973 7468 6505
## 111 Kansas Leavenworth 78227 41806 36421
## 112 Alabama Bibb 22604 12073 10531
## 113 Minnesota Pine 29218 15605 13613
## 114 Colorado Grand 14411 7689 6722
## 115 South Carolina Marlboro 27993 14930 13063
## 116 Kansas Riley 75022 40002 35020
## 117 Texas Gray 22983 12253 10730
## 118 Texas Houston 22949 12227 10722
## 119 Oklahoma Woodward 20986 11172 9814
## 120 Ohio Marion 65943 35022 30921
## 121 Georgia Butts 23445 12451 10994
## 122 Pennsylvania Wayne 51642 27420 24222
## 123 Illinois Perry 21810 11572 10238
## 124 Michigan Gratiot 41878 22213 19665
## 125 Colorado Eagle 52576 27887 24689
## 126 Indiana Putnam 37650 19967 17683
## 127 Missouri Cooper 17593 9328 8265
## 128 Colorado Pitkin 17420 9235 8185
## 129 Wyoming Goshen 13544 7180 6364
## 130 Alabama Bullock 10678 5660 5018
## 131 Florida Jefferson 14198 7522 6676
## 132 Florida Hendry 38363 20318 18045
## 133 Virginia Nottoway 15711 8320 7391
## 134 Georgia Dodge 21180 11215 9965
## 135 Massachusetts Nantucket 10556 5589 4967
## 136 Michigan Ionia 64064 33917 30147
## 137 Wyoming Sublette 10117 5353 4764
## 138 Wisconsin Juneau 26494 14018 12476
## 139 Florida Monroe 75901 40159 35742
## 140 Pennsylvania Huntingdon 45906 24284 21622
## 141 Illinois Lee 35027 18524 16503
## 142 Louisiana Winn 14855 7851 7004
## 143 Ohio Pickaway 56515 29866 26649
## 144 Kentucky Oldham 63037 33307 29730
##  ... (rows 145 through 1659 of the state/county population output not shown) ...
## 1460 Nebraska Cheyenne 10077 4960 5117
## 1461 New York Madison 72427 35647 36780
## 1462 Kansas Franklin 25753 12675 13078
## 1463 Alabama Winston 24130 11876 12254
## 1464 Michigan Bay 106698 52513 54185
## 1465 Oregon Washington 556210 273743 282467
## 1466 Pennsylvania Mifflin 46675 22971 23704
## 1467 West Virginia Berkeley 108724 53507 55217
## 1468 Illinois Rock Island 147161 72422 74739
## 1469 New Hampshire Merrimack 147262 72470 74792
## 1470 Oregon Clackamas 389438 191639 197799
## 1471 Indiana Johnson 145645 71670 73975
## 1472 Texas Wharton 41264 20305 20959
## 1473 Kentucky Montgomery 27167 13368 13799
## 1474 Illinois Whiteside 57525 28306 29219
## 1475 Texas Denton 731851 360112 371739
## 1476 Florida Collier 341091 167836 173255
## 1477 California San Mateo 748731 368416 380315
## 1478 Alabama Marshall 94318 46409 47909
## 1479 Missouri Bates 16643 8189 8454
## 1480 Indiana Porter 166570 81958 84612
## 1481 Iowa Woodbury 102530 50448 52082
## 1482 Georgia Cherokee 225944 111170 114774
## 1483 Florida Osceola 300870 148030 152840
## 1484 New Jersey Burlington 450556 221673 228883
## 1485 Kentucky Marshall 31181 15341 15840
## 1486 Pennsylvania Luzerne 320095 157486 162609
## 1487 Iowa Winnebago 10614 5222 5392
## 1488 Tennessee Cannon 13787 6783 7004
## 1489 Pennsylvania Mercer 115320 56734 58586
## 1490 Kentucky Allen 20355 10014 10341
## 1491 Louisiana St. Martin 53126 26136 26990
## 1492 Alabama Lee 150982 74277 76705
## 1493 Texas Nueces 352060 173195 178865
##  ... (rows 1494 through 2437 omitted for brevity)
## proportion_men
## 1 0.6852664
## 2 0.6683412
## 3 0.6664428
## 4 0.6635096
## 5 0.6470937
## 6 0.6332966
##  ...
## 1290 0.4938575
## 1291 0.4938483
## 1292 0.4938426
## 1293 0.4938123
## 1294 0.4938039
## 1295 0.4937963
## 1296 0.4937638
## 1297 0.4937635
## 1298 0.4937623
## 1299 0.4937610
## 1300 0.4937529
## 1301 0.4937480
## 1302 0.4937445
## 1303 0.4937407
## 1304 0.4937405
## 1305 0.4937404
## 1306 0.4937377
## 1307 0.4937246
## 1308 0.4937213
## 1309 0.4937115
## 1310 0.4937086
## 1311 0.4936997
## 1312 0.4936988
## 1313 0.4936951
## 1314 0.4936782
## 1315 0.4936762
## 1316 0.4936752
## 1317 0.4936734
## 1318 0.4936627
## 1319 0.4936617
## 1320 0.4936343
## 1321 0.4936119
## 1322 0.4936056
## 1323 0.4935959
## 1324 0.4935892
## 1325 0.4935835
## 1326 0.4935792
## 1327 0.4935624
## 1328 0.4935482
## 1329 0.4935118
## 1330 0.4935115
## 1331 0.4935100
## 1332 0.4935061
## 1333 0.4934920
## 1334 0.4934606
## 1335 0.4934566
## 1336 0.4934537
## 1337 0.4934533
## 1338 0.4934493
## 1339 0.4934487
## 1340 0.4934482
## 1341 0.4934331
## 1342 0.4934328
## 1343 0.4934312
## 1344 0.4934140
## 1345 0.4933930
## 1346 0.4933906
## 1347 0.4933705
## 1348 0.4933658
## 1349 0.4933472
## 1350 0.4933355
## 1351 0.4933346
## 1352 0.4933031
## 1353 0.4932599
## 1354 0.4932477
## 1355 0.4932422
## 1356 0.4932393
## 1357 0.4932346
## 1358 0.4932227
## 1359 0.4932149
## 1360 0.4932091
## 1361 0.4932005
## 1362 0.4931851
## 1363 0.4931806
## 1364 0.4931703
## 1365 0.4931671
## 1366 0.4931659
## 1367 0.4931434
## 1368 0.4931287
## 1369 0.4931192
## 1370 0.4931084
## 1371 0.4930766
## 1372 0.4930755
## 1373 0.4930695
## 1374 0.4930635
## 1375 0.4930402
## 1376 0.4930198
## 1377 0.4929945
## 1378 0.4929904
## 1379 0.4929736
## 1380 0.4929701
## 1381 0.4929499
## 1382 0.4929407
## 1383 0.4929295
## 1384 0.4929294
## 1385 0.4929273
## 1386 0.4929248
## 1387 0.4929192
## 1388 0.4929088
## 1389 0.4929061
## 1390 0.4929053
## 1391 0.4928940
## 1392 0.4928925
## 1393 0.4928858
## 1394 0.4928562
## 1395 0.4928556
## 1396 0.4928459
## 1397 0.4928402
## 1398 0.4928274
## 1399 0.4928268
## 1400 0.4928057
## 1401 0.4928037
## 1402 0.4928007
## 1403 0.4927975
## 1404 0.4927858
## 1405 0.4927749
## 1406 0.4927657
## 1407 0.4927637
## 1408 0.4927311
## 1409 0.4927273
## 1410 0.4927218
## 1411 0.4927193
## 1412 0.4927097
## 1413 0.4926648
## 1414 0.4926439
## 1415 0.4926391
## 1416 0.4926270
## 1417 0.4926250
## 1418 0.4925900
## 1419 0.4925896
## 1420 0.4925669
## 1421 0.4925478
## 1422 0.4925470
## 1423 0.4925248
## 1424 0.4925056
## 1425 0.4925021
## 1426 0.4924901
## 1427 0.4924899
## 1428 0.4924836
## 1429 0.4924784
## 1430 0.4924717
## 1431 0.4924652
## 1432 0.4924557
## 1433 0.4924485
## 1434 0.4924482
## 1435 0.4924437
## 1436 0.4924210
## 1437 0.4924199
## 1438 0.4924188
## 1439 0.4924152
## 1440 0.4923967
## 1441 0.4923949
## 1442 0.4923935
## 1443 0.4923878
## 1444 0.4923791
## 1445 0.4923746
## 1446 0.4923560
## 1447 0.4923482
## 1448 0.4923356
## 1449 0.4923103
## 1450 0.4923100
## 1451 0.4923030
## 1452 0.4922965
## 1453 0.4922815
## 1454 0.4922729
## 1455 0.4922558
## 1456 0.4922393
## 1457 0.4922377
## 1458 0.4922339
## 1459 0.4922172
## 1460 0.4922100
## 1461 0.4921783
## 1462 0.4921757
## 1463 0.4921674
## 1464 0.4921648
## 1465 0.4921576
## 1466 0.4921478
## 1467 0.4921361
## 1468 0.4921277
## 1469 0.4921161
## 1470 0.4920912
## 1471 0.4920869
## 1472 0.4920754
## 1473 0.4920676
## 1474 0.4920643
## 1475 0.4920564
## 1476 0.4920564
## 1477 0.4920539
## 1478 0.4920482
## 1479 0.4920387
## 1480 0.4920334
## 1481 0.4920316
## 1482 0.4920246
## 1483 0.4920065
## 1484 0.4919988
## 1485 0.4919983
## 1486 0.4919977
## 1487 0.4919917
## 1488 0.4919852
## 1489 0.4919702
## 1490 0.4919676
## 1491 0.4919625
## 1492 0.4919593
## 1493 0.4919474
## 1494 0.4919013
## 1495 0.4918911
## 1496 0.4918887
## 1497 0.4918836
## 1498 0.4918439
## 1499 0.4918433
## 1500 0.4918391
## 1501 0.4918322
## 1502 0.4918275
## 1503 0.4918205
## 1504 0.4918108
## 1505 0.4918063
## 1506 0.4918043
## 1507 0.4918039
## 1508 0.4917676
## 1509 0.4917497
## 1510 0.4917455
## 1511 0.4917336
## 1512 0.4917335
## 1513 0.4917306
## 1514 0.4917289
## 1515 0.4917260
## 1516 0.4917197
## 1517 0.4917127
## 1518 0.4917104
## 1519 0.4917075
## 1520 0.4916950
## 1521 0.4916752
## 1522 0.4916651
## 1523 0.4916124
## 1524 0.4915808
## 1525 0.4915692
## 1526 0.4915692
## 1527 0.4915657
## 1528 0.4915420
## 1529 0.4915265
## 1530 0.4915022
## 1531 0.4915015
## 1532 0.4914702
## 1533 0.4914563
## 1534 0.4914413
## 1535 0.4914357
## 1536 0.4914300
## 1537 0.4914259
## 1538 0.4914173
## 1539 0.4914030
## 1540 0.4913996
## 1541 0.4913969
## 1542 0.4913916
## 1543 0.4913835
## 1544 0.4913787
## 1545 0.4913672
## 1546 0.4913605
## 1547 0.4913580
## 1548 0.4913455
## 1549 0.4913197
## 1550 0.4913173
## 1551 0.4913164
## 1552 0.4913115
## 1553 0.4913095
## 1554 0.4912997
## 1555 0.4912928
## 1556 0.4912922
## 1557 0.4912900
## 1558 0.4912793
## 1559 0.4912619
## 1560 0.4912549
## 1561 0.4912541
## 1562 0.4912513
## 1563 0.4912459
## 1564 0.4912358
## 1565 0.4912086
## 1566 0.4912029
## 1567 0.4912027
## 1568 0.4911923
## 1569 0.4911301
## 1570 0.4911253
## 1571 0.4911243
## 1572 0.4911192
## 1573 0.4911157
## 1574 0.4911022
## 1575 0.4911012
## 1576 0.4910999
## 1577 0.4910872
## 1578 0.4910731
## 1579 0.4910629
## 1580 0.4910530
## 1581 0.4910476
## 1582 0.4910125
## 1583 0.4910036
## 1584 0.4909962
## 1585 0.4909923
## 1586 0.4909878
## 1587 0.4909747
## 1588 0.4909562
## 1589 0.4909468
## 1590 0.4909436
## 1591 0.4909055
## 1592 0.4908984
## 1593 0.4908972
## 1594 0.4908836
## 1595 0.4908708
## 1596 0.4908575
## 1597 0.4908508
## 1598 0.4908462
## 1599 0.4908445
## 1600 0.4908278
## 1601 0.4908272
## 1602 0.4908232
## 1603 0.4908151
## 1604 0.4908106
## 1605 0.4907991
## 1606 0.4907979
## 1607 0.4907703
## 1608 0.4907556
## 1609 0.4907479
## 1610 0.4907223
## 1611 0.4907009
## 1612 0.4906741
## 1613 0.4906554
## 1614 0.4906548
## 1615 0.4906418
## 1616 0.4906403
## 1617 0.4906399
## 1618 0.4906087
## 1619 0.4906013
## 1620 0.4905953
## 1621 0.4905883
## 1622 0.4905334
## 1623 0.4905304
## 1624 0.4905299
## 1625 0.4905194
## 1626 0.4905133
## 1627 0.4905128
## 1628 0.4905034
## 1629 0.4904736
## 1630 0.4904707
## 1631 0.4904675
## 1632 0.4904625
## 1633 0.4904621
## 1634 0.4904368
## 1635 0.4904310
## 1636 0.4904210
## 1637 0.4904166
## 1638 0.4904000
## 1639 0.4903854
## 1640 0.4903543
## 1641 0.4903520
## 1642 0.4903447
## 1643 0.4903415
## 1644 0.4903388
## 1645 0.4903330
## 1646 0.4903000
## 1647 0.4902957
## 1648 0.4902847
## 1649 0.4902350
## 1650 0.4902141
## 1651 0.4902021
## 1652 0.4902013
## 1653 0.4901623
## 1654 0.4901601
## 1655 0.4901520
## 1656 0.4901398
## 1657 0.4901266
## 1658 0.4901191
## 1659 0.4901190
## 1660 0.4901072
## 1661 0.4900995
## 1662 0.4900832
## 1663 0.4900804
## 1664 0.4900717
## 1665 0.4900608
## 1666 0.4900511
## 1667 0.4900504
## 1668 0.4900387
## 1669 0.4900362
## 1670 0.4900341
## 1671 0.4900196
## 1672 0.4900130
## 1673 0.4900117
## 1674 0.4899836
## 1675 0.4899356
## 1676 0.4899324
## 1677 0.4899311
## 1678 0.4899195
## 1679 0.4899134
## 1680 0.4899062
## 1681 0.4898934
## 1682 0.4898592
## 1683 0.4898334
## 1684 0.4898246
## 1685 0.4898221
## 1686 0.4898136
## 1687 0.4898109
## 1688 0.4898047
## 1689 0.4898024
## 1690 0.4897776
## 1691 0.4897775
## 1692 0.4897664
## 1693 0.4897628
## 1694 0.4897619
## 1695 0.4897466
## 1696 0.4897386
## 1697 0.4897370
## 1698 0.4897314
## 1699 0.4897018
## 1700 0.4896879
## 1701 0.4896842
## 1702 0.4896812
## 1703 0.4896761
## 1704 0.4896680
## 1705 0.4896397
## 1706 0.4896326
## 1707 0.4896273
## 1708 0.4896236
## 1709 0.4895823
## 1710 0.4895642
## 1711 0.4895635
## 1712 0.4895551
## 1713 0.4895486
## 1714 0.4895459
## 1715 0.4895347
## 1716 0.4895262
## 1717 0.4895255
## 1718 0.4895012
## 1719 0.4894990
## 1720 0.4894980
## 1721 0.4894929
## 1722 0.4894662
## 1723 0.4894524
## 1724 0.4894411
## 1725 0.4894237
## 1726 0.4894216
## 1727 0.4894040
## 1728 0.4893971
## 1729 0.4893617
## 1730 0.4893590
## 1731 0.4893514
## 1732 0.4893358
## 1733 0.4893285
## 1734 0.4893277
## 1735 0.4893154
## 1736 0.4892961
## 1737 0.4892815
## 1738 0.4892787
## 1739 0.4892780
## 1740 0.4892763
## 1741 0.4892270
## 1742 0.4892218
## 1743 0.4892214
## 1744 0.4892184
## 1745 0.4892154
## 1746 0.4892096
## 1747 0.4892081
## 1748 0.4891989
## 1749 0.4891706
## 1750 0.4891317
## 1751 0.4891245
## 1752 0.4891137
## 1753 0.4890859
## 1754 0.4890706
## 1755 0.4890500
## 1756 0.4890399
## 1757 0.4890270
## 1758 0.4890244
## 1759 0.4890146
## 1760 0.4889964
## 1761 0.4889875
## 1762 0.4889623
## 1763 0.4889354
## 1764 0.4889312
## 1765 0.4889250
## 1766 0.4889240
## 1767 0.4888935
## 1768 0.4888682
## 1769 0.4888663
## 1770 0.4888608
## 1771 0.4888504
## 1772 0.4888413
## 1773 0.4888239
## 1774 0.4888133
## 1775 0.4888079
## 1776 0.4887613
## 1777 0.4887514
## 1778 0.4887418
## 1779 0.4887405
## 1780 0.4887398
## 1781 0.4887380
## 1782 0.4887132
## 1783 0.4887089
## 1784 0.4887074
## 1785 0.4886883
## 1786 0.4886529
## 1787 0.4886520
## 1788 0.4886349
## 1789 0.4886268
## 1790 0.4886255
## 1791 0.4886181
## 1792 0.4886060
## 1793 0.4885993
## 1794 0.4885725
## 1795 0.4885045
## 1796 0.4885006
## 1797 0.4884921
## 1798 0.4884910
## 1799 0.4884866
## 1800 0.4884866
## 1801 0.4884833
## 1802 0.4884797
## 1803 0.4884724
## 1804 0.4884671
## 1805 0.4884656
## 1806 0.4884626
## 1807 0.4884507
## 1808 0.4884368
## 1809 0.4884354
## 1810 0.4884189
## 1811 0.4883949
## 1812 0.4883695
## 1813 0.4883476
## 1814 0.4883039
## 1815 0.4882897
## 1816 0.4882569
## 1817 0.4882561
## 1818 0.4882122
## 1819 0.4882062
## 1820 0.4881867
## 1821 0.4881520
## 1822 0.4881514
## 1823 0.4881438
## 1824 0.4881426
## 1825 0.4881349
## 1826 0.4881322
## 1827 0.4881301
## 1828 0.4881239
## 1829 0.4881232
## 1830 0.4881010
## 1831 0.4880479
## 1832 0.4880306
## 1833 0.4880106
## 1834 0.4880059
## 1835 0.4880047
## 1836 0.4880029
## 1837 0.4879696
## 1838 0.4879573
## 1839 0.4879539
## 1840 0.4879465
## 1841 0.4879033
## 1842 0.4878991
## 1843 0.4878714
## 1844 0.4878680
## 1845 0.4878654
## 1846 0.4878497
## 1847 0.4878443
## 1848 0.4878243
## 1849 0.4878237
## 1850 0.4877756
## 1851 0.4877667
## 1852 0.4877599
## 1853 0.4877556
## 1854 0.4877514
## 1855 0.4877497
## 1856 0.4877495
## 1857 0.4877388
## 1858 0.4877378
## 1859 0.4877208
## 1860 0.4877102
## 1861 0.4876945
## 1862 0.4876565
## 1863 0.4876368
## 1864 0.4876362
## 1865 0.4876163
## 1866 0.4875849
## 1867 0.4875802
## 1868 0.4875800
## 1869 0.4875761
## 1870 0.4875479
## 1871 0.4875298
## 1872 0.4875132
## 1873 0.4874857
## 1874 0.4874706
## 1875 0.4874585
## 1876 0.4874401
## 1877 0.4873936
## 1878 0.4873766
## 1879 0.4873686
## 1880 0.4873671
## 1881 0.4873607
## 1882 0.4873558
## 1883 0.4873372
## 1884 0.4873361
## 1885 0.4873118
## 1886 0.4872867
## 1887 0.4872774
## 1888 0.4872270
## 1889 0.4872201
## 1890 0.4871982
## 1891 0.4871958
## 1892 0.4871764
## 1893 0.4871665
## 1894 0.4871243
## 1895 0.4871126
## 1896 0.4870881
## 1897 0.4870732
## 1898 0.4870640
## 1899 0.4870503
## 1900 0.4870482
## 1901 0.4870450
## 1902 0.4870271
## 1903 0.4870217
## 1904 0.4869941
## 1905 0.4869874
## 1906 0.4869729
## 1907 0.4869193
## 1908 0.4868907
## 1909 0.4868895
## 1910 0.4868748
## 1911 0.4868657
## 1912 0.4868535
## 1913 0.4868436
## 1914 0.4868353
## 1915 0.4868313
## 1916 0.4867793
## 1917 0.4867781
## 1918 0.4867615
## 1919 0.4867403
## 1920 0.4867205
## 1921 0.4866979
## 1922 0.4866872
## 1923 0.4866798
## 1924 0.4866757
## 1925 0.4866613
## 1926 0.4866507
## 1927 0.4866416
## 1928 0.4866370
## 1929 0.4866267
## 1930 0.4866101
## 1931 0.4866062
## 1932 0.4866032
## 1933 0.4865821
## 1934 0.4865687
## 1935 0.4865389
## 1936 0.4865317
## 1937 0.4865030
## 1938 0.4864694
## 1939 0.4864541
## 1940 0.4864488
## 1941 0.4864359
## 1942 0.4864065
## 1943 0.4863710
## 1944 0.4863652
## 1945 0.4863517
## 1946 0.4863184
## 1947 0.4863099
## 1948 0.4862876
## 1949 0.4862774
## 1950 0.4862590
## 1951 0.4862473
## 1952 0.4862175
## 1953 0.4862097
## 1954 0.4862076
## 1955 0.4862067
## 1956 0.4861778
## 1957 0.4861698
## 1958 0.4861636
## 1959 0.4861603
## 1960 0.4861480
## 1961 0.4861311
## 1962 0.4861077
## 1963 0.4861000
## 1964 0.4860965
## 1965 0.4860917
## 1966 0.4860719
## 1967 0.4860129
## 1968 0.4860118
## 1969 0.4859841
## 1970 0.4859841
## 1971 0.4859839
## 1972 0.4859519
## 1973 0.4859403
## 1974 0.4859377
## 1975 0.4859360
## 1976 0.4859212
## 1977 0.4859096
## 1978 0.4859064
## 1979 0.4859048
## 1980 0.4859034
## 1981 0.4858979
## 1982 0.4858904
## 1983 0.4858653
## 1984 0.4858603
## 1985 0.4857728
## 1986 0.4857350
## 1987 0.4856794
## 1988 0.4856747
## 1989 0.4856720
## 1990 0.4856555
## 1991 0.4856409
## 1992 0.4856369
## 1993 0.4856236
## 1994 0.4856126
## 1995 0.4856120
## 1996 0.4856073
## 1997 0.4855697
## 1998 0.4855654
## 1999 0.4855602
## 2000 0.4855532
## 2001 0.4855515
## 2002 0.4855504
## 2003 0.4854965
## 2004 0.4854857
## 2005 0.4854840
## 2006 0.4854836
## 2007 0.4854514
## 2008 0.4854449
## 2009 0.4854263
## 2010 0.4853983
## 2011 0.4853960
## 2012 0.4853853
## 2013 0.4853529
## 2014 0.4853316
## 2015 0.4852611
## 2016 0.4852605
## 2017 0.4852559
## 2018 0.4852360
## 2019 0.4852334
## 2020 0.4852055
## 2021 0.4852025
## 2022 0.4851930
## 2023 0.4851625
## 2024 0.4851512
## 2025 0.4851325
## 2026 0.4851263
## 2027 0.4851082
## 2028 0.4851072
## 2029 0.4850903
## 2030 0.4850736
## 2031 0.4850192
## 2032 0.4850010
## 2033 0.4849638
## 2034 0.4849510
## 2035 0.4849360
## 2036 0.4848384
## 2037 0.4848290
## 2038 0.4848163
## 2039 0.4847694
## 2040 0.4847581
## 2041 0.4847565
## 2042 0.4847421
## 2043 0.4847401
## 2044 0.4847398
## 2045 0.4847256
## 2046 0.4846116
## 2047 0.4845635
## 2048 0.4845406
## 2049 0.4845306
## 2050 0.4844990
## 2051 0.4844627
## 2052 0.4844554
## 2053 0.4844518
## 2054 0.4844478
## 2055 0.4844359
## 2056 0.4844308
## 2057 0.4844114
## 2058 0.4843937
## 2059 0.4843932
## 2060 0.4843915
## 2061 0.4843833
## 2062 0.4843750
## 2063 0.4843555
## 2064 0.4843266
## 2065 0.4843243
## 2066 0.4842855
## 2067 0.4842716
## 2068 0.4842443
## 2069 0.4842206
## 2070 0.4842205
## 2071 0.4842039
## 2072 0.4841678
## 2073 0.4841663
## 2074 0.4841661
## 2075 0.4841636
## 2076 0.4841051
## 2077 0.4841015
## 2078 0.4840992
## 2079 0.4840877
## 2080 0.4840793
## 2081 0.4840574
## 2082 0.4840470
## 2083 0.4840322
## 2084 0.4840226
## 2085 0.4840072
## 2086 0.4840046
## 2087 0.4839953
## 2088 0.4839887
## 2089 0.4839662
## 2090 0.4839575
## 2091 0.4839512
## 2092 0.4839469
## 2093 0.4839178
## 2094 0.4839062
## 2095 0.4838826
## 2096 0.4838483
## 2097 0.4838214
## 2098 0.4837958
## 2099 0.4837908
## 2100 0.4837640
## 2101 0.4837466
## 2102 0.4837436
## 2103 0.4837108
## 2104 0.4837076
## 2105 0.4837040
## 2106 0.4836900
## 2107 0.4836818
## 2108 0.4836443
## 2109 0.4836167
## 2110 0.4835515
## 2111 0.4835220
## 2112 0.4835002
## 2113 0.4834494
## 2114 0.4834451
## 2115 0.4834275
## 2116 0.4834159
## 2117 0.4834076
## 2118 0.4834026
## 2119 0.4833758
## 2120 0.4833649
## 2121 0.4833510
## 2122 0.4832992
## 2123 0.4832756
## 2124 0.4832110
## 2125 0.4831785
## 2126 0.4831464
## 2127 0.4831335
## 2128 0.4831203
## 2129 0.4831095
## 2130 0.4830894
## 2131 0.4829798
## 2132 0.4829722
## 2133 0.4829683
## 2134 0.4829423
## 2135 0.4828986
## 2136 0.4828502
## 2137 0.4828000
## 2138 0.4827994
## 2139 0.4827755
## 2140 0.4827741
## 2141 0.4827706
## 2142 0.4827393
## 2143 0.4827242
## 2144 0.4827127
## 2145 0.4826977
## 2146 0.4826914
## 2147 0.4826852
## 2148 0.4826780
## 2149 0.4826778
## 2150 0.4826513
## 2151 0.4826347
## 2152 0.4825893
## 2153 0.4825838
## 2154 0.4825545
## 2155 0.4825410
## 2156 0.4825260
## 2157 0.4825109
## 2158 0.4825109
## 2159 0.4824853
## 2160 0.4824783
## 2161 0.4824719
## 2162 0.4824717
## 2163 0.4824423
## 2164 0.4824258
## 2165 0.4824188
## 2166 0.4823580
## 2167 0.4823171
## 2168 0.4823151
## 2169 0.4823125
## 2170 0.4822860
## 2171 0.4822812
## 2172 0.4822769
## 2173 0.4822765
## 2174 0.4822255
## 2175 0.4821887
## 2176 0.4821754
## 2177 0.4821233
## 2178 0.4821127
## 2179 0.4821009
## 2180 0.4820787
## 2181 0.4820747
## 2182 0.4820523
## 2183 0.4820508
## 2184 0.4820373
## 2185 0.4820326
## 2186 0.4819289
## 2187 0.4818689
## 2188 0.4817392
## 2189 0.4817094
## 2190 0.4816555
## 2191 0.4816387
## 2192 0.4815704
## 2193 0.4814970
## 2194 0.4814945
## 2195 0.4814940
## 2196 0.4814631
## 2197 0.4814408
## 2198 0.4814203
## 2199 0.4814184
## 2200 0.4813859
## 2201 0.4813674
## 2202 0.4813462
## 2203 0.4813271
## 2204 0.4812683
## 2205 0.4812014
## 2206 0.4811923
## 2207 0.4811775
## 2208 0.4811276
## 2209 0.4811109
## 2210 0.4810994
## 2211 0.4810908
## 2212 0.4809451
## 2213 0.4809247
## 2214 0.4808876
## 2215 0.4808443
## 2216 0.4807926
## 2217 0.4807867
## 2218 0.4807531
## 2219 0.4806891
## 2220 0.4806813
## 2221 0.4806784
## 2222 0.4806697
## 2223 0.4805747
## 2224 0.4805512
## 2225 0.4804743
## 2226 0.4804598
## 2227 0.4803986
## 2228 0.4803462
## 2229 0.4803154
## 2230 0.4802894
## 2231 0.4802811
## 2232 0.4802764
## 2233 0.4802358
## 2234 0.4801222
## 2235 0.4800989
## 2236 0.4800793
## 2237 0.4800622
## 2238 0.4800297
## 2239 0.4799911
## 2240 0.4799908
## 2241 0.4799851
## 2242 0.4799373
## 2243 0.4798294
## 2244 0.4797716
## 2245 0.4797682
## 2246 0.4797631
## 2247 0.4797391
## 2248 0.4797124
## 2249 0.4796705
## 2250 0.4796664
## 2251 0.4796618
## 2252 0.4796039
## 2253 0.4795888
## 2254 0.4795835
## 2255 0.4794960
## 2256 0.4793986
## 2257 0.4793831
## 2258 0.4793292
## 2259 0.4793290
## 2260 0.4793103
## 2261 0.4792999
## 2262 0.4792981
## 2263 0.4791218
## 2264 0.4790184
## 2265 0.4790163
## 2266 0.4790048
## 2267 0.4790012
## 2268 0.4789782
## 2269 0.4789600
## 2270 0.4789310
## 2271 0.4788675
## 2272 0.4788405
## 2273 0.4788377
## 2274 0.4788267
## 2275 0.4787917
## 2276 0.4787771
## 2277 0.4787698
## 2278 0.4787271
## 2279 0.4787123
## 2280 0.4786923
## 2281 0.4786254
## 2282 0.4786221
## 2283 0.4785430
## 2284 0.4785047
## 2285 0.4785031
## 2286 0.4784925
## 2287 0.4783300
## 2288 0.4782714
## 2289 0.4782364
## 2290 0.4782140
## 2291 0.4781421
## 2292 0.4781314
## 2293 0.4780045
## 2294 0.4777156
## 2295 0.4777136
## 2296 0.4777136
## 2297 0.4776968
## 2298 0.4776119
## 2299 0.4775670
## 2300 0.4774845
## 2301 0.4774747
## 2302 0.4773331
## 2303 0.4771167
## 2304 0.4770998
## 2305 0.4770992
## 2306 0.4770924
## 2307 0.4770701
## 2308 0.4770680
## 2309 0.4768757
## 2310 0.4768663
## 2311 0.4768446
## 2312 0.4768206
## 2313 0.4767898
## 2314 0.4767484
## 2315 0.4767376
## 2316 0.4766363
## 2317 0.4765423
## 2318 0.4765363
## 2319 0.4765076
## 2320 0.4764464
## 2321 0.4763594
## 2322 0.4763396
## 2323 0.4761395
## 2324 0.4761254
## 2325 0.4761014
## 2326 0.4760906
## 2327 0.4760885
## 2328 0.4760716
## 2329 0.4760449
## 2330 0.4760255
## 2331 0.4760070
## 2332 0.4758927
## 2333 0.4758770
## 2334 0.4758601
## 2335 0.4758322
## 2336 0.4757809
## 2337 0.4756723
## 2338 0.4756186
## 2339 0.4756112
## 2340 0.4755935
## 2341 0.4755621
## 2342 0.4754968
## 2343 0.4753068
## 2344 0.4752886
## 2345 0.4752739
## 2346 0.4752347
## 2347 0.4751499
## 2348 0.4750801
## 2349 0.4749712
## 2350 0.4746951
## 2351 0.4746187
## 2352 0.4742925
## 2353 0.4741859
## 2354 0.4740891
## 2355 0.4740748
## 2356 0.4738935
## 2357 0.4738310
## 2358 0.4737851
## 2359 0.4737359
## 2360 0.4736340
## 2361 0.4735562
## 2362 0.4735497
## 2363 0.4735216
## 2364 0.4734928
## 2365 0.4734541
## 2366 0.4733735
## 2367 0.4733606
## 2368 0.4733208
## 2369 0.4732941
## 2370 0.4731925
## 2371 0.4730473
## 2372 0.4730137
## 2373 0.4727901
## 2374 0.4727867
## 2375 0.4726492
## 2376 0.4724684
## 2377 0.4723389
## 2378 0.4721882
## 2379 0.4720383
## 2380 0.4718814
## 2381 0.4718577
## 2382 0.4718409
## 2383 0.4716337
## 2384 0.4715985
## 2385 0.4714063
## 2386 0.4713055
## 2387 0.4712029
## 2388 0.4709909
## 2389 0.4709130
## 2390 0.4708419
## 2391 0.4708311
## 2392 0.4707836
## 2393 0.4707714
## 2394 0.4706128
## 2395 0.4705342
## 2396 0.4697792
## 2397 0.4696033
## 2398 0.4692183
## 2399 0.4689411
## 2400 0.4687771
## 2401 0.4686669
## 2402 0.4683205
## 2403 0.4683159
## 2404 0.4682157
## 2405 0.4680899
## 2406 0.4676354
## 2407 0.4673140
## 2408 0.4668370
## 2409 0.4667699
## 2410 0.4666180
## 2411 0.4656530
## 2412 0.4655330
## 2413 0.4654782
## 2414 0.4654592
## 2415 0.4652596
## 2416 0.4648587
## 2417 0.4647793
## 2418 0.4645829
## 2419 0.4637693
## 2420 0.4633393
## 2421 0.4630872
## 2422 0.4625594
## 2423 0.4625406
## 2424 0.4614034
## 2425 0.4600568
## 2426 0.4598886
## 2427 0.4594149
## 2428 0.4583686
## 2429 0.4578879
## 2430 0.4572674
## 2431 0.4566598
## 2432 0.4549509
## 2433 0.4517763
## 2434 0.4517419
## 2435 0.4512033
## 2436 0.4489315
## 2437 0.4426205
## 2438 0.4198447
Notice Sussex County in Virginia is more than two thirds male: this is because of two men’s prisons in the county.
Now that you know how to transform your data, you’ll want to know more about how to aggregate your data to make it more interpretable. You’ll learn a number of functions you can use to take many observations in your data and summarize them, including count, group_by, summarize, ungroup, and top_n.
The counties dataset contains columns for region, state, population, and the number of citizens, which we selected and saved as the counties_selected table. In this exercise, you’ll focus on the region column.
counties_selected <- counties %>%
select(region, state, population, citizens)
counties_selected %>%
count(region, sort = TRUE)
## # A tibble: 4 x 2
## region n
## <fct> <int>
## 1 South 1420
## 2 North Central 1055
## 3 West 448
## 4 Northeast 218
Since the results have been arranged, you can see that the South has the greatest number of counties.
You can weight your count by particular variables rather than simply counting the number of counties. In this case, you’ll find the number of citizens in each state.
counties_selected <- counties %>%
select(region, state, population, citizens)
counties_selected %>%
count(state, wt = citizens, sort = TRUE)
## # A tibble: 50 x 2
## state n
## <fct> <int>
## 1 California 24280349
## 2 Texas 16864962
## 3 Florida 13933052
## 4 New York 13531404
## 5 Pennsylvania 9710416
## 6 Illinois 8979999
## 7 Ohio 8709050
## 8 Michigan 7380136
## 9 North Carolina 7107998
## 10 Georgia 6978660
## # ... with 40 more rows
From our result, we can see that California is the state with the most citizens.
You can combine multiple verbs together to answer increasingly complicated questions of your data. For example: “What are the US states where the most people walk to work?”
You’ll use the walk column, which offers a percentage of people in each county that walk to work, to add a new column and count based on it.
counties_selected <- counties %>%
select(state, population, walk)
counties_selected %>%
# Add population_walk containing the total number of people who walk to work
mutate(population_walk = population * walk / 100) %>%
# Count weighted by the new column
count(state, wt = population_walk, sort = TRUE)
## # A tibble: 50 x 2
## state n
## <fct> <dbl>
## 1 New York 1237938.
## 2 California 1017964.
## 3 Pennsylvania 505397.
## 4 Texas 430793.
## 5 Illinois 400346.
## 6 Massachusetts 316765.
## 7 Florida 284723.
## 8 New Jersey 273047.
## 9 Ohio 266911.
## 10 Washington 239764.
## # ... with 40 more rows
We can see that while California had the largest total population, New York state has the largest number of people who walk to work.
The summarize() verb is very useful for collapsing a large dataset into a single observation.
counties_selected <- counties %>%
select(county, population, income, unemployment)
counties_selected %>%
summarise(min_population = min(population), max_unemployment = max(unemployment), average_income = mean(income))
## min_population max_unemployment average_income
## 1 85 29.4 NA
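The NA for average_income comes from missing values in the income column. A minimal sketch of the same summary with those missing values dropped (the na.rm = TRUE argument is an addition for illustration, not part of the original exercise):
counties_selected %>%
summarize(average_income = mean(income, na.rm = TRUE))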
If we wanted to take this a step further, we could use filter() to determine the specific counties that returned the value for min_population and max_unemployment.
counties_selected %>%
filter(population == min(population))
## county population income unemployment
## 1 Kalawao 85 66250 0
counties_selected %>%
filter(unemployment == max(unemployment))
## county population income unemployment
## 1 Corson 4149 31676 29.4
Another interesting column is land_area, which shows the land area in square miles. Here, you’ll summarize both population and land area by state, with the purpose of finding the density (in people per square mile).
counties_selected <- counties %>%
select(state, county, population, land_area)
counties_selected %>%
group_by(state) %>%
summarize(total_area = sum(land_area),
total_population = sum(population)) %>%
mutate(density = total_population / total_area) %>%
arrange(desc(density))
## # A tibble: 50 x 4
## state total_area total_population density
## <fct> <dbl> <int> <dbl>
## 1 New Jersey 7354. 8904413 1211.
## 2 Rhode Island 1034. 1053661 1019.
## 3 Massachusetts 7800. 6705586 860.
## 4 Connecticut 4842. 3593222 742.
## 5 Maryland 9707. 5930538 611.
## 6 Delaware 1949. 926454 475.
## 7 New York 47126. 19673174 417.
## 8 Florida 53625. 19645772 366.
## 9 Pennsylvania 44743. 12779559 286.
## 10 Ohio 40861. 11575977 283.
## # ... with 40 more rows
Looks like New Jersey and Rhode Island are the “most crowded” of the US states, with more than a thousand people per square mile.
You can group by multiple columns instead of grouping by one. Here, you’ll practice aggregating by state and region, and notice how useful it is for performing multiple aggregations in a row.
counties_selected <- counties %>%
select(region, state, county, population)
counties_selected %>%
group_by(region, state) %>%
summarize(total_pop = sum(population)) %>%
summarize(average_pop = mean(total_pop), median_pop = median(total_pop))
## # A tibble: 4 x 3
## region average_pop median_pop
## <fct> <dbl> <dbl>
## 1 North Central 5628866. 5580644
## 2 Northeast 5600438. 2461161
## 3 South 7369565. 4804098
## 4 West 5723364. 2798636
It looks like the South has the highest average_pop of about 7369565, while the North Central region has the highest median_pop of 5580644.
Previously, you used the walk column, which offers a percentage of people in each county that walk to work, to add a new column and count to find the total number of people who walk to work in each county.
Now, you’re interested in finding the county within each region with the highest percentage of citizens who walk to work.
counties_selected <- counties %>%
select(region, state, county, metro, population, walk)
counties_selected %>%
group_by(region) %>%
top_n(1, walk)
## # A tibble: 4 x 6
## # Groups: region [4]
## region state county metro population walk
## <fct> <fct> <fct> <fct> <int> <dbl>
## 1 West Alaska Aleutians East Borough Nonmetro 3304 71.2
## 2 Northeast New York New York Metro 1629507 20.7
## 3 North Central North Dakota McIntosh Nonmetro 2759 17.5
## 4 South Virginia Lexington city Nonmetro 7071 31.7
Notice that three of the places where lots of people walk to work are low-population nonmetro counties, but that New York City also pops up!
You’ve been learning to combine multiple dplyr verbs together. Here, you’ll combine group_by(), summarize(), and top_n() to find the state in each region with the highest income.
When you group by multiple columns and then summarize, it’s important to remember that the summarize “peels off” one of the groups, but leaves the rest on. For example, if you group_by(X, Y) then summarize, the result will still be grouped by X.
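A quick sketch of that peeling behavior (group_vars() is a dplyr helper used here for illustration; it is not part of the original exercise):
counties %>%
group_by(region, state) %>%
summarize(total_pop = sum(population)) %>%
# group_vars() reports the remaining grouping: only "region", because summarize() peeled off "state"
group_vars()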
counties_selected <- counties %>%
select(region, state, county, population, income)
counties_selected %>%
group_by(region, state) %>%
# Calculate average income
summarise(average_income = mean(income)) %>%
# Find the highest income state in each region
top_n(1, average_income)
## # A tibble: 4 x 3
## # Groups: region [4]
## region state average_income
## <fct> <fct> <dbl>
## 1 North Central North Dakota 55575.
## 2 Northeast New Jersey 73014.
## 3 South Maryland 69200.
## 4 West Hawaii 64879
From our results, we can see that New Jersey in the Northeast is the state with the highest average_income of 73014.
In this chapter, you’ve learned to use five dplyr verbs related to aggregation: count(), group_by(), summarize(), ungroup(), and top_n(). In this exercise, you’ll use all of them to answer a question: In how many states do more people live in metro areas than non-metro areas?
Recall that the metro column has one of the two values “Metro” (for high-density city areas) or “Nonmetro” (for suburban and country areas).
counties_selected <- counties %>%
select(state, metro, population)
counties_selected %>%
group_by(state, metro) %>%
summarize(total_pop = sum(population)) %>%
top_n(1, total_pop) %>%
ungroup() %>%
count(metro)
## # A tibble: 1 x 2
## metro n
## <fct> <int>
## 1 "" 50
In the full counties data, 44 states have more people living in Metro areas and 6 states have more people living in Nonmetro areas (the metro column in the copy of the data used here is blank, which is why the count above collapses into a single empty level).
Learn advanced methods to select and transform columns. Also learn about select helpers, which are functions that specify criteria for columns you want to choose, as well as the rename and transmute verbs.
Using the select verb, we can answer interesting questions about our dataset by focusing on related groups of columns. The colon (:) is useful for getting many columns at a time.
glimpse(counties)
## Observations: 3,141
## Variables: 40
## $ census_id <int> 1001, 1003, 1005, 1007, 1009, 1011, 1013, 1015, ...
## $ state <fct> Alabama, Alabama, Alabama, Alabama, Alabama, Ala...
## $ county <fct> Autauga, Baldwin, Barbour, Bibb, Blount, Bullock...
## $ region <fct> South, South, South, South, South, South, South,...
## $ metro <fct> , , , , , , , , , , , , , , , , , , , , , , , , ,
## $ population <int> 55221, 195121, 26932, 22604, 57710, 10678, 20354...
## $ men <int> 26745, 95314, 14497, 12073, 28512, 5660, 9502, 5...
## $ women <int> 28476, 99807, 12435, 10531, 29198, 5018, 10852, ...
## $ hispanic <dbl> 2.6, 4.5, 4.6, 2.2, 8.6, 4.4, 1.2, 3.5, 0.4, 1.5...
## $ white <dbl> 75.8, 83.1, 46.2, 74.5, 87.9, 22.2, 53.3, 73.0, ...
## $ black <dbl> 18.5, 9.5, 46.7, 21.4, 1.5, 70.7, 43.8, 20.3, 40...
## $ native <dbl> 0.4, 0.6, 0.2, 0.4, 0.3, 1.2, 0.1, 0.2, 0.2, 0.6...
## $ asian <dbl> 1.0, 0.7, 0.4, 0.1, 0.1, 0.2, 0.4, 0.9, 0.8, 0.3...
## $ pacific <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0...
## $ citizens <int> 40725, 147695, 20714, 17495, 42345, 8057, 15581,...
## $ income <int> 51281, 50254, 32964, 38678, 45813, 31938, 32229,...
## $ income_err <int> 2391, 1263, 2973, 3995, 3141, 5884, 1793, 925, 2...
## $ income_per_cap <int> 24974, 27317, 16824, 18431, 20532, 17580, 18390,...
## $ income_per_cap_err <int> 1080, 711, 798, 1618, 708, 2055, 714, 489, 1366,...
## $ poverty <dbl> 12.9, 13.4, 26.7, 16.8, 16.7, 24.6, 25.4, 20.5, ...
## $ child_poverty <dbl> 18.6, 19.2, 45.3, 27.9, 27.2, 38.4, 39.2, 31.6, ...
## $ professional <dbl> 33.2, 33.1, 26.8, 21.5, 28.5, 18.8, 27.5, 27.3, ...
## $ service <dbl> 17.0, 17.7, 16.1, 17.9, 14.1, 15.0, 16.6, 17.7, ...
## $ office <dbl> 24.2, 27.1, 23.1, 17.8, 23.9, 19.7, 21.9, 24.2, ...
## $ construction <dbl> 8.6, 10.8, 10.8, 19.0, 13.5, 20.1, 10.3, 10.5, 1...
## $ production <dbl> 17.1, 11.2, 23.1, 23.7, 19.9, 26.4, 23.7, 20.4, ...
## $ drive <dbl> 87.5, 84.7, 83.8, 83.2, 84.9, 74.9, 84.5, 85.3, ...
## $ carpool <dbl> 8.8, 8.8, 10.9, 13.5, 11.2, 14.9, 12.4, 9.4, 11....
## $ transit <dbl> 0.1, 0.1, 0.4, 0.5, 0.4, 0.7, 0.0, 0.2, 0.2, 0.2...
## $ walk <dbl> 0.5, 1.0, 1.8, 0.6, 0.9, 5.0, 0.8, 1.2, 0.3, 0.6...
## $ other_transp <dbl> 1.3, 1.4, 1.5, 1.5, 0.4, 1.7, 0.6, 1.2, 0.4, 0.7...
## $ work_at_home <dbl> 1.8, 3.9, 1.6, 0.7, 2.3, 2.8, 1.7, 2.7, 2.1, 2.5...
## $ mean_commute <dbl> 26.5, 26.4, 24.1, 28.8, 34.9, 27.5, 24.6, 24.1, ...
## $ employed <int> 23986, 85953, 8597, 8294, 22189, 3865, 7813, 474...
## $ private_work <dbl> 73.6, 81.5, 71.8, 76.8, 82.0, 79.5, 77.4, 74.1, ...
## $ public_work <dbl> 20.9, 12.3, 20.8, 16.1, 13.5, 15.1, 16.2, 20.8, ...
## $ self_employed <dbl> 5.5, 5.8, 7.3, 6.7, 4.2, 5.4, 6.2, 5.0, 2.8, 7.9...
## $ family_work <dbl> 0.0, 0.4, 0.1, 0.4, 0.4, 0.0, 0.2, 0.1, 0.0, 0.5...
## $ unemployment <dbl> 7.6, 7.5, 17.6, 8.3, 7.7, 18.0, 10.9, 12.3, 8.9,...
## $ land_area <dbl> 594.44, 1589.78, 884.88, 622.58, 644.78, 622.81,...
counties %>%
# Select state, county, population, and industry-related columns
select(state, county, population, professional:production) %>%
# Arrange service in descending order
arrange(desc(service))
Notice that when you select a group of related variables, it’s easy to find the insights you’re looking for.
In the video you learned about the select helper starts_with(). Another select helper is ends_with(), which finds the columns that end with a particular string.
counties %>%
# Select the state, county, population, and those ending with "work"
select(state, county, population, ends_with("work")) %>%
# Filter for counties that have at least 50% of people engaged in public work
filter(public_work >= 50)
## state county population private_work public_work
## 1 Alaska Kusilvak Census Area 7914 45.4 53.8
## 2 Alaska Lake and Peninsula Borough 1474 42.2 51.6
## 3 Alaska Yukon-Koyukuk Census Area 5644 33.3 61.7
## 4 California Lassen 32645 42.6 50.5
## 5 Hawaii Kalawao 85 25.0 64.1
## 6 North Dakota Sioux 4380 32.9 56.8
## 7 South Dakota Oglala Lakota 14153 29.5 66.2
## 8 South Dakota Todd 9942 34.4 55.0
## 9 Wisconsin Menominee 4451 36.8 59.1
## family_work
## 1 0.3
## 2 0.2
## 3 0.0
## 4 0.1
## 5 0.0
## 6 0.1
## 7 0.0
## 8 0.8
## 9 0.4
It looks like only a few counties have more than half the population working for the government.
The rename() verb is often useful for changing the name of a column that comes out of another verb, such as count(). In this exercise, you’ll rename the n column from count() (which you learned about in Chapter 2) to something more descriptive.
# Rename the n column to num_counties
counties %>%
count(state) %>%
rename(num_counties = n)
## # A tibble: 50 x 2
## state num_counties
## <fct> <int>
## 1 Alabama 67
## 2 Alaska 29
## 3 Arizona 15
## 4 Arkansas 75
## 5 California 58
## 6 Colorado 64
## 7 Connecticut 8
## 8 Delaware 3
## 9 Florida 67
## 10 Georgia 159
## # ... with 40 more rows
Notice the difference between column names in the output from the first step to the second step. Don’t forget, using rename() isn’t the only way to choose a new name for a column!
rename() isn’t the only way you can choose a new name for a column: you can also choose a name as part of a select().
# Select state, county, and poverty as poverty_rate
counties %>%
select(state, county, poverty_rate = poverty)
As you can see, we were able to select the three columns of interest from our dataset, and rename one of them, using only the select() verb!
Source: DataCamp
Recall, you can think of transmute() as a combination of select() and mutate(), since you are getting back a subset of columns, but you are transforming and changing them at the same time.
As you learned in the video, the transmute verb allows you to control which variables you keep, which variables you calculate, and which variables you drop.
counties %>%
# Keep the state, county, and populations columns, and add a density column
transmute(state, county, population, density = population / land_area) %>%
# Filter for counties with a population greater than one million
filter(population > 1000000) %>%
# Sort density in ascending order
arrange(density)
## state county population density
## 1 California San Bernardino 2094769 104.4411
## 2 Nevada Clark 2035572 257.9472
## 3 California Riverside 2298032 318.8841
## 4 Arizona Maricopa 4018143 436.7480
## 5 Florida Palm Beach 1378806 699.9868
## 6 California San Diego 3223096 766.1943
## 7 Washington King 2045756 966.9999
## 8 Texas Travis 1121645 1132.7459
## 9 Florida Hillsborough 1302884 1277.0743
## 10 Florida Orange 1229039 1360.4142
## 11 Florida Miami-Dade 2639042 1390.6382
## 12 Michigan Oakland 1229503 1417.0332
## 13 California Santa Clara 1868149 1448.0653
## 14 Utah Salt Lake 1078958 1453.5728
## 15 Texas Bexar 1825502 1472.3928
## 16 California Sacramento 1465832 1519.5638
## 17 Florida Broward 1843152 1523.5305
## 18 California Contra Costa 1096068 1530.9495
## 19 New York Suffolk 1501373 1646.1521
## 20 Pennsylvania Allegheny 1231145 1686.3152
## 21 Massachusetts Middlesex 1556116 1902.7610
## 22 Missouri St. Louis 1001327 1971.8925
## 23 Maryland Montgomery 1017859 2071.9776
## 24 California Alameda 1584983 2144.7092
## 25 Minnesota Hennepin 1197776 2163.6518
## 26 Texas Tarrant 1914526 2216.8873
## 27 Ohio Franklin 1215761 2284.4492
## 28 California Los Angeles 10038388 2473.8011
## 29 Texas Harris 4356362 2557.3309
## 30 Ohio Cuyahoga 1263189 2762.9410
## 31 Texas Dallas 2485003 2852.1291
## 32 Virginia Fairfax 1128722 2886.9785
## 33 Michigan Wayne 1778969 2906.4322
## 34 California Orange 3116069 3941.5472
## 35 New York Nassau 1354612 4757.6988
## 36 Illinois Cook 5236393 5539.2223
## 37 Pennsylvania Philadelphia 1555072 11596.3609
## 38 New York Queens 2301139 21202.7919
## 39 New York Bronx 1428357 33927.7197
## 40 New York Kings 2595259 36645.8486
## 41 New York New York 1629507 71375.6899
Looks like San Bernardino is the lowest-density county with a population of more than one million.
We’ve learned a number of new verbs in this chapter that you can use to modify and change the variables you have.
select: Keeps only the columns you mention; doesn’t allow you to calculate or change values.
rename: Leaves the columns you don’t mention alone; doesn’t allow you to calculate or change values.
transmute: Must mention all the columns you keep; allows you to calculate or change values.
mutate: Leaves the columns you don’t mention alone; allows you to calculate or change values.
Let’s continue practicing using the verbs to gain a better understanding of the differences between them.
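A minimal sketch contrasting two of these verbs on the same calculation (the density column is added here purely for illustration):
# mutate() keeps all existing columns and adds density
counties %>%
mutate(density = population / land_area)
# transmute() returns only the columns named in the call
counties %>%
transmute(state, county, density = population / land_area)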
In this chapter you’ve learned about the four verbs: select, mutate, transmute, and rename. Here, you’ll choose the appropriate verb for each situation. You won’t need to change anything inside the parentheses.
# Change the name of the unemployment column
counties %>%
rename(unemployment_rate = unemployment)
# Keep the state and county columns, and the columns containing poverty
counties %>%
select(state, county, contains("poverty"))
# Calculate the fraction_women column without dropping the other columns
counties %>%
mutate(fraction_women = women / population)
# Keep only the state, county, and employment_rate columns
counties %>%
transmute(state, county, employment_rate = employed / population)
Now you know which verb to choose depending on whether you want to keep, drop, rename, or change a variable in the dataset.
Work with a new dataset that represents the names of babies born in the United States each year. Learn how to use grouped mutates and window functions to ask and answer more complex questions about your data. And use a combination of dplyr and ggplot2 to make interesting graphs to further explore your data.
library(babynames)
## Warning: package 'babynames' was built under R version 3.5.3
library(dplyr)
library(tidyr)
## Warning: package 'tidyr' was built under R version 3.5.3
babynames <- babynames %>%
select(year, name, number = n) %>%
group_by(year, name) %>%
summarise(number = sum(number)) %>%
# ungroup() %>%
arrange(year, name)
babynames
## # A tibble: 1,756,284 x 3
## # Groups: year [138]
## year name number
## <dbl> <chr> <int>
## 1 1880 Aaron 102
## 2 1880 Ab 5
## 3 1880 Abbie 71
## 4 1880 Abbott 5
## 5 1880 Abby 6
## 6 1880 Abe 50
## 7 1880 Abel 9
## 8 1880 Abigail 12
## 9 1880 Abner 27
## 10 1880 Abraham 81
## # ... with 1,756,274 more rows
The dplyr verbs you’ve learned are useful for exploring data. For instance, you could find out the most common names in a particular year.
babynames %>%
# Filter for the year 1990
filter(year == 1990) %>%
# Sort the number column in descending order
arrange(desc(number))
## # A tibble: 22,678 x 3
## # Groups: year [1]
## year name number
## <dbl> <chr> <int>
## 1 1990 Michael 65560
## 2 1990 Christopher 52520
## 3 1990 Jessica 46615
## 4 1990 Ashley 45797
## 5 1990 Matthew 44925
## 6 1990 Joshua 43382
## 7 1990 Brittany 36650
## 8 1990 Amanda 34504
## 9 1990 Daniel 33963
## 10 1990 David 33862
## # ... with 22,668 more rows
It looks like the most common names for babies born in the US in 1990 were Michael, Christopher, and Jessica.
You saw that you could use filter() and arrange() to find the most common names in one year. However, you could also use group_by and top_n to find the most common name in every year.
# Find the most common name in each year
babynames %>%
group_by(year) %>%
top_n(1, number)
## # A tibble: 138 x 3
## # Groups: year [138]
## year name number
## <dbl> <chr> <int>
## 1 1880 John 9701
## 2 1881 John 8795
## 3 1882 John 9597
## 4 1883 John 8934
## 5 1884 John 9428
## 6 1885 Mary 9166
## 7 1886 Mary 9921
## 8 1887 Mary 9935
## 9 1888 Mary 11804
## 10 1889 Mary 11689
## # ... with 128 more rows
It looks like John was the most common name in 1880, and Mary was the most common name for a while after that.
The dplyr package is very useful for exploring data, but it’s especially useful when combined with other tidyverse packages like ggplot2. (As of tidyverse 1.3.0, the following packages are included in the core tidyverse: dplyr, ggplot2, tidyr, readr, purrr, tibble, stringr, forcats. To make sure you are able to access the package, install all the packages in the tidyverse by running install.packages("tidyverse"), then run library(tidyverse) to load the core tidyverse and make it available in your current R session.)
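A minimal sketch of that setup (assuming the tidyverse is not yet installed; ggplot2 is needed for the plots below):
# Install once, then load the core tidyverse at the start of each session
install.packages("tidyverse")
library(tidyverse)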
# Filter for the names Steven, Thomas, and Matthew
selected_names <- babynames %>%
filter(name %in% c("Steven", "Thomas", "Matthew"))
# Plot the names using a different color for each name
ggplot(selected_names, aes(x = year, y = number, color = name)) +
geom_line()
It looks like names like Steven and Thomas were common in the 1950s, but Matthew became common more recently.
In an earlier video, you learned how to filter for a particular name to determine the frequency of that name over time. Now, you’re going to explore which year each name was the most common.
To do this, you’ll be combining the grouped mutate approach with a top_n.
# Calculate the fraction of people born each year with the same name
babynames %>%
group_by(year) %>%
mutate(year_total = sum(number)) %>%
ungroup() %>%
mutate(fraction = number / year_total) %>%
# Find the year each name is most common
group_by(name) %>%
top_n(1, fraction)
## # A tibble: 97,310 x 5
## # Groups: name [97,310]
## year name number year_total fraction
## <dbl> <chr> <int> <int> <dbl>
## 1 1880 Abbott 5 201484 0.0000248
## 2 1880 Abe 50 201484 0.000248
## 3 1880 Adelbert 28 201484 0.000139
## 4 1880 Adella 26 201484 0.000129
## 5 1880 Agustus 5 201484 0.0000248
## 6 1880 Albert 1493 201484 0.00741
## 7 1880 Albertus 5 201484 0.0000248
## 8 1880 Alcide 7 201484 0.0000347
## 9 1880 Alonzo 122 201484 0.000606
## 10 1880 Amos 128 201484 0.000635
## # ... with 97,300 more rows
Notice that the results are ordered by year and then name, so the first few entries are names starting with the letter A whose most popular year was 1880.
In the video, you learned how you could group by the year and use mutate() to add a total for that year.
In these exercises, you’ll learn to normalize by a different, but also interesting metric: you’ll divide each name by the maximum for that name. This means that every name will peak at 1.
Once you add new columns, the result will still be grouped by name. This splits it into about 97,000 groups, which actually makes later steps like mutate slower.
babynames %>%
group_by(name) %>%
mutate(name_total = sum(number),
name_max = max(number)) %>%
# Ungroup the table
ungroup() %>%
# Add the fraction_max column containing the number divided by the name maximum
mutate(fraction_max = number / name_max)
## # A tibble: 1,756,284 x 6
## year name number name_total name_max fraction_max
## <dbl> <chr> <int> <int> <int> <dbl>
## 1 1880 Aaron 102 579589 15411 0.00662
## 2 1880 Ab 5 362 41 0.122
## 3 1880 Abbie 71 21716 536 0.132
## 4 1880 Abbott 5 1020 59 0.0847
## 5 1880 Abby 6 57756 2048 0.00293
## 6 1880 Abe 50 9158 280 0.179
## 7 1880 Abel 9 50236 3245 0.00277
## 8 1880 Abigail 12 357031 15948 0.000752
## 9 1880 Abner 27 7641 202 0.134
## 10 1880 Abraham 81 88852 2575 0.0315
## # ... with 1,756,274 more rows
This tells you, for example, that the name Abe was at about 17.9% of its peak in the year 1880.
You picked a few names and calculated each of them as a fraction of their peak. This is a type of “normalizing” a name, where you’re focused on the relative change within each name rather than the overall popularity of the name.
In this exercise, you’ll visualize the normalized popularity of each name. Your work from the previous exercise, names_normalized, has been provided for you.
names_normalized <- babynames %>%
group_by(name) %>%
mutate(name_total = sum(number),
name_max = max(number)) %>%
ungroup() %>%
mutate(fraction_max = number / name_max)
# Filter for the names Steven, Thomas, and Matthew
names_filtered <- names_normalized %>%
filter(name %in% c("Steven", "Thomas", "Matthew"))
# Visualize these names over time
ggplot(names_filtered, aes(x = year, y = fraction_max, color = name)) +
geom_line()
As you can see, the line for each name hits a peak at 1, although the peak year differs for each name.
The code shown at the end of the video (finding the variable difference for all names) contains a mistake. Can you spot it? Below is the corrected code.
# As we did before, but now naming it babynames_fraction
babynames_fraction <- babynames %>%
group_by(year) %>%
mutate(year_total = sum(number)) %>%
ungroup() %>%
mutate(fraction = number / year_total)
# Just for Matthew
babynames_fraction %>%
filter(name == "Matthew") %>%
arrange(year) %>%
# Display change in prevalence from year to year
mutate(difference = fraction - lag(fraction)) %>%
# Arrange in descending order
arrange(desc(difference))
## # A tibble: 138 x 6
## year name number year_total fraction difference
## <dbl> <chr> <int> <int> <dbl> <dbl>
## 1 1981 Matthew 43531 3459182 0.0126 0.00154
## 2 1983 Matthew 50531 3462826 0.0146 0.00138
## 3 1971 Matthew 22653 3432585 0.00660 0.000982
## 4 1967 Matthew 13629 3395130 0.00401 0.000894
## 5 1973 Matthew 24658 3017412 0.00817 0.000835
## 6 1974 Matthew 27332 3040409 0.00899 0.000818
## 7 1978 Matthew 34468 3174268 0.0109 0.000741
## 8 1972 Matthew 23066 3143627 0.00734 0.000738
## 9 1968 Matthew 15915 3378876 0.00471 0.000696
## 10 1982 Matthew 46333 3507664 0.0132 0.000625
## # ... with 128 more rows
# For all names
babynames_fraction %>%
group_by(name) %>%
# Display change in prevalence from year to year
mutate(difference = fraction - lag(fraction)) %>%
# Arrange by name, then year
arrange(name, year)
## # A tibble: 1,756,284 x 6
## # Groups: name [97,310]
## year name number year_total fraction difference
## <dbl> <chr> <int> <int> <dbl> <dbl>
## 1 2007 Aaban 5 3994007 0.00000125 NA
## 2 2009 Aaban 6 3815638 0.00000157 3.21e-7
## 3 2010 Aaban 9 3690700 0.00000244 8.66e-7
## 4 2011 Aaban 11 3651914 0.00000301 5.74e-7
## 5 2012 Aaban 11 3650462 0.00000301 1.20e-9
## 6 2013 Aaban 14 3637310 0.00000385 8.36e-7
## 7 2014 Aaban 16 3696311 0.00000433 4.80e-7
## 8 2015 Aaban 15 3688687 0.00000407 -2.62e-7
## 9 2016 Aaban 9 3652968 0.00000246 -1.60e-6
## 10 2017 Aaban 11 3546301 0.00000310 6.38e-7
## # ... with 1,756,274 more rows
In the video, you learned how to find the difference in the frequency of a baby name between consecutive years. What if instead of finding the difference, you wanted to find the ratio?
You’ll start with the babynames_fraction data already, so that you can consider the popularity of each name within each year.
babynames_ratio <- babynames_fraction %>%
# Arrange the data in order of name, then year
arrange(name, year) %>%
# Group the data by name
group_by(name) %>%
# Add a ratio column that contains the ratio between each year
mutate(ratio = fraction / lag(fraction))
Notice that the first observation for each name is missing a ratio, since there is no previous year.
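A quick illustration of why, using dplyr’s lag() on a small made-up vector:
# lag() shifts the values down by one position, so the first element has no previous value
lag(c(10, 20, 40))
# returns NA 10 20, which is why each name's first year gets an NA ratio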
Previously, you added a ratio column giving the ratio of a baby name’s frequency between consecutive years, as a way of describing changes in a name’s popularity. Now, you’ll look at a subset of that data, called babynames_ratios_filtered, to look further into the names that experienced the biggest jumps in popularity in consecutive years.
babynames_ratios_filtered <- babynames_ratio %>%
filter(fraction >= 0.00001)
babynames_ratios_filtered %>%
# Extract the largest ratio from each name
top_n(1, ratio) %>%
# Sort the ratio column in descending order
arrange(desc(ratio)) %>%
# Filter for fractions greater than or equal to 0.001
filter(fraction >= 0.001)
## # A tibble: 155 x 6
## # Groups: name [155]
## year name number year_total fraction ratio
## <dbl> <chr> <int> <int> <dbl> <dbl>
## 1 1957 Tammy 4398 4200007 0.00105 16.3
## 2 1912 Woodrow 1854 988064 0.00188 9.99
## 3 1931 Marlene 2599 2104071 0.00124 8.97
## 4 1898 Dewey 1219 381458 0.00320 6.48
## 5 2010 Bentley 4001 3690700 0.00108 6.21
## 6 1884 Grover 809 243462 0.00332 5.68
## 7 1984 Jenna 5898 3487820 0.00169 5.01
## 8 1991 Mariah 5200 3894329 0.00134 4.78
## 9 1943 Cheryl 2894 2822127 0.00103 4.75
## 10 1989 Ethan 4067 3843559 0.00106 4.37
## # ... with 145 more rows
Some of these can be interpreted: for example, Grover Cleveland was a president elected in 1884.
You’ll find all of these skills valuable in other DataCamp courses.
It’s commonly said that data scientists spend 80% of their time cleaning and manipulating data and only 20% of their time analyzing it. The time spent cleaning is vital since analyzing dirty data can lead you to draw inaccurate conclusions.
In this course, you’ll learn how to clean dirty data. Using R, you’ll learn how to identify values that don’t look right and fix dirty data by converting data types, filling in missing values, and using fuzzy string matching. As you learn, you’ll brush up on your skills by working with real-world datasets, including bike-share trips, customer asset portfolios, and restaurant reviews—developing the skills you need to go from raw data to awesome insights as quickly and accurately as possible!
In this chapter, you’ll learn how to overcome some of the most common dirty data problems. You’ll convert data types, apply range constraints to remove future data points, and remove duplicated data points to avoid double-counting.
Errors can be introduced by typos and misspellings. Dirty data can enter the data science workflow before we even access the data, and if we don’t address these errors early on, they can follow us all the way through the workflow.
Errors appearing throughout the data science workflow. Source: DataCamp
A quick reminder of data type constraints:
Data types vs. data types in R. Source: DataCamp
Solution. Source: DataCamp
Correctly identifying the types in your data is one of the easiest ways to avoid hampering your analysis with data type constraints in the long run.
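A small sketch (with made-up values) of spotting and fixing a data type constraint before it causes trouble:
# Values read in with symbols become character, so arithmetic on them fails
revenue <- c("$1,200", "$950")
class(revenue)
# Strip the symbols and convert to numeric: 1200 950
as.numeric(gsub("[$,]", "", revenue))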
Throughout this chapter, you’ll be working with San Francisco bike share ride data called bike_share_rides. It contains information on start and end stations of each trip, the trip duration, and some user information.
Before beginning to analyze any dataset, it’s important to take a look at the different types of columns you’ll be working with, which you can do using glimpse().
In this exercise, you’ll take a look at the data types contained in bike_share_rides and see how an incorrect data type can flaw your analysis.
dplyr was previously loaded, so we just load assertive. bike_share_rides is not available, so code requiring it will not be run.
# Load assertive package
library(assertive)
# Glimpse at bike_share_rides
glimpse(bike_share_rides)
# Summary of user_birth_year
summary(bike_share_rides$user_birth_year)
Question
The summary statistics of user_birth_year don’t seem to offer much useful information about the different birth years in our dataset. Why do you think that is?
The user_birth_year column is not of the correct type and should be converted to a character.
The user_birth_year column has an infinite set of possible values and should be converted to a factor.
The user_birth_year column represents groupings of data and should be converted to a factor. [Correct]
# Convert user_birth_year to factor: user_birth_year_fct
bike_share_rides <- bike_share_rides %>%
mutate(user_birth_year_fct = as.factor(user_birth_year))
# Assert user_birth_year_fct is a factor
assert_is_factor(bike_share_rides$user_birth_year_fct)
# Summary of user_birth_year_fct
summary(bike_share_rides$user_birth_year_fct)
Looking at the new summary statistics, more riders were born in 1988 than any other year.
In the previous exercise, you were able to identify the correct data type and convert user_birth_year to the correct type, allowing you to extract counts that gave you a bit more insight into the dataset.
Another common dirty data problem is having extra bits like percent signs or periods in numbers, causing them to be read in as characters. In order to be able to crunch these numbers, the extra bits need to be removed and the numbers need to be converted from character to numeric. In this exercise, you’ll need to convert the duration column from character to numeric, but before this can happen, the word "minutes" needs to be removed from each value.
# Load stringr package
library(stringr)
## Warning: package 'stringr' was built under R version 3.5.3
bike_share_rides <- bike_share_rides %>%
# Remove 'minutes' from duration: duration_trimmed
mutate(duration_trimmed = str_remove(duration, "minutes"),
# Convert duration_trimmed to numeric: duration_mins
duration_mins = as.numeric(duration_trimmed))
# Glimpse at bike_share_rides
glimpse(bike_share_rides)
# Assert duration_mins is numeric
assert_is_numeric(bike_share_rides$duration_mins)
# Calculate mean duration
mean(bike_share_rides$duration_mins)
By removing characters and converting to a numeric type, you were able to figure out that the average ride duration is about 13 minutes - not bad for a city like San Francisco!
What to do about values outside of a variable’s allowed range? Common options are removing the offending rows, treating the values as missing (NA), or replacing them with the range limit (for example, replacing a 6 on a 1 to 5 rating scale with 5).
Values that are out of range can throw off an analysis, so it’s important to catch them early on. In this exercise, you’ll be examining the duration_min column more closely. Bikes are not allowed to be kept out for more than 24 hours, or 1440 minutes at a time, but issues with some of the bikes caused inaccurate recording of the time they were returned.
In this exercise, you’ll replace erroneous data with the range limit (1440 minutes), however, you could just as easily replace these values with NAs.
# Create breaks
breaks <- c(min(bike_share_rides$duration_min), 0, 1440, max(bike_share_rides$duration_min))
# Create a histogram of duration_min
ggplot(bike_share_rides, aes(duration_min)) +
geom_histogram(breaks = breaks)
# duration_min_const: replace vals of duration_min > 1440 with 1440
bike_share_rides <- bike_share_rides %>%
mutate(duration_min_const = replace(duration_min, duration_min > 1440, 1440))
# Make sure all values of duration_min_const are between 0 and 1440
assert_all_are_in_closed_range(bike_share_rides$duration_min_const, lower = 0, upper = 1440)
The method of replacing erroneous data with the range limit works well, but you could just as easily replace these values with NAs or something else instead.
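If you prefer the NA route mentioned above, a minimal sketch would be (duration_min_na is a hypothetical column name):
bike_share_rides <- bike_share_rides %>%
# Treat out-of-range durations as missing instead of capping them
mutate(duration_min_na = replace(duration_min, duration_min > 1440, NA))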
Something has gone wrong and it looks like you have data with dates from the future, which is way outside of the date range you expected to be working with. To fix this, you’ll need to remove any rides from the dataset that have a date in the future. Before you can do this, the date column needs to be converted from a character to a Date. Having these as Date objects will make it much easier to figure out which rides are from the future, since R makes it easy to check if one Date object is before (<) or after (>) another.
# Convert date to Date type
bike_share_rides <- bike_share_rides %>%
mutate(date = as.Date(date))
# Make sure all dates are in the past
assert_all_are_in_past(bike_share_rides$date)
# Filter for rides that occurred before or on today's date
bike_share_rides_past <- bike_share_rides %>%
filter(date <= today())
# Make sure all dates from bike_share_rides_past are in the past
assert_all_are_in_past(bike_share_rides_past$date)
Handling data from the future like this is much easier than trying to verify the data’s correctness by time traveling.
Duplicate entries. Source: DataCamp
Source of duplicates. Source: DataCamp
You’ve been notified that an update has been made to the bike sharing data pipeline to make it more efficient, but that duplicates are more likely to be generated as a result. To make sure that you can continue using the same scripts to run your weekly analyses about ride statistics, you’ll need to ensure that any duplicates in the dataset are removed first.
When multiple rows of a data frame share the same values for all columns, they’re full duplicates of each other. Removing duplicates like this is important, since having the same value repeated multiple times can alter summary statistics like the mean and median. Each ride, including its ride_id should be unique.
# Count the number of full duplicates
sum(duplicated(bike_share_rides))
# Remove duplicates
bike_share_rides_unique <- distinct(bike_share_rides)
# Count the full duplicates in bike_share_rides_unique
sum(duplicated(bike_share_rides_unique))
Removing full duplicates will ensure that summary statistics aren’t altered by repeated data points.
Now that you’ve identified and removed the full duplicates, it’s time to check for partial duplicates. Partial duplicates are a bit trickier to deal with than full duplicates. In this exercise, you’ll first identify any partial duplicates and then practice the most common technique to deal with them, which involves dropping all partial duplicates, keeping only the first.
# Find duplicated ride_ids
bike_share_rides %>%
# Count the number of occurrences of each ride_id
count(ride_id) %>%
# Filter for rows with a count > 1
filter(n > 1)
# Remove full and partial duplicates
bike_share_rides_unique <- bike_share_rides %>%
# Only based on ride_id instead of all cols
distinct(ride_id, .keep_all = TRUE)
# Find duplicated ride_ids in bike_share_rides_unique
bike_share_rides_unique %>%
# Count the number of occurrences of each ride_id
count(ride_id) %>%
# Filter for rows with a count > 1
filter(n > 1)
It’s important to consider the data you’re working with before removing partial duplicates, since sometimes it’s expected that there will be partial duplicates in a dataset, such as if the same customer makes multiple purchases.
Another way of handling partial duplicates is to compute a summary statistic of the values that differ between partial duplicates, such as mean, median, maximum, or minimum. This can come in handy when you’re not sure how your data was collected and want an average, or if based on domain knowledge, you’d rather have too high of an estimate than too low of an estimate (or vice versa).
bike_share_rides %>%
# Group by ride_id and date
group_by(ride_id, date) %>%
# Add duration_min_avg column
mutate(duration_min_avg = mean(duration_min) ) %>%
# Remove duplicates based on ride_id and date, keep all cols
distinct(ride_id, date, .keep_all = TRUE) %>%
# Remove duration_min column
select(-duration_min)
Aggregation of partial duplicates allows you to keep some information about all data points instead of keeping information about just one data point.
Categorical and text data can often be some of the messiest parts of a dataset due to their unstructured nature. In this chapter, you’ll learn how to fix whitespace and capitalization inconsistencies in category labels, collapse multiple categories into one, and reformat strings for consistency.
In R, categories are stored as factors. Internally, factors are stored as numbers, and each number has a corresponding label.
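As a quick illustration (not part of the course code) of how a factor pairs integer codes with labels:
sizes <- factor(c("small", "large", "small"))
as.integer(sizes)  # underlying integer codes: 2 1 2 (levels are sorted alphabetically)
levels(sizes)      # corresponding labels: "large" "small"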
Factors. Source: DataCamp
Values that don’t belong. Source: DataCamp
How do we end up with values outside those allowed by our factors?
Source: DataCamp
Filtering joins are a type of join that keeps or removes observations from the first table, but doesn’t add any new columns.
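As a minimal sketch with invented toy data (not from the course) contrasting the two filtering joins:
library(dplyr)
ratings <- tibble(id = 1:4, rating = c("Good", "Average", "Terrible", "Great"))
allowed <- tibble(rating = c("Good", "Average", "Great"))
# semi_join(): keep rows of ratings whose rating appears in allowed
semi_join(ratings, allowed, by = "rating")
# anti_join(): keep rows of ratings whose rating does NOT appear in allowed
anti_join(ratings, allowed, by = "rating")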
Semi-joins vs. anti-joins. Source: DataCamp
Values that aren’t factors. Source: DataCamp
So far in the course, you’ve learned about a number of different problems you can run into when you have dirty data, including
It’s important to be able to correctly identify the type of problem you’re dealing with so that you can treat it correctly. In this exercise, you’ll practice identifying these problems by mapping dirty data scenarios to their constraint type.
Solution. Source: DataCamp
Being able to identify what kinds of errors are in your data is important so that you know how to go about fixing them.
Now that you’ve practiced identifying membership constraint problems, it’s time to fix these problems in a new dataset. Throughout this chapter, you’ll be working with a dataset called sfo_survey, containing survey responses from passengers taking flights from San Francisco International Airport (SFO). Participants were asked questions about the airport’s cleanliness, wait times, safety, and their overall satisfaction.
There were a few issues during data collection that resulted in some inconsistencies in the dataset. In this exercise, you’ll be working with the dest_size column, which categorizes the size of the destination airport that the passengers were flying to. A data frame called dest_sizes is available that contains all the possible destination sizes. Your mission is to find rows with invalid dest_sizes and remove them from the data frame.
# Count the number of occurrences of dest_size
sfo_survey %>%
count(dest_size)
# Find bad dest_size rows
sfo_survey %>%
# Join with dest_sizes data frame to get bad dest_size rows
anti_join(dest_sizes, by = "dest_size") %>%
# Select id, airline, destination, and dest_size cols
select(id, airline, destination, dest_size)
# Remove bad dest_size rows
sfo_survey %>%
# Join with dest_sizes
semi_join(dest_sizes, by = "dest_size") %>%
# Count the number of each dest_size
count(dest_size)
Anti-joins can help you identify the rows that are causing issues, and semi-joins can remove the issue-causing rows. In the next lesson, you’ll learn about other ways to deal with bad values so that you don’t have to lose rows of data.
Categorical data problems Source: DataCamp
In the video exercise, you learned about different kinds of inconsistencies that can occur within categories, making it look like a variable has more categories than it should.
In this exercise, you’ll continue working with the sfo_survey dataset. You’ll examine the dest_size column again as well as the cleanliness column and determine what kind of issues, if any, these two categorical variables face.
# Count dest_size
sfo_survey %>%
count(dest_size)
# Count cleanliness
sfo_survey %>%
count(cleanliness)
In the next exercise, you’ll fix these inconsistencies to get more accurate counts.
Now that you’ve identified that dest_size has whitespace inconsistencies and cleanliness has capitalization inconsistencies, you’ll use the new tools at your disposal to fix the inconsistent values in sfo_survey instead of removing the data points entirely, which could add bias to your dataset if more than 5% of the data points need to be dropped.
# Add new columns to sfo_survey
sfo_survey <- sfo_survey %>%
# dest_size_trimmed: dest_size without whitespace
mutate(dest_size_trimmed = str_trim(dest_size),
# cleanliness_lower: cleanliness converted to lowercase
cleanliness_lower = str_to_lower(cleanliness))
# Count values of dest_size_trimmed
sfo_survey %>%
count(dest_size_trimmed)
# Count values of cleanliness_lower
sfo_survey %>%
count(cleanliness_lower)
You were able to convert seven-category data into four-category data, which will help your analysis go more smoothly.
One of the tablets that participants filled out the sfo_survey on was not properly configured, allowing the response for dest_region to be free text instead of a dropdown menu. This resulted in some inconsistencies in the dest_region variable that you’ll need to correct in this exercise to ensure that the numbers you report to your boss are as accurate as possible.
# Count categories of dest_region
sfo_survey %>%
count(dest_region)
# Categories to map to Europe
europe_categories <- c("EU", "eur", "Europ")
# Add a new col dest_region_collapsed
sfo_survey %>%
# Map all categories in europe_categories to Europe
mutate(dest_region_collapsed = fct_collapse(dest_region,
Europe = europe_categories)) %>%
# Count categories of dest_region_collapsed
count(dest_region_collapsed)
You’ve reduced the number of categories from 12 to 9, and you can now be confident that 401 of the survey participants were heading to Europe.
Text data. Source: DataCamp
Unstructured data problems. Source: DataCamp
More complex text problems. Source: DataCamp
You’ve recently received some news that the customer support team wants to ask the SFO survey participants some follow-up questions. However, the auto-dialer that the call center uses isn’t able to parse all of the phone numbers since they’re all in different formats. After some investigation, you found that some phone numbers are written with hyphens (-) and some are written with parentheses ((,)). In this exercise, you’ll figure out which phone numbers have these issues so that you know which ones need fixing.
# Filter for rows with "-" in the phone column
sfo_survey %>%
filter(str_detect(phone, "-"))
# Filter for rows with "(" or ")" in the phone column
sfo_survey %>%
filter(str_detect(phone, fixed("(")) | str_detect(phone, fixed(")")))
Now that you’ve identified the inconsistencies in the phone column, it’s time to remove unnecessary characters to make the follow-up survey go as smoothly as possible.
In the last exercise, you saw that the phone column of sfo_survey is plagued with unnecessary parentheses and hyphens. The customer support team has requested that all phone numbers be in the format “123 456 7890”. In this exercise, you’ll use your new stringr skills to fulfill this request.
# Remove parentheses from phone column
phone_no_parens <- sfo_survey$phone %>%
# Remove "("s
str_remove_all(fixed("(")) %>%
# Remove ")"s
str_remove_all(fixed(")"))
# Add phone_no_parens as column
sfo_survey %>%
mutate(phone_no_parens = phone_no_parens)
# Add phone_no_parens as column
sfo_survey %>%
mutate(phone_no_parens = phone_no_parens,
# Replace all hyphens in phone_no_parens with spaces
phone_clean = str_replace_all(phone_no_parens, "-", " "))
Now that your phone numbers are all in a single format, the machines in the call center will be able to auto-dial the numbers, making it easier to ask participants follow-up questions.
The customer support team is grateful for your work so far, but during their first day of calling participants, they ran into some phone numbers that were invalid. In this exercise, you’ll remove any rows with invalid phone numbers so that these faulty numbers don’t keep slowing the team down.
# Check out the invalid numbers
sfo_survey %>%
filter(str_length(phone) != 12)
# Remove rows with invalid numbers
sfo_survey %>%
filter(str_length(phone) == 12)
Thanks to your savvy string skills, the follow-up survey will be done in no time!
In this chapter, you’ll dive into more advanced data cleaning problems, such as ensuring that weights are all written in kilograms instead of pounds. You’ll also gain invaluable skills that will help you verify that values have been added correctly and that missing values don’t negatively impact your analyses.
Uniformity. Source: DataCamp
Sources of uniformity issues. Source: DataCamp
What to do about a lack of uniformity? Source: DataCamp
Date uniformity. Source: DataCamp
Ambiguous dates. Source: DataCamp
In this chapter, you work at an asset management company and you’ll be working with the accounts dataset, which contains information about each customer, the amount in their account, and the date their account was opened. Your boss has asked you to calculate some summary statistics about the average value of each account and whether the age of the account is associated with a higher or lower account value. Before you can do this, you need to make sure that the accounts dataset you’ve been given doesn’t contain any uniformity problems. In this exercise, you’ll investigate the date_opened column and clean it up so that all the dates are in the same format.
# Check out the accounts data frame
head(accounts)
# Define the date formats
formats <- c("%Y-%m-%d", "%B %d, %Y")
# Convert dates to the same format
accounts %>%
mutate(date_opened_clean = parse_date_time(date_opened, orders = formats))
Now that the date_opened dates are in the same format, you’ll be able to use them for some plotting in the next exercise.
Now that your dates are in order, you’ll need to correct any unit differences. When you first plot the data, you’ll notice that there’s a group of very high values, and a group of relatively lower values. The bank has two different offices - one in New York, and one in Tokyo, so you suspect that the accounts managed by the Tokyo office are in Japanese yen instead of U.S. dollars. Luckily, you have a data frame called account_offices that indicates which office manages each customer’s account, so you can use this information to figure out which totals need to be converted from yen to dollars.
The formula to convert yen to dollars is USD = JPY / 104.
# Scatter plot of opening date and total amount
accounts %>%
ggplot(aes(x = date_opened, y = total)) +
geom_point()
# Left join accounts and account_offices by id
accounts %>%
left_join(account_offices, by="id") %>%
# Convert totals from the Tokyo office to USD
mutate(total_usd = ifelse(office == "Tokyo", total / 104, total)) %>%
# Scatter plot of opening date vs total_usd
ggplot(aes(x = date_opened, y = total_usd)) +
geom_point()
The points in your last scatter plot all fall within a much smaller range now and you’ll be able to accurately assess the differences between accounts from different countries.
Link: https://www.buzzfeednews.com/article/katienotopoulos/graphs-that-lied-to-us
What to do with nonsense data. Source: DataCamp
(Note: to impute means to assign a value by inference from other available information, rather than by direct observation — for example, filling in a missing value using other values in the dataset.)
In this lesson, you’ll continue to work with the accounts data frame, but this time, you have a bit more information about each account. There are three different funds that account holders can store their money in. In this exercise, you’ll validate whether the total amount in each account is equal to the sum of the amount in fund_A, fund_B, and fund_C. If there are any accounts that don’t match up, you can look into them further to see what went wrong in the bookkeeping that led to inconsistencies.
# Find invalid totals
accounts %>%
# theoretical_total: sum of the three funds
mutate(theoretical_total = fund_A + fund_B + fund_C) %>%
# Find accounts where total doesn't match theoretical_total
filter(theoretical_total != total)
By using cross field validation, you’ve been able to detect values that don’t make sense. How you choose to handle these values will depend on the dataset.
Now that you found some inconsistencies in the total amounts, you’re suspicious that there may also be inconsistencies in the acct_age column, and you want to see if these inconsistencies are related. Using the skills you learned from the video exercise, you’ll need to validate the age of each account and see if rows with inconsistent acct_ages are the same ones that had inconsistent totals.
# Find invalid acct_age
accounts %>%
# theoretical_age: age of acct based on date_opened
mutate(theoretical_age = floor(as.numeric(date_opened %--% today(), "years"))) %>%
# Filter for rows where acct_age is different from theoretical_age
filter(theoretical_age != acct_age)
There are three accounts that all have ages off by one year, but none of them are the same as the accounts that had total inconsistencies, so it looks like these two bookkeeping errors may not be related.
What is missing data. Source: DataCamp
Types of missingness. Source: DataCamp
Dealing with missing data. Source: DataCamp
You just learned about the three flavors of missing data: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). In this exercise, you’ll solidify your new knowledge by mapping examples to the types of missingness.
Question types of missingness. Source: DataCamp
Dealing with missing data is one of the most common tasks in data science. There are a variety of types of missingness, as well as a variety of types of solutions to missing data.
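As a minimal, hypothetical sketch (the toy data frame and its columns are invented for illustration, not course data), here is one way missing-at-random (MAR) data can arise: missingness in inv_amount depends on the observed age column.
set.seed(1)
toy <- data.frame(age = sample(18:70, 100, replace = TRUE),
                  inv_amount = round(runif(100, 0, 5000)))
# MAR: whether inv_amount is missing depends on observed age, not on the value itself
toy$inv_amount[toy$age < 25] <- NA
mean(is.na(toy$inv_amount))  # proportion of missing values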
You just received a new version of the accounts data frame containing data on the amount held and amount invested for new and existing customers. However, there are rows with missing inv_amount values.
You know for a fact that most customers below 25 do not have investment accounts yet, and suspect it could be driving the missingness.
# Visualize the missing values by column
vis_miss(accounts)
accounts %>%
# missing_inv: Is inv_amount missing?
mutate(missing_inv = is.na(inv_amount)) %>%
# Group by missing_inv
group_by(missing_inv) %>%
# Calculate mean age for each missing_inv group
summarize(avg_age = mean(age, na.rm=TRUE))
Since the average age for missing_inv = TRUE is 22 and the average age for missing_inv = FALSE is 44, it is likely that the inv_amount variable is missing mostly in young customers.
# Sort by age and visualize missing vals
accounts %>%
arrange(age) %>%
vis_miss()
Investigating summary statistics based on missingness is a great way to determine if data is missing completely at random or missing at random.
In this exercise, you’re working with another version of the accounts data that contains missing values for both the cust_id and acct_amount columns.
You want to figure out how many unique customers the bank has, as well as the average amount held by customers. You know that rows with missing cust_id don’t really help you, and that on average, the acct_amount is usually 5 times the amount of inv_amount.
In this exercise, you will drop rows of accounts with missing cust_ids, and impute missing values of inv_amount with some domain knowledge.
We’ll need to install and load the assertive package.
# Create accounts_clean
accounts_clean <- accounts %>%
# Filter to remove rows with missing cust_id
filter(!is.na(cust_id)) %>%
# Add new col acct_amount_filled with replaced NAs
mutate(acct_amount_filled = ifelse(is.na(acct_amount), inv_amount * 5, acct_amount))
# Assert that cust_id has no missing vals
assert_all_are_not_na(accounts_clean$cust_id)
# or
sum(is.na(accounts_clean$cust_id))
# Assert that acct_amount_filled has no missing vals
assert_all_are_not_na(accounts_clean$acct_amount_filled)
Since your assertions passed, there’s no missing data left, and you can definitely bank on nailing your analysis!
Record linkage is a powerful technique used to merge multiple datasets together, used when values have typos or different spellings. In this chapter, you’ll learn how to link records by calculating the similarity between strings—you’ll then use your new skills to join two restaurant review datasets into one clean master dataset.
Minimum edit distance. Source: DataCamp
Edit distance = 1. Source: DataCamp
Edit distance = 4. Source: DataCamp
Types of edit distance. Source: DataCamp
Comparing strings to clean data. Source: DataCamp
In the video exercise, you saw how to use Damerau-Levenshtein distance to identify how similar two strings are. As a reminder, Damerau-Levenshtein distance is the minimum number of steps needed to get from String A to String B, using these operations: inserting a character, deleting a character, substituting one character for another, and transposing two adjacent characters.
Substituting and inserting is the best way to get from "puffin" to "muffins" (substitute "m" for "p", then insert "s"), for a distance of 2. In the next exercise, you’ll calculate string distances using R functions.
In the video exercise, you learned that there are multiple ways to calculate how similar or different two strings are. Now you’ll practice using the stringdist package to compute string distances using various methods. It’s important to be familiar with different methods, as some methods work better on certain datasets, while others work better on other datasets.
library(stringdist)
##
## Attaching package: 'stringdist'
## The following object is masked from 'package:tidyr':
##
## extract
# Calculate Damerau-Levenshtein distance
stringdist("las angelos", "los angeles", method = "dl")
## [1] 2
# Calculate LCS distance
stringdist("las angelos", "los angeles", method = "lcs")
## [1] 4
The Jaccard method: 1. Count the total number of distinct elements across the two strings (the union). 2. Count how many of those elements appear in both strings (the intersection). 3. Dividing the intersection by the union gives the Jaccard similarity; stringdist reports the Jaccard distance, which is 1 minus that similarity.
# Calculate Jaccard distance
stringdist("las angelos", "los angeles", method = "jaccard")
## [1] 0
As there are no elements in the first string that aren’t in the second string, and vice versa, the Jaccard method finds no difference between the two strings. In this way, the Jaccard method treats the strings more as sets of letters rather than sequences of letters.
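You can check the sets-of-letters view directly; a quick base R check (not part of the exercise):
a <- strsplit("las angelos", "")[[1]]
b <- strsplit("los angeles", "")[[1]]
setdiff(a, b)  # character(0): every character of the first string appears in the second
setdiff(b, a)  # character(0): and vice versa, so the Jaccard distance is 0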
In the next exercise, you’ll use Damerau-Levenshtein distance to map typo-ridden cities to their true spellings.
In this chapter, one of the datasets you’ll be working with, zagat, is a set of restaurants in New York, Los Angeles, Atlanta, San Francisco, and Las Vegas. The data is from Zagat, a company that collects restaurant reviews, and includes the restaurant names, addresses, phone numbers, as well as other restaurant information.
The city column contains the name of the city that the restaurant is located in. However, there are a number of typos throughout the column. Your task is to map each city to one of the five correctly-spelled cities contained in the cities data frame.
We’ll need to install and load the fuzzyjoin package.
library(fuzzyjoin)
# Count the number of each city variation
zagat %>%
count(city)
# Join zagat and cities and look at results
zagat %>%
# Left join based on stringdist using city and city_actual cols
stringdist_left_join(cities, by = c("city" = "city_actual")) %>%
# Select the name, city, and city_actual cols
select(name, city, city_actual)
Now that you’ve created consistent spelling for each city, it will be much easier to compute summary statistics by city.
When joins won’t work. Source: DataCamp
What is record linkage? Source: DataCamp
Similar to joins, record linkage is the act of linking data from different sources regarding the same entity. But unlike joins, record linkage does not require exact matches between different pairs of data, and instead can find close matches using string similarity. This is why record linkage is effective when there is no common unique key, such as a unique identifier, that you can rely on to link the data sources.
Link or join. Source: DataCamp
Don’t make things more complicated than they need to be: record linkage is a powerful tool, but it’s more complex than using a traditional join.
Zagat and Fodor’s are both companies that gather restaurant reviews. The zagat and fodors datasets both contain information about various restaurants, including addresses, phone numbers, and cuisine types. Some restaurants appear in both datasets, but don’t necessarily have the same exact name or phone number written down. In this chapter, you’ll work towards figuring out which restaurants appear in both datasets.
The first step towards this goal is to generate pairs of records so that you can compare them. In this exercise, you’ll first generate all possible pairs, and then use your newly-cleaned city column as a blocking variable.
# Load reclin
library(reclin)
# Generate all possible pairs
pair_blocking(zagat, fodors)
# Generate all possible pairs
pair_blocking(zagat, fodors, blocking_var = "city")
By using city as a blocking variable, you were able to reduce the number of pairs you’ll need to compare from 165,230 pairs to 40,532.
Now that you’ve generated the pairs of restaurants, it’s time to compare them. You can easily customize how you perform your comparisons using the by and default_comparator arguments. There’s no right answer as to what each should be set to, so in this exercise, you’ll try a couple options out.
# Generate pairs
pair_blocking(zagat, fodors, blocking_var = "city") %>%
# Compare pairs by name using lcs()
compare_pairs(by ="name",
default_comparator = lcs())
# Generate pairs
pair_blocking(zagat, fodors, blocking_var = "city") %>%
# Compare pairs by name using lcs()
compare_pairs(by = c("name", "phone", "addr"),
default_comparator = jaro_winkler())
Choosing a comparator and the columns to compare is highly dataset-dependent, so it’s best to try out different combinations to see which works best on the dataset you’re working with. Next, you’ll build on your string comparison skills and learn about record linkage!
Record linkage requires a number of steps that can be difficult to keep straight. In this exercise, you’ll solidify your knowledge of the record linkage process so that it’s a breeze when you code it yourself!
Question order for record linkage. Source: DataCamp
During this chapter, you’ve cleaned up the city column of zagat using string similarity, as well as generated and compared pairs of restaurants from zagat and fodors. The end is near - all that’s left to do is score and select pairs and link the data together, and you’ll be able to begin your analysis in no time!
# Create pairs
pair_blocking(zagat, fodors, blocking_var = "city") %>%
# Compare pairs
compare_pairs(by = "name", default_comparator = jaro_winkler()) %>%
# Score pairs
score_problink() %>%
# Select pairs
select_n_to_m() %>%
# Link data
link()
Now that your two datasets are merged, you can use the data to figure out if there are certain characteristics that make a restaurant more likely to be reviewed by Zagat or Fodor’s.
Module 5, chapter 1 summary. Source: DataCamp
Module 5, chapter 2 summary. Source: DataCamp
Module 5, chapter 3 summary. Source: DataCamp
Module 5, chapter 4 summary. Source: DataCamp
Module 5 - further courses. Source: DataCamp
The ability to produce meaningful and beautiful data visualizations is an essential part of your skill set as a data scientist. This course, the first R data visualization tutorial in the series, introduces you to the principles of good visualizations and the grammar of graphics plotting concepts implemented in the ggplot2 package. ggplot2 has become the go-to tool for flexible and professional plots in R. Here, we’ll examine the first three essential layers for making a plot - Data, Aesthetics and Geometries. By the end of the course you will be able to make complex exploratory plots.
In this chapter we’ll get you into the right frame of mind for developing meaningful visualizations with R. You’ll understand that as a communications tool, visualizations require you to think about your audience first. You’ll also be introduced to the basics of ggplot2 - the 7 different grammatical elements (layers) and aesthetic mappings.
Data viz is rooted in statistics and graphical data analysis, but it’s also a creative process that involves some amount of trial and error.
In this video we made the distinction between plots for exploring and plots for explaining data. Exploratory plots are typically meant for a specialist audience, data-heavy, rough first drafts and part of our data science toolkit as graphical data analysis. They are not typically pretty!
You’re not concerned with beautiful at this point. However, the plots should be meaningful and conform to best practices so that you do not mislead yourself!
To get a first feel for ggplot2, let’s try to run some basic ggplot2 commands. The mtcars dataset contains information on 32 cars from a 1973 issue of Motor Trend magazine. This dataset is small, intuitive, and contains a variety of continuous and categorical variables.
# ggplot2 package loaded earlier
# Explore the mtcars data frame with str()
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
# Execute the following command
ggplot(mtcars, aes(cyl, mpg)) +
geom_point()
Notice that ggplot2 treats cyl as a continuous variable. You get a plot, but it’s not quite right, because it gives the impression that there is such a thing as a 5 or 7-cylinder car, which there is not.
The plot from the previous exercise wasn’t really satisfying. Although cyl (the number of cylinders) is categorical, you probably noticed that it is classified as numeric in mtcars. This is really misleading because the representation in the plot doesn’t match the actual data type. You’ll have to explicitly tell ggplot2 that cyl is a categorical variable.
# Change the command below so that cyl is treated as factor
ggplot(mtcars, aes(factor(cyl), mpg)) +
geom_point()
Notice that ggplot2 treats cyl as a factor. This time the x-axis does not contain values like 5 or 7, only the values that are present in the dataset.
Let’s dive a little deeper into the three main topics in this course: The data, aesthetics, and geom layers. We’ll get to making pretty plots in the last chapter with the themes layer.
We’ll continue working on the 32 cars in the mtcars data frame.
Consider how the examples and concepts we discuss throughout these courses apply to your own data-sets!
# Edit to add a color aesthetic mapped to disp
ggplot(mtcars, aes(wt, mpg, color=disp)) +
geom_point()
# Change the color aesthetic to a size aesthetic
ggplot(mtcars, aes(wt, mpg, size = disp)) +
geom_point()
Notice that a legend for the color and size scales was automatically generated.
In the previous exercise you saw that disp can be mapped onto a color gradient or onto a continuous size scale.
Another argument of aes() is the shape of the points. There are a finite number of shapes which ggplot() can automatically assign to the points. However, if you try this command:
ggplot(mtcars, aes(wt, mpg, shape = disp)) +
geom_point()
## Error: A continuous variable can not be mapped to shape
it gives an error.
The error message ‘A continuous variable can not be mapped to shape’ means that shape only works as a discrete scale: a continuous variable like disp cannot be mapped to it.
The diamonds dataset contains details of nearly 54,000 diamonds. Among the variables included are carat (a measurement of the diamond’s size) and price.
You’ll use two common geom layer functions:
geom_point() adds points (as in a scatter plot).
geom_smooth() adds a smooth trend curve.
As you saw previously, these are added using the + operator.
ggplot(data, aes(x, y)) +
geom_*()
Where * is the specific geometry needed.
# Explore the diamonds data frame with str()
str(diamonds)
## Classes 'tbl_df', 'tbl' and 'data.frame': 53940 obs. of 10 variables:
## $ carat : num 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## $ depth : num 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
# Add geom_smooth() with +
ggplot(diamonds, aes(carat, price)) +
geom_point() +
geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
If you had executed the command without adding a +, it would produce an error message ‘No layers in plot’ because you are missing the third essential layer - the geom layer.
If you have multiple geoms, then mapping an aesthetic to a data variable inside the call to ggplot() will change all the geoms. It is also possible to make changes to individual geoms by passing arguments to the individual geom_*() functions.
For example, geom_point() has an argument alpha that controls the opacity of the points. A value of 1 (the default) means that the points are totally opaque; a value of 0 means the points are totally transparent (and therefore invisible). Values in between specify partial transparency.
We’ll amend our previous plot.
# Make the points 40% opaque
ggplot(diamonds, aes(carat, price, color = clarity)) +
geom_point(alpha=0.4) +
geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
geom_point() + geom_smooth() is a common combination.
Plots can be saved as variables, which can be added to later on using the + operator. This is really useful if you want to make multiple related plots from a common base.
# Draw a ggplot
plt_price_vs_carat <- ggplot(
# Use the diamonds dataset
diamonds,
# For the aesthetics, map x to carat and y to price
aes(x=carat, y=price)
)
# Add a point layer to plt_price_vs_carat
plt_price_vs_carat + geom_point()
# Edit this to make points 20% opaque: plt_price_vs_carat_transparent
plt_price_vs_carat_transparent <- plt_price_vs_carat + geom_point(alpha=0.2)
# See the plot
plt_price_vs_carat_transparent
# Edit this to map color to clarity,
# Assign the updated plot to a new object
plt_price_vs_carat_by_clarity <- plt_price_vs_carat + geom_point(aes(color=clarity))
# See the plot
plt_price_vs_carat_by_clarity
By assigning parts of plots to a variable then reusing that variable in other plots, it makes it really clear how much those plots have in common.
Aesthetic mappings are the cornerstone of the grammar of graphics plotting concept. This is where the magic happens - converting continuous and categorical data into visual scales that provide access to a large amount of information in a very short time. In this chapter you’ll understand how to choose the best aesthetic mappings for your data.
Species, a data frame column, is mapped onto color, a visible aesthetic. Each mapped variable is its own column in the data frame.
In general, try to keep your data and aesthetics layers in the same ggplot() definition.
In the video you saw 9 visible aesthetics. Let’s apply them to a categorical variable — the cylinders in mtcars, cyl.
These are the aesthetics you can consider within aes() in this chapter: x, y, color, fill, size, alpha, labels and shape.
One common convention is that you don’t name the x and y arguments to aes(), since they almost always come first, but you do name other arguments.
In the following exercise the fcyl column is categorical. It is cyl transformed into a factor.
# Create fcyl, whilst preserving row names of mtcars (dplyr doesn't preserve row names with functions like mutate and filter)
library(tibble)
## Warning: package 'tibble' was built under R version 3.5.3
##
## Attaching package: 'tibble'
## The following object is masked from 'package:assertive':
##
## has_rownames
mtcars <- mtcars %>%
rownames_to_column('carnames') %>%
mutate(fcyl = as.factor(cyl)) %>%
column_to_rownames('carnames')
# Map x to mpg and y to fcyl
ggplot(mtcars, aes(mpg, fcyl)) +
geom_point()
# Swap mpg and fcyl
ggplot(mtcars, aes(fcyl, mpg)) +
geom_point()
# Map x to wt, y to mpg and color to fcyl
ggplot(mtcars, aes(wt, mpg, color=fcyl)) +
geom_point()
ggplot(mtcars, aes(wt, mpg, color = fcyl)) +
# Set the shape and size of the points
geom_point(shape=1, size=4)
Typically, the color aesthetic changes the outline of a geom and the fill aesthetic changes the inside. geom_point() is an exception: you use color (not fill) for the point color. However, some shapes have special behavior.
By default, geom_point() uses shape = 19: a solid circle. An alternative is shape = 21: a circle that allows you to use both fill for the inside and color for the outline. This lets you map two aesthetics to each point.
All shape values are described on the points() help page.
fcyl and fam are the cyl and am columns converted to factors, respectively.
# Create fam, whilst preserving row names of mtcars
mtcars <- mtcars %>%
rownames_to_column('carnames') %>%
mutate(fam = as.factor(am)) %>%
column_to_rownames('carnames')
# Map fcyl to fill
ggplot(mtcars, aes(wt, mpg, fill = fcyl)) +
geom_point(shape = 1, size = 4)
ggplot(mtcars, aes(wt, mpg, fill = fcyl)) +
# Change point shape; set alpha
geom_point(shape = 21, size = 4, alpha = 0.6)
# Map color to fam
ggplot(mtcars, aes(wt, mpg, fill = fcyl, color = fam)) +
geom_point(shape = 21, size = 4, alpha = 0.6)
Notice that mapping a categorical variable onto fill doesn’t change the colors, although a legend is generated! This is because the default shape for points only has a color attribute and not a fill attribute! Use fill when you have another shape (such as a bar), or when using a point that does have a fill and a color attribute, such as shape = 21, which is a circle with an outline. Any time you use a solid color, make sure to use alpha blending to account for overplotting.
Now that you’ve got some practice with using attributes, be careful of a major pitfall: these attributes can overwrite the aesthetics of your plot!
# Establish the base layer
plt_mpg_vs_wt <- ggplot(mtcars, aes(wt, mpg))
# Map fcyl to size
plt_mpg_vs_wt +
geom_point(aes(size = fcyl))
## Warning: Using size for a discrete variable is not advised.
# Map fcyl to alpha, not size
plt_mpg_vs_wt +
geom_point(aes(alpha = fcyl))
## Warning: Using alpha for a discrete variable is not advised.
# Map fcyl to shape, not alpha
plt_mpg_vs_wt +
geom_point(aes(shape = fcyl))
# Use text layer and map fcyl to label
plt_mpg_vs_wt +
geom_text(aes(label = fcyl))
Which aesthetic do you think is the clearest for categorical data?
Many of the aesthetics can accept either continuous or categorical variables, but some are restricted to categorical data. For example, label and shape are only applicable to categorical data.
We usually use the word “aesthetics” to describe how something looks, but in ggplot2 the word refers to aesthetic mappings; we use the word “attributes” to describe how something looks. Note that, confusingly, all of our visible aesthetics also exist as attributes. Attributes are always set in the geom layer (see next chapter). For example,
geom_*(color = "red") sets the color attribute to “red”. The same goes for size and shape.
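A minimal side-by-side sketch (not from the exercise) of the difference:
# Aesthetic mapping: color varies with the data, so a legend is drawn
ggplot(mtcars, aes(wt, mpg)) +
geom_point(aes(color = fcyl))
# Attribute: every point gets the same fixed color, and no legend is drawn
ggplot(mtcars, aes(wt, mpg)) +
geom_point(color = "red")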
This time you’ll use these arguments to set attributes of the plot, not map variables onto aesthetics.
You can specify colors in R using hex codes: a hash followed by two hexadecimal numbers each for red, green, and blue ("#RRGGBB"). Hexadecimal is base-16 counting. You have 0 to 9, and A representing 10 up to F representing 15. Pairs of hexadecimal numbers give you a range from 0 to 255. "#000000" is “black” (no color), "#FFFFFF" means “white”, and "#00FFFF" is cyan (mixed green and blue).
We’ll define a hexadecimal color variable my_blue.
# A hexadecimal color
my_blue <- "#4ABEFF"
ggplot(mtcars, aes(wt, mpg)) +
# Set the point color and alpha
geom_point(color = my_blue, alpha = 0.6)
# Change the color mapping to a fill mapping
ggplot(mtcars, aes(wt, mpg, color = fcyl)) +
# Set point size and shape
geom_point(size = 10, shape = 1)
becomes…
# Change the color mapping to a fill mapping
ggplot(mtcars, aes(wt, mpg, fill = fcyl)) +
# Set point size and shape
geom_point(color = my_blue, size = 10, shape = 1)
ggplot2 lets you control these attributes in many ways to customize your plots.
In the videos you saw that you can use all the aesthetics as attributes. Let’s see how this works with the aesthetics you used in the previous exercises: x, y, color, fill, size, alpha, label and shape.
In this exercise you will set all kinds of attributes of the points!
You will continue to work with mtcars.
ggplot(mtcars, aes(wt, mpg, color = fcyl)) +
# Add point layer with alpha 0.5
geom_point(alpha = 0.5)
ggplot(mtcars, aes(wt, mpg, color = fcyl)) +
# Add text layer with label rownames(mtcars) and color red
geom_text(aes(label = rownames(mtcars)), color = "red")
although DataCamp has the answer as
ggplot(mtcars, aes(wt, mpg, color = fcyl)) +
# Add text layer with label rownames(mtcars) and color red
geom_text(label = rownames(mtcars), color = "red")
ggplot(mtcars, aes(wt, mpg, color = fcyl)) +
# Add points layer with shape 24 and color yellow
geom_point(shape = 24, color = "yellow")
In this exercise, you will gradually add more aesthetics layers to the plot. You’re still working with the mtcars dataset, but this time you’re using more features of the cars. Each of the columns is described on the mtcars help page.
Notice that adding more aesthetic mappings to your plot is not always a good idea! You may just increase complexity and decrease readability.
# 3 aesthetics: qsec vs. mpg, colored by fcyl
ggplot(mtcars, aes(mpg, qsec, color = fcyl)) +
geom_point()
# 4 aesthetics: add a mapping of shape to fam
ggplot(mtcars, aes(mpg, qsec, color = fcyl, shape = fam)) +
geom_point()
# 5 aesthetics: add a mapping of size to hp / wt
ggplot(mtcars, aes(mpg, qsec, color = fcyl, shape = fam, size = hp/wt)) +
geom_point()
Between the x and y dimensions, the color, shape, and size of the points, your plot displays five dimensions of the dataset!
Positions: How to adjust for overlapping on a single layer.
By default, geoms plot values exactly where they occur in the data (position = "identity"). However, if your data is measured to the nearest, say, mm, many points might overlap. We might want to add some random noise on both axes to account for this (jittering), using width (e.g. 0.1), which says how much random noise should be added, and seed, which sets the starting number used to generate the sequence of random numbers.
Each of these aesthetics is a “scale” that we map data onto. So color is just a scale, like x and y are scales. We can access all these scales using the scale_ functions:
scale_x_*()
scale_y_*()
scale_color_*() (or scale_colour_*())
scale_fill_*()
scale_shape_*()
scale_linetype_*()
scale_size_*()
What does the * mean? It is filled in with the type of scale being defined, for example scale_x_continuous() or scale_color_discrete().
In this exercise, you’ll modify some aesthetics to make a bar plot of the number of cylinders for cars with different types of transmission.
You’ll also make use of some functions for improving the appearance of the plot.
labs() sets the x- and y-axis labels. It takes strings for each argument.
scale_fill_manual() defines properties of the fill color scale (i.e. axis). The first argument sets the legend title. values is a named vector of colors to use.
ggplot(mtcars, aes(fcyl, fill = fam)) +
geom_bar() +
# Set the axis labels
labs(x = "Number of Cylinders",
y = "Count")
palette <- c("#377EB8", "#E41A1C")
famlabs <- c("Automatic", "Manual")
ggplot(mtcars, aes(fcyl, fill = fam)) +
geom_bar() +
labs(x = "Number of Cylinders", y = "Count") +
# Set the fill color scale
scale_fill_manual("Transmission", values = palette, labels = famlabs)
# Set the position
ggplot(mtcars, aes(fcyl, fill = fam)) +
geom_bar(position = "dodge") +
labs(x = "Number of Cylinders", y = "Count") +
# Set the fill color scale
scale_fill_manual("Transmission", values = palette, labels = famlabs)
Choosing the right position argument is an important part of making a good plot.
In the last chapter you saw that all the visible aesthetics can serve as attributes and aesthetics, but I very conveniently left out x and y. That’s because although you can make univariate plots (such as histograms, which you’ll get to in the next chapter), a y-axis will always be provided, even if you didn’t ask for it.
You can make univariate plots in ggplot2, but you will need to add a fake y axis by mapping y to zero.
When setting y-axis limits, you can specify the limits as separate arguments, or as a single numeric vector. That is, ylim(lo, hi) or ylim(c(lo, hi)).
# Plot 0 vs. mpg
ggplot(mtcars, aes(mpg, 0)) +
# Add jitter
geom_point(position = "jitter")
ggplot(mtcars, aes(mpg, 0)) +
geom_jitter() +
# Set the y-axis limits
ylim(-2, 2)
The best way to make your plot depends on a lot of different factors and sometimes ggplot2 might not be the best choice.
Incorrect aesthetic mapping causes confusion or misleads the audience.
Typically, the dependent variable is mapped onto the y-axis and the independent variable is mapped onto the x-axis.
In the ToothGrowth data set, we have three variables:
| Variable | Description |
|---|---|
| len | Tooth length |
| supp | Supplement type (VC or OJ) |
| dose | Dose in milligrams/day |
From the six possible ways to map three variables, one solution is shown in the viewer. However, x = supp, y = len, color = dose would be better.
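As a minimal sketch of that preferred mapping (converting dose to a factor is an addition here, since dose takes only three distinct values):
ggplot(ToothGrowth, aes(x = supp, y = len, color = factor(dose))) +
geom_jitter(width = 0.1)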
A plot’s geometry dictates what visual elements will be used. In this chapter, we’ll familiarize you with the geometries used in the three most common plot types you’ll encounter - scatter plots, bar charts and line plots. We’ll look at a variety of different ways to construct these plots.
There are nearly fifty different geometries to choose from, though there are some redundancies. Source: DataCamp
Each geom is associated with specific aesthetic mappings. Some of these are essential, and some are optional.
For example, for geom_point(), x and y are essential, whereas alpha, color, fill, shape, size, and stroke are optional attribute settings.
Shape attribute values. Source: DataCamp
iris %>%
group_by(Species) %>%
summarise_all(mean) -> iris.summary
Scatter plots (using geom_point()) are intuitive, easily understood, and very common, but we must always consider overplotting, particularly in the following four situations: large datasets, aligned values on a single (usually categorical) axis, low-precision data, and integer (or factor) data. Each case is dealt with in turn in the exercises below.
Typically, alpha blending (i.e. adding transparency) is recommended when using solid shapes. Alternatively, you can use opaque but hollow shapes.
Small points are suitable for large datasets with regions of high density (lots of overlapping).
Let’s use the diamonds dataset to practice dealing with the large dataset case.
Recall earlier we created the base plot plt_price_vs_carat_by_clarity. We will redefine it slightly, by moving the color aesthetic mapping (of clarity) inside ggplot().
# Plot price vs. carat, colored by clarity
plt_price_vs_carat_by_clarity <- ggplot(diamonds, aes(carat, price, color = clarity))
# Add a point layer with tiny points
plt_price_vs_carat_by_clarity + geom_point(alpha = 0.5, shape =".")
# Change shape
plt_price_vs_carat_by_clarity + geom_point(alpha = 0.5, shape = 16)
Let’s take a look at another case where we should be aware of overplotting: Aligning values on a single axis.
This occurs when one axis is continuous and the other is categorical, which can be overcome with some form of jittering.
# Plot base
plt_mpg_vs_fcyl_by_fam <- ggplot(mtcars, aes(fcyl, mpg, color = fam))
# Without jittering
plt_mpg_vs_fcyl_by_fam + geom_point()
# Alter the point positions by jittering, width 0.3
plt_mpg_vs_fcyl_by_fam + geom_point(position = position_jitter(0.3))
# Now jitter and dodge the point positions
plt_mpg_vs_fcyl_by_fam + geom_point(position = position_jitterdodge(jitter.width = 0.3, dodge.width = 0.3))
Note: Dodging preserves the vertical position of a geom while adjusting the horizontal position.
These are some simple ways of dealing with overplotting, but you’ll encounter more ideas throughout the ggplot2 courses when we encounter atypical geoms.
You already saw how to deal with overplotting when using geom_point() in two cases:
We used position = 'jitter' inside geom_point() or geom_jitter().
Let’s take a look at another case:
This results from low-resolution measurements like in the iris dataset, which is measured to 1mm precision (see viewer). It’s similar to case 2, but in this case we can jitter on both the x and y axis.
ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
# Swap for jitter layer with width 0.1
geom_jitter(alpha = 0.5, width = 0.1)
ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
# Set the position to jitter
geom_point(alpha = 0.5, position = "jitter")
ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
# Use a jitter position function with width 0.1
geom_point(alpha = 0.5, position = position_jitter(width = 0.1))
Notice that jitter can be a geom itself (i.e. geom_jitter()), an argument in geom_point() (i.e. position = "jitter"), or a position function, (i.e. position_jitter()).
Let’s take a look at the last case of dealing with overplotting:
This can happen with integer variables (i.e. 1, 2, 3, …) or categorical variables (i.e. class factor); a factor is just a special class built on top of type integer.
You’ll typically have a small, defined number of intersections between two variables, which is similar to case 3, but you may miss it if you don’t realize that integer and factor data are the same as low precision data.
The Vocab dataset provided contains the years of education and vocabulary test scores from respondents to US General Social Surveys from 1972-2004.
# Read in Vocab, found here: https://raw.githubusercontent.com/anilak1978/r-bridge-week-2-assignment/master/Vocab.csv
Vocab <- read.table("Vocab.csv", sep=",", header = TRUE)
Vocab <- Vocab %>%
mutate(year = as.numeric(year), education = as.factor(education), vocabulary = as.factor(vocabulary)) %>%
column_to_rownames('X')
# Examine the structure of Vocab
str(Vocab)
## 'data.frame': 30351 obs. of 4 variables:
## $ year : num 1974 1974 1974 1974 1974 ...
## $ sex : Factor w/ 2 levels "Female","Male": 2 2 1 1 1 2 2 2 1 1 ...
## $ education : Factor w/ 21 levels "0","1","2","3",..: 15 17 11 11 13 17 18 11 13 12 ...
## $ vocabulary: Factor w/ 11 levels "0","1","2","3",..: 10 10 10 6 9 9 10 6 4 6 ...
# Plot vocabulary vs. education
ggplot(Vocab, aes(education, vocabulary)) +
# Add a point layer
geom_point()
ggplot(Vocab, aes(education, vocabulary)) +
# Change to a jitter layer
geom_jitter()
ggplot(Vocab, aes(education, vocabulary)) +
# Set the transparency to 0.2
geom_jitter(alpha = 0.2)
ggplot(Vocab, aes(education, vocabulary)) +
# Set the shape to 1
geom_jitter(alpha = 0.2, shape = 1)
Notice how jittering and alpha blending serve as a great solution to the overplotting problem here. Setting the shape to 1 didn’t really help, but it was useful in the previous exercises when you had less data. You need to consider each plot individually. You’ll encounter this dataset again when you look at bar plots.
Recall that histograms cut up a continuous variable into discrete bins and, by default, maps the internally calculated count variable (the number of observations in each bin) onto the y aesthetic. An internal variable called density can be accessed by using the .. notation, i.e. ..density... Plotting this variable will show the relative frequency, which is the height times the width of each bin.
# Plot mpg
ggplot(mtcars, aes(mpg)) +
# Add a histogram layer
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(mtcars, aes(mpg)) +
# Set the binwidth to 1
geom_histogram(binwidth = 1)
# Map y to ..density..
ggplot(mtcars, aes(mpg, ..density..)) +
geom_histogram(binwidth = 1)
datacamp_light_blue <- "#51A8C9"
ggplot(mtcars, aes(mpg, ..density..)) +
# Set the fill color to datacamp_light_blue
geom_histogram(binwidth = 1, fill = datacamp_light_blue)
Histograms are one of the most common exploratory plots for continuous data. If you want to use density on the y-axis be sure to set your binwidth to an intuitive value.
Here, we’ll examine the various ways of applying positions to histograms. geom_histogram(), a special case of geom_bar(), has a position argument that can take on the following values:
stack (the default): Bars for different groups are stacked on top of each other.
dodge: Bars for different groups are placed side by side.
fill: Bars for different groups are shown as proportions.
identity: Plot the values as they appear in the dataset.
For this example, you'll use the mtcars dataset.
# Update the aesthetics so the fill color is by fam
ggplot(mtcars, aes(mpg, fill = fam)) +
geom_histogram(binwidth = 1)
ggplot(mtcars, aes(mpg, fill = fam)) +
# Change the position to dodge
geom_histogram(binwidth = 1, position = "dodge")
ggplot(mtcars, aes(mpg, fill = fam)) +
# Change the position to fill
geom_histogram(binwidth = 1, position = "fill")
## Warning: Removed 16 rows containing missing values (geom_bar).
ggplot(mtcars, aes(mpg, fill = fam)) +
# Change the position to identity, with transparency 0.4
geom_histogram(binwidth = 1, position = "identity", alpha = 0.4)
This nicely demonstrates the difference between "stack" (the default) and "identity".
Now proceed with line plots!
Let’s see how the position argument changes geom_bar().
We have three position options:
stack: The default.
dodge: Preferred.
fill: To show proportions.
While we will be using geom_bar() here, note that geom_col() is just geom_bar() with the stat argument set to "identity". It is used when we want the heights of the bars to represent the exact values in the data; a short sketch follows below.
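As a quick illustration of geom_col() (a minimal sketch with pre-tabulated counts, separate from the exercise):
# geom_col() maps a value column directly to bar height, i.e. geom_bar(stat = "identity")
cyl_counts <- as.data.frame(table(mtcars$cyl))  # columns: Var1 (cyl), Freq (count)
ggplot(cyl_counts, aes(Var1, Freq)) +
  geom_col()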
In this exercise, you’ll draw the total count of cars having a given number of cylinders (fcyl), according to manual or automatic transmission type (fam).
# Plot fcyl, filled by fam
ggplot(mtcars, aes(fcyl, fill=fam)) +
# Add a bar layer
geom_bar()
ggplot(mtcars, aes(fcyl, fill = fam)) +
# Set the position to "fill"
geom_bar(position = "fill")
ggplot(mtcars, aes(fcyl, fill = fam)) +
# Change the position to "dodge"
geom_bar(position = "dodge")
Different kinds of plots need different position arguments, so it’s important to be familiar with this attribute.
You can customize bar plots further by adjusting the dodging so that your bars partially overlap each other. Instead of using position = "dodge", you're going to use position_dodge(), like you did with position_jitter() in the previous exercises. Here, you'll save this as an object, posn_d, so that you can easily reuse it.
Remember, the reason you want to use position_dodge() (and position_jitter()) is to specify how much dodging (or jittering) you want.
We start with (from before)…
ggplot(mtcars, aes(cyl, fill = fam)) +
# Use the string form of dodging
geom_bar(position = "dodge")
…then move to…
ggplot(mtcars, aes(cyl, fill = fam)) +
# Change position to use the functional form, with width 0.2
geom_bar(position = position_dodge(width = 0.2))
…finishing with…
ggplot(mtcars, aes(cyl, fill = fam)) +
# Set the transparency to 0.6
geom_bar(position = position_dodge(width = 0.2), alpha = 0.6)
By using these position functions, you can customize your plot to suit your needs.
In this bar plot, we’ll fill each segment according to an ordinal variable. The best way to do that is with a sequential color palette.
Here's an example of using a ColorBrewer palette with the mtcars dataset (note that "Set1" is actually a qualitative palette, used here just to show the syntax):
ggplot(mtcars, aes(fcyl, fill = fam)) +
geom_bar() +
scale_fill_brewer(palette = "Set1")
In the exercise, you'll use similar code on the Vocab dataset, where the fill variable, vocabulary, is ordinal.
# Plot education, filled by vocabulary
ggplot(Vocab, aes(education, fill = vocabulary))
# Plot education, filled by vocabulary
ggplot(Vocab, aes(education, fill = vocabulary)) +
# Add a bar layer with position "fill"
geom_bar(position = "fill")
# Plot education, filled by vocabulary
ggplot(Vocab, aes(education, fill = vocabulary)) +
# Add a bar layer with position "fill"
geom_bar(position = "fill") +
# Add a brewer fill scale with the Set1 palette
scale_fill_brewer(palette = "Set1")
## Warning in RColorBrewer::brewer.pal(n, pal): n too large, allowed maximum for palette Set1 is 9
## Returning the palette you asked for with that many colors
The plot is not complete! Let’s fix this in the next exercise.
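The fix itself isn't shown in this excerpt. One common workaround (an assumption here, not necessarily the solution the course uses) is a discrete viridis scale, which covers all 11 vocabulary levels without hitting the ColorBrewer limit:
# scale_fill_viridis_d() supplies an ordered palette with enough colors for every level
ggplot(Vocab, aes(education, fill = vocabulary)) +
  geom_bar(position = "fill") +
  scale_fill_viridis_d()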
Here, we’ll use the economics dataset to make some line plots. The dataset contains a time series for unemployment and population statistics from the Federal Reserve Bank of St. Louis in the United States. The data is contained in the ggplot2 package.
To begin with, you can look at how the median unemployment time and the proportion of the population that is unemployed change over time.
# Print the head of economics
head(economics)
## # A tibble: 6 x 6
## date pce pop psavert uempmed unemploy
## <date> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1967-07-01 507. 198712 12.6 4.5 2944
## 2 1967-08-01 510. 198911 12.6 4.7 2945
## 3 1967-09-01 516. 199113 11.9 4.6 2958
## 4 1967-10-01 512. 199311 12.9 4.9 3143
## 5 1967-11-01 517. 199498 12.8 4.7 3066
## 6 1967-12-01 525. 199657 11.8 4.8 3018
# Using economics, plot unemploy vs. date
ggplot(economics, aes(date, unemploy)) +
# Make it a line plot
geom_line()
# Change the y-axis to the proportion of the population that is unemployed
ggplot(economics, aes(date, unemploy/pop)) +
geom_line()
In the next exercise, we’ll make more complicated line plots.
We already saw how the form of your data affects how you can plot it. Let’s explore that further with multiple time series. Here, it’s important that all lines are on the same scale, and if possible, on the same plot.
fish.species contains the global capture rates of seven salmon species from 1950–2010. Each variable (column) is a salmon species and each observation (row) is one year. fish.tidy contains the same data, but in three columns: Species, Year, and Capture (i.e. one variable per column).
The following code will not be run as the dataset is not available.
# Plot the Rainbow Salmon time series
ggplot(fish.species, aes(x = Year, y = Rainbow)) +
geom_line()
# Plot the Pink Salmon time series
ggplot(fish.species, aes(x = Year, y = Pink)) +
geom_line()
# Plot multiple time-series by grouping by species
ggplot(fish.tidy, aes(Year, Capture)) +
geom_line(aes(group = Species))
# Plot multiple time-series by coloring by species
ggplot(fish.tidy, aes(Year, Capture, color = Species)) +
geom_line(aes(group = Species))
As you can see in the last couple of plots, a grouping aesthetic was vital here. If you don't map Species to group or color, you'll get a mess of lines.
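Since fish.species and fish.tidy aren't available, here is a toy sketch (made-up numbers) of how a wide species-by-year table can be reshaped into the tidy Species/Year/Capture form:
# Toy wide table: one column per species
library(tidyr)
fish_wide <- data.frame(Year = c(1950, 1951), Pink = c(100, 120), Coho = c(80, 90))
# Pivot the species columns into Species/Capture pairs: one row per species-year
fish_long <- pivot_longer(fish_wide, cols = -Year, names_to = "Species", values_to = "Capture")
fish_long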
In this chapter, we’ll explore how understanding the structure of your data makes data visualization much easier. Plus, it’s time to make our plots pretty. This is the last step in the data viz process. The Themes layer will enable you to make publication quality plots directly in R. In the next course we’ll look at some extra layers to add more variables to your plots.
Let’s wrap up this course by making a publication-ready plot communicating a clear message.
To change stylistic elements of a plot, call theme() and set plot properties to a new value. For example, the following changes the legend position.
p + theme(legend.position = new_value)
Here, the new value can be
"top", "bottom", "left", or "right": place it at that side of the plot."none": don’t draw it.c(x, y), where c(0, 0) means the bottom-left and c(1, 1) means the top-right.Let’s revisit the recession period line plot (assigned to plt_prop_unemployed_over_time).
# View the default plot
plt_prop_unemployed_over_time
# Remove legend entirely
plt_prop_unemployed_over_time +
theme(legend.position = "none")
# Position the legend at the bottom of the plot
plt_prop_unemployed_over_time +
theme(legend.position = "bottom")
# Position the legend inside the plot at (0.6, 0.1)
plt_prop_unemployed_over_time +
theme(legend.position = c(0.6, 0.1))
But be careful when placing a legend inside your plotting space. You could end up obscuring data.
Many plot elements have multiple properties that can be set. For example, line elements in the plot such as axes and gridlines have a color, a thickness (size), and a line type (solid line, dashed, or dotted). To set the style of a line, you use element_line(). For example, to make the axis lines into red, dashed lines, you would use the following.
p + theme(axis.line = element_line(color = "red", linetype = "dashed"))
Similarly, element_rect() changes rectangles and element_text() changes text. You can remove a plot element using element_blank().
plt_prop_unemployed_over_time +
theme(
# For all rectangles, set the fill color to grey92
rect = element_rect(fill = "grey92"),
# For the legend key, turn off the outline
legend.key = element_rect(color = NA)
)
plt_prop_unemployed_over_time +
theme(
rect = element_rect(fill = "grey92"),
legend.key = element_rect(color = NA),
# Turn off axis ticks
axis.ticks = element_blank(),
# Turn off the panel grid
panel.grid = element_blank()
)
plt_prop_unemployed_over_time +
theme(
rect = element_rect(fill = "grey92"),
legend.key = element_rect(color = NA),
axis.ticks = element_blank(),
panel.grid = element_blank(),
panel.grid.major.y = element_line(
color = "white",
size = 0.5,
linetype = "dotted"
),
# Set the axis text color to grey25
axis.text = element_text(color = "grey25"),
# Set the plot title font face to italic and font size to 16
plot.title = element_text(size = 16, face = "italic")
)
This plot is ready for prime time – it’s pretty AND informative. Make sure that all your text is legible for the context in which it will be viewed.
Whitespace means all the non-visible margins and spacing in the plot.
To set a single whitespace value, use unit(x, unit), where x is the amount and unit is the unit of measure.
Margins require you to set 4 values, so use margin(top, right, bottom, left, unit). To remember the margin order, think TRouBLe.
The default unit is "pt" (points), which scales well with text. Other options include "cm", "in" (inches) and "lines" (of text).
plt_mpg_vs_wt_by_cyl is available. The panel and legend are wrapped in blue boxes so you can see how they change.
# View the original plot
plt_mpg_vs_wt_by_cyl
plt_mpg_vs_wt_by_cyl +
theme(
# Set the axis tick length to 2 lines
axis.ticks.length = unit(2, "lines")
)
plt_mpg_vs_wt_by_cyl +
theme(
# Set the legend key size to 3 centimeters
legend.key.size = unit(3, "cm")
)
plt_mpg_vs_wt_by_cyl +
theme(
# Set the legend margin to (20, 30, 40, 50) points
legend.margin = margin(20, 30, 40, 50, "pt")
)
plt_mpg_vs_wt_by_cyl +
theme(
# Set the plot margin to (10, 30, 50, 70) millimeters
plot.margin = margin(10, 30, 50, 70, "mm")
)
Changing the whitespace can be useful if you need to make your plot more compact, or if you want to create more space to reduce visual "busyness".
In addition to making your own themes, there are several out-of-the-box solutions that may save you lots of time.
theme_gray() is the default.
theme_bw() is useful when you use transparency.
theme_classic() is more traditional.
theme_void() removes everything but the data.
# Add a black and white theme
plt_prop_unemployed_over_time +
theme_bw()
# Add a classic theme
plt_prop_unemployed_over_time +
theme_classic()
# Add a void theme
plt_prop_unemployed_over_time +
theme_void()
The black and white theme works really well if you use transparency in your plot.
Outside of ggplot2, another source of built-in themes is the ggthemes package. Let’s explore some of the ready-made ggthemes themes.
# Use the fivethirtyeight theme
plt_prop_unemployed_over_time +
theme_fivethirtyeight()
# Use Tufte's theme
plt_prop_unemployed_over_time +
theme_tufte()
# Use the Wall Street Journal theme
plt_prop_unemployed_over_time +
theme_wsj()
ggthemes has over 20 themes for you to try.
Reusing a theme across many plots helps to provide a consistent style. You have several options for this.
You can add the theme to each plot as a layer, save it as an object and reuse it, or make it the default with theme_set(). A good strategy that you'll use here is to begin with a built-in theme, then modify it.
plt_prop_unemployed_over_time is available. The theme you made earlier is shown in the sample code.
# Save the theme as theme_recession
theme_recession <- theme(
rect = element_rect(fill = "grey92"),
legend.key = element_rect(color = NA),
axis.ticks = element_blank(),
panel.grid = element_blank(),
panel.grid.major.y = element_line(color = "white", size = 0.5, linetype = "dotted"),
axis.text = element_text(color = "grey25"),
plot.title = element_text(face = "italic", size = 16),
legend.position = c(0.6, 0.1)
)
# Combine the Tufte theme with theme_recession
theme_tufte_recession <- theme_tufte() + theme_recession
# Add the Tufte recession theme to the plot
plt_prop_unemployed_over_time + theme_tufte_recession
# Set theme_tufte_recession as the default theme
theme_set(theme_tufte_recession)
# Draw the plot (without explicitly adding a theme)
plt_prop_unemployed_over_time
We’ve seen many examples of beautiful, publication-quality plots. Let’s take a final look and put all the pieces together.
plt_prop_unemployed_over_time +
# Add Tufte's theme
theme_tufte()
plt_prop_unemployed_over_time +
theme_tufte() +
# Add individual theme elements
theme(
# Turn off the legend
legend.position = "none",
# Turn off the axis ticks
axis.ticks = element_blank()
)
plt_prop_unemployed_over_time +
theme_tufte() +
theme(
legend.position = "none",
axis.ticks = element_blank(),
# Set the axis title's text color to grey60
axis.title = element_text(color = "grey60"),
# Set the axis text's text color to grey60
axis.text = element_text(color = "grey60")
)
plt_prop_unemployed_over_time +
theme_tufte() +
theme(
legend.position = "none",
axis.ticks = element_blank(),
axis.title = element_text(color = "grey60"),
axis.text = element_text(color = "grey60"),
# Set the panel gridlines major y values
panel.grid.major.y = element_line(
# Set the color to grey60
color = "grey60",
# Set the size to 0.25
size = 0.25,
# Set the linetype to dotted
linetype = "dotted"
)
)
That will look great in a publication!
Let’s focus on producing beautiful and effective explanatory plots. In the next couple of exercises, you’ll create a plot that is similar to the one shown in the video using gm2007, a filtered subset of the gapminder dataset.
This type of plot will be in an info-viz style, meaning that it would be similar to something you’d see in a magazine or website for a mostly lay audience.
A scatterplot of lifeExp by country, colored by lifeExp, with points of size 4, is provided.
# Create datasets gm2007_full and gm2007
library(gapminder)
## Warning: package 'gapminder' was built under R version 3.5.3
gm2007_full <- gapminder %>%
filter(year == 2007)
gm2007 <- gm2007_full %>%
filter(country %in% c(
    "Swaziland", "Mozambique", "Zambia", "Sierra Leone", "Lesotho",
    "Angola", "Zimbabwe", "Afghanistan", "Central African Republic", "Liberia",
    "Canada", "France", "Israel", "Sweden", "Spain",
    "Australia", "Switzerland", "Iceland", "Hong Kong, China", "Japan"
  ))
# Add a geom_segment() layer
ggplot(gm2007, aes(x = lifeExp, y = country, color = lifeExp)) +
geom_point(size = 4) +
geom_segment(aes(xend = 30, yend = country), size = 2)
# Add a geom_text() layer
ggplot(gm2007, aes(x = lifeExp, y = country, color = lifeExp)) +
geom_point(size = 4) +
geom_segment(aes(xend = 30, yend = country), size = 2) +
geom_text(aes(label = lifeExp), color = "white", size = 1.5)
# Set the color scale
library(RColorBrewer)
## Warning: package 'RColorBrewer' was built under R version 3.5.2
palette <- brewer.pal(5, "RdYlBu")[-(2:4)]  # keep only the first and last colors (a red and a blue)
# Modify the scales
ggplot(gm2007, aes(x = lifeExp, y = country, color = lifeExp)) +
geom_point(size = 4) +
geom_segment(aes(xend = 30, yend = country), size = 2) +
geom_text(aes(label = round(lifeExp,1)), color = "white", size = 1.5) +
scale_x_continuous("", expand = c(0,0), limits = c(30, 90), position = "top") +
scale_color_gradientn(colors = palette)
# Add a title and caption
ggplot(gm2007, aes(x = lifeExp, y = country, color = lifeExp)) +
geom_point(size = 4) +
geom_segment(aes(xend = 30, yend = country), size = 2) +
geom_text(aes(label = round(lifeExp,1)), color = "white", size = 1.5) +
scale_x_continuous("", expand = c(0,0), limits = c(30,90), position = "top") +
scale_color_gradientn(colors = palette) +
labs(title = "Highest and lowest life expectancies, 2007", caption = "Source: gapminder")
Let’s continue adding to this plot in the next exercise.
In the previous exercise, we completed our basic plot. Now let’s polish it by playing with the theme and adding annotations. In this exercise, you’ll use annotate() to add text and a curve to the plot.
The following values have been calculated for you to assist with adding embellishments to the plot:
global_mean <- mean(gm2007_full$lifeExp)
x_start <- global_mean + 4
y_start <- 5.5
x_end <- global_mean
y_end <- 7.5
Let’s assign our previous plot to plt_country_vs_lifeExp.
# Add a title and caption
plt_country_vs_lifeExp <- ggplot(gm2007, aes(x = lifeExp, y = country, color = lifeExp)) +
geom_point(size = 4) +
geom_segment(aes(xend = 30, yend = country), size = 2) +
geom_text(aes(label = round(lifeExp,1)), color = "white", size = 1.5) +
scale_x_continuous("", expand = c(0,0), limits = c(30,90), position = "top") +
scale_color_gradientn(colors = palette) +
labs(title = "Highest and lowest life expectancies, 2007", caption = "Source: gapminder")
# Define the theme
plt_country_vs_lifeExp +
theme_classic() +
theme(axis.line.y = element_blank(),
axis.ticks.y = element_blank(),
axis.text = element_text(color = "black"),
axis.title = element_blank(),
legend.position = "none")
# Assign the theme to step_1_themes
step_1_themes <- theme_classic() +
theme(axis.line.y = element_blank(),
axis.ticks.y = element_blank(),
axis.text = element_text(color = "black"),
axis.title = element_blank(),
legend.position = "none")
# Add a vertical line
plt_country_vs_lifeExp +
step_1_themes +
geom_vline(xintercept = global_mean, color = "grey40", linetype = 3)
# Add text
plt_country_vs_lifeExp +
step_1_themes +
geom_vline(xintercept = global_mean, color = "grey40", linetype = 3) +
annotate(
"text",
x = x_start, y = y_start,
label = "The\nglobal\naverage",
vjust = 1, size = 3, color = "grey40"
)
# Assign annotation to step_3_annotation
step_3_annotation <- annotate(
"text",
x = x_start, y = y_start,
label = "The\nglobal\naverage",
vjust = 1, size = 3, color = "grey40"
)
# Add a curve
plt_country_vs_lifeExp +
step_1_themes +
geom_vline(xintercept = global_mean, color = "grey40", linetype = 3) +
step_3_annotation +
annotate(
"curve",
x = x_start, y = y_start,
xend = x_end, yend = y_end,
arrow = arrow(length = unit(0.2, "cm"), type = "closed"),
color = "grey40"
)
This ggplot2 course builds on your knowledge from the introductory course to produce meaningful explanatory plots. Statistics will be calculated on the fly and you’ll see how Coordinates and Facets aid in communication. You’ll also explore details of data visualization best practices with ggplot2 to help make sure you have a sound understanding of what works and why. By the end of the course, you’ll have all the tools needed to make a custom plotting function to explore a large data set, combining statistics and excellent visuals.
A picture paints a thousand words, which is why R's ggplot2 is such a powerful tool for graphical data analysis. In this chapter, you'll progress from simply plotting data to applying a variety of statistical methods. These include linear models, descriptive and inferential statistics (mean, standard deviation, and confidence intervals), and custom functions.