LOADING LIBRARIES

3.- Introduction to the Tidyverse

Data wrangling

Loading the gapminder and dplyr packages

Before you can work with the gapminder dataset, you’ll need to load two R packages that contain the tools for working with it, then display the gapminder dataset so that you can see what it contains.

To your right, you’ll see two windows inside which you can enter code: The script.R window, and the R Console. All of your code to solve each exercise must go inside script.R.

If you hit Submit Answer, your R script is executed and the output is shown in the R Console. DataCamp checks whether your submission is correct and gives you feedback. You can hit Submit Answer as often as you want. If you’re stuck, you can ask for a hint or a solution.

You can use the R Console interactively by simply typing R code and hitting Enter. When you work in the console directly, your code will not be checked for correctness so it is a great way to experiment and explore.

This course introduces a lot of new concepts, so if you ever need a quick refresher, download the tidyverse for beginners Cheat Sheet and keep it handy!

# Load the gapminder package
library(gapminder)

# Load the dplyr package
library(dplyr)

# Look at the gapminder dataset
gapminder
## # A tibble: 1,704 x 6
##    country     continent  year lifeExp      pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.
##  2 Afghanistan Asia       1957    30.3  9240934      821.
##  3 Afghanistan Asia       1962    32.0 10267083      853.
##  4 Afghanistan Asia       1967    34.0 11537966      836.
##  5 Afghanistan Asia       1972    36.1 13079460      740.
##  6 Afghanistan Asia       1977    38.4 14880372      786.
##  7 Afghanistan Asia       1982    39.9 12881816      978.
##  8 Afghanistan Asia       1987    40.8 13867957      852.
##  9 Afghanistan Asia       1992    41.7 16317921      649.
## 10 Afghanistan Asia       1997    41.8 22227415      635.
## # ... with 1,694 more rows

Filtering for one year

The filter verb extracts particular observations based on a condition. In this exercise you’ll filter for observations from a particular year.

EXERCISE:

Add a filter() line after the pipe (%>%) to extract only the observations from the year 1957. Remember that you use == to compare two values.
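
A related pitfall: a single = inside filter() is treated as a named argument and raises an error, so the comparison really does need ==. A minimal sketch (assuming gapminder and dplyr are loaded; not part of the graded solution):

# filter(gapminder, year = 1957)    # error: = is not a comparison
filter(gapminder, year == 1957)     # keeps only the rows where year equals 1957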

library(gapminder)
library(dplyr)

# Filter the gapminder dataset for the year 1957
gapminder %>% filter(year == 1957)
## # A tibble: 142 x 6
##    country     continent  year lifeExp      pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1957    30.3  9240934      821.
##  2 Albania     Europe     1957    59.3  1476505     1942.
##  3 Algeria     Africa     1957    45.7 10270856     3014.
##  4 Angola      Africa     1957    32.0  4561361     3828.
##  5 Argentina   Americas   1957    64.4 19610538     6857.
##  6 Australia   Oceania    1957    70.3  9712569    10950.
##  7 Austria     Europe     1957    67.5  6965860     8843.
##  8 Bahrain     Asia       1957    53.8   138655    11636.
##  9 Bangladesh  Asia       1957    39.3 51365468      662.
## 10 Belgium     Europe     1957    69.2  8989111     9715.
## # ... with 132 more rows

Filtering for one country and one year

You can also use the filter() verb to set two conditions, which could retrieve a single observation.

Just like in the last exercise, you can do this in two lines of code, starting with gapminder %>% and having the filter() on the second line. Keeping one verb on each line helps keep the code readable. Note that each time, you’ll put the pipe %>% at the end of the first line (like gapminder %>%); putting the pipe at the beginning of the second line will throw an error.
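
To make the pipe-placement rule concrete, here is a minimal sketch (assuming gapminder and dplyr are loaded); the second form is the one that errors:

# Correct: the pipe ends the first line
gapminder %>%
  filter(year == 1957)

# Incorrect: R evaluates the first line on its own, then fails on the
# orphaned %>% at the start of the second line
# gapminder
#   %>% filter(year == 1957)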

EXERCISE:

Filter the gapminder data to retrieve only the observation from China in the year 2002.

library(gapminder)
library(dplyr)
gapminder
## # A tibble: 1,704 x 6
##    country     continent  year lifeExp      pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.
##  2 Afghanistan Asia       1957    30.3  9240934      821.
##  3 Afghanistan Asia       1962    32.0 10267083      853.
##  4 Afghanistan Asia       1967    34.0 11537966      836.
##  5 Afghanistan Asia       1972    36.1 13079460      740.
##  6 Afghanistan Asia       1977    38.4 14880372      786.
##  7 Afghanistan Asia       1982    39.9 12881816      978.
##  8 Afghanistan Asia       1987    40.8 13867957      852.
##  9 Afghanistan Asia       1992    41.7 16317921      649.
## 10 Afghanistan Asia       1997    41.8 22227415      635.
## # ... with 1,694 more rows
# Filter for China in 2002
gapminder %>% filter(year == 2002 & country == "China")
## # A tibble: 1 x 6
##   country continent  year lifeExp        pop gdpPercap
##   <fct>   <fct>     <int>   <dbl>      <int>     <dbl>
## 1 China   Asia       2002    72.0 1280400000     3119.
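
Note that filter() treats comma-separated conditions as an AND, so the same result can be written without the & operator:

gapminder %>% filter(country == "China", year == 2002)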

Arranging observations by life expectancy

You use arrange() to sort observations in ascending or descending order of a particular variable. In this case, you’ll sort the dataset based on the lifeExp variable.

EXERCISE:

Sort the gapminder dataset in ascending order of life expectancy (lifeExp). Sort the gapminder dataset in descending order of life expectancy.

library(gapminder)
library(dplyr)

# Sort in ascending order of lifeExp

gapminder %>% arrange(lifeExp)
## # A tibble: 1,704 x 6
##    country      continent  year lifeExp     pop gdpPercap
##    <fct>        <fct>     <int>   <dbl>   <int>     <dbl>
##  1 Rwanda       Africa     1992    23.6 7290203      737.
##  2 Afghanistan  Asia       1952    28.8 8425333      779.
##  3 Gambia       Africa     1952    30    284320      485.
##  4 Angola       Africa     1952    30.0 4232095     3521.
##  5 Sierra Leone Africa     1952    30.3 2143249      880.
##  6 Afghanistan  Asia       1957    30.3 9240934      821.
##  7 Cambodia     Asia       1977    31.2 6978607      525.
##  8 Mozambique   Africa     1952    31.3 6446316      469.
##  9 Sierra Leone Africa     1957    31.6 2295678     1004.
## 10 Burkina Faso Africa     1952    32.0 4469979      543.
## # ... with 1,694 more rows
# Sort in descending order of lifeExp
gapminder %>% arrange(desc(lifeExp))
## # A tibble: 1,704 x 6
##    country          continent  year lifeExp       pop gdpPercap
##    <fct>            <fct>     <int>   <dbl>     <int>     <dbl>
##  1 Japan            Asia       2007    82.6 127467972    31656.
##  2 Hong Kong, China Asia       2007    82.2   6980412    39725.
##  3 Japan            Asia       2002    82   127065841    28605.
##  4 Iceland          Europe     2007    81.8    301931    36181.
##  5 Switzerland      Europe     2007    81.7   7554661    37506.
##  6 Hong Kong, China Asia       2002    81.5   6762476    30209.
##  7 Australia        Oceania    2007    81.2  20434176    34435.
##  8 Spain            Europe     2007    80.9  40448191    28821.
##  9 Sweden           Europe     2007    80.9   9031088    33860.
## 10 Israel           Asia       2007    80.7   6426679    25523.
## # ... with 1,694 more rows

Filtering and arranging

You’ll often need to use the pipe operator (%>%) to combine multiple dplyr verbs in a row. In this case, you’ll combine a filter() with an arrange() to find the highest population countries in a particular year.

EXERCISE:

Use filter() to extract observations from just the year 1957, then use arrange() to sort in descending order of population (pop).

library(gapminder)
library(dplyr)

# Filter for the year 1957, then arrange in descending order of population
gapminder %>% filter(year == 1957) %>%
    arrange(desc(pop))
## # A tibble: 142 x 6
##    country        continent  year lifeExp       pop gdpPercap
##    <fct>          <fct>     <int>   <dbl>     <int>     <dbl>
##  1 China          Asia       1957    50.5 637408000      576.
##  2 India          Asia       1957    40.2 409000000      590.
##  3 United States  Americas   1957    69.5 171984000    14847.
##  4 Japan          Asia       1957    65.5  91563009     4318.
##  5 Indonesia      Asia       1957    39.9  90124000      859.
##  6 Germany        Europe     1957    69.1  71019069    10188.
##  7 Brazil         Americas   1957    53.3  65551171     2487.
##  8 United Kingdom Europe     1957    70.4  51430000    11283.
##  9 Bangladesh     Asia       1957    39.3  51365468      662.
## 10 Italy          Europe     1957    67.8  49182000     6249.
## # ... with 132 more rows

Using mutate to change or create a column

Suppose we want life expectancy to be measured in months instead of years: you’d have to multiply the existing value by 12. You can use the mutate() verb to change this column, or to create a new column that’s calculated this way.

EXERCISE

Use mutate() to change the existing lifeExp column, by multiplying it by 12: 12 * lifeExp. Use mutate() to add a new column, called lifeExpMonths, calculated as 12 * lifeExp.

library(gapminder)
library(dplyr)

# Use mutate to change lifeExp to be in months
gapminder %>% mutate(lifeExp = 12 * lifeExp)
## # A tibble: 1,704 x 6
##    country     continent  year lifeExp      pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1952    346.  8425333      779.
##  2 Afghanistan Asia       1957    364.  9240934      821.
##  3 Afghanistan Asia       1962    384. 10267083      853.
##  4 Afghanistan Asia       1967    408. 11537966      836.
##  5 Afghanistan Asia       1972    433. 13079460      740.
##  6 Afghanistan Asia       1977    461. 14880372      786.
##  7 Afghanistan Asia       1982    478. 12881816      978.
##  8 Afghanistan Asia       1987    490. 13867957      852.
##  9 Afghanistan Asia       1992    500. 16317921      649.
## 10 Afghanistan Asia       1997    501. 22227415      635.
## # ... with 1,694 more rows
# Use mutate to create a new column called lifeExpMonths
gapminder %>% mutate(lifeExpMonths = 12 * lifeExp)
## # A tibble: 1,704 x 7
##    country     continent  year lifeExp      pop gdpPercap lifeExpMonths
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>         <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.          346.
##  2 Afghanistan Asia       1957    30.3  9240934      821.          364.
##  3 Afghanistan Asia       1962    32.0 10267083      853.          384.
##  4 Afghanistan Asia       1967    34.0 11537966      836.          408.
##  5 Afghanistan Asia       1972    36.1 13079460      740.          433.
##  6 Afghanistan Asia       1977    38.4 14880372      786.          461.
##  7 Afghanistan Asia       1982    39.9 12881816      978.          478.
##  8 Afghanistan Asia       1987    40.8 13867957      852.          490.
##  9 Afghanistan Asia       1992    41.7 16317921      649.          500.
## 10 Afghanistan Asia       1997    41.8 22227415      635.          501.
## # ... with 1,694 more rows

Combining filter, mutate, and arrange

In this exercise, you’ll combine all three of the verbs you’ve learned in this chapter, to find the countries with the highest life expectancy, in months, in the year 2007.

EXERCISE:

In one sequence of pipes on the gapminder dataset: filter() for observations from the year 2007, mutate() to create a column lifeExpMonths calculated as 12 * lifeExp, and arrange() in descending order of that new column.

library(gapminder)
library(dplyr)

# Filter, mutate, and arrange the gapminder dataset

gapminder %>%
  filter(year == 2007) %>%
  mutate(lifeExpMonths = 12 * lifeExp) %>%
  arrange(desc(lifeExpMonths))
## # A tibble: 142 x 7
##    country          continent  year lifeExp       pop gdpPercap lifeExpMonths
##    <fct>            <fct>     <int>   <dbl>     <int>     <dbl>         <dbl>
##  1 Japan            Asia       2007    82.6 127467972    31656.          991.
##  2 Hong Kong, China Asia       2007    82.2   6980412    39725.          986.
##  3 Iceland          Europe     2007    81.8    301931    36181.          981.
##  4 Switzerland      Europe     2007    81.7   7554661    37506.          980.
##  5 Australia        Oceania    2007    81.2  20434176    34435.          975.
##  6 Spain            Europe     2007    80.9  40448191    28821.          971.
##  7 Sweden           Europe     2007    80.9   9031088    33860.          971.
##  8 Israel           Asia       2007    80.7   6426679    25523.          969.
##  9 France           Europe     2007    80.7  61083916    30470.          968.
## 10 Canada           Americas   2007    80.7  33390141    36319.          968.
## # ... with 132 more rows

Data visualization

Variable assignment

Throughout the exercises in this chapter, you’ll be visualizing a subset of the gapminder data from the year 1952. First, you’ll have to load the ggplot2 package, and create a gapminder_1952 dataset to visualize.

By the way, if you haven’t downloaded it already, check out the tidyverse for beginners Cheat Sheet. It includes an overview of the most important concepts, functions and methods and might come in handy if you ever need a quick refresher!: https://datacamp-community-prod.s3.amazonaws.com/e63a8f6b-2aa3-4006-89e0-badc294b179c

EXERCISE:

Load the ggplot2 package after the gapminder and dplyr packages. Filter gapminder for observations from the year 1952, and assign it to a new dataset gapminder_1952 using the assignment operator (<-).

# Load the ggplot2 package as well
library(gapminder)
library(dplyr)
library(ggplot2)

# Create gapminder_1952
gapminder_1952 <- filter(gapminder, year == 1952)

Comparing population and GDP per capita

In the video you learned to create a scatter plot with GDP per capita on the x-axis and life expectancy on the y-axis (the code for that graph is shown here). When you’re exploring data visually, you’ll often need to try different combinations of variables and aesthetics.

EXERCISE:

Change the scatter plot of gapminder_1952 so that population (pop) is on the x-axis and GDP per capita (gdpPercap) is on the y-axis.

library(gapminder)
library(dplyr)
library(ggplot2)

gapminder_1952 <- gapminder %>%
  filter(year == 1952)

# Change to put pop on the x-axis and gdpPercap on the y-axis
ggplot(gapminder_1952, aes(x = pop, y = gdpPercap)) +
  geom_point()

Comparing population and life expectancy

In this exercise, you’ll use ggplot2 to create a scatter plot from scratch, to compare each country’s population with its life expectancy in the year 1952.

EXERCISE:

Create a scatter plot of gapminder_1952 with population (pop) on the x-axis and life expectancy (lifeExp) on the y-axis.

library(gapminder)
library(dplyr)
library(ggplot2)

gapminder_1952 <- gapminder %>%
  filter(year == 1952)

# Create a scatter plot with pop on the x-axis and lifeExp on the y-axis
ggplot(gapminder_1952, aes(x = pop, y = lifeExp)) +
  geom_point()

Putting the x-axis on a log scale

You previously created a scatter plot with population on the x-axis and life expectancy on the y-axis. Since population is spread over several orders of magnitude, with some countries having a much higher population than others, it’s a good idea to put the x-axis on a log scale.

EXERCISE:

Change the existing scatter plot (code provided) to put the x-axis (representing population) on a log scale.

library(gapminder)
library(dplyr)
library(ggplot2)

gapminder_1952 <- gapminder %>%
  filter(year == 1952)

# Change this plot to put the x-axis on a log scale
ggplot(gapminder_1952, aes(x = pop, y = lifeExp)) +
  geom_point() + 
  scale_x_log10()

Putting the x- and y- axes on a log scale

Suppose you want to create a scatter plot with population on the x-axis and GDP per capita on the y-axis. Both population and GDP per-capita are better represented with log scales, since they vary over many orders of magnitude.

EXERCISE:

Create a scatter plot with population (pop) on the x-axis and GDP per capita (gdpPercap) on the y-axis. Put both the x- and y- axes on a log scale.

library(gapminder)
library(dplyr)
library(ggplot2)

gapminder_1952 <- gapminder %>%
  filter(year == 1952)

# Scatter plot comparing pop and gdpPercap, with both axes on a log scale
ggplot(gapminder_1952, aes(x = pop, y = gdpPercap)) +
  geom_point() +
  scale_x_log10() +
  scale_y_log10()

Adding color to a scatter plot

In this lesson you learned how to use the color aesthetic, which can be used to show which continent each point in a scatter plot represents.

EXERCISE:

Create a scatter plot with population (pop) on the x-axis, life expectancy (lifeExp) on the y-axis, and with continent (continent) represented by the color of the points. Put the x-axis on a log scale.

library(gapminder)
library(dplyr)
library(ggplot2)

gapminder_1952 <- gapminder %>%
  filter(year == 1952)
gapminder_1952
## # A tibble: 142 x 6
##    country     continent  year lifeExp      pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.
##  2 Albania     Europe     1952    55.2  1282697     1601.
##  3 Algeria     Africa     1952    43.1  9279525     2449.
##  4 Angola      Africa     1952    30.0  4232095     3521.
##  5 Argentina   Americas   1952    62.5 17876956     5911.
##  6 Australia   Oceania    1952    69.1  8691212    10040.
##  7 Austria     Europe     1952    66.8  6927772     6137.
##  8 Bahrain     Asia       1952    50.9   120447     9867.
##  9 Bangladesh  Asia       1952    37.5 46886859      684.
## 10 Belgium     Europe     1952    68    8730405     8343.
## # ... with 132 more rows
# Scatter plot comparing pop and lifeExp, with color representing continent
ggplot(gapminder_1952, aes(x = pop, y = lifeExp, color = continent)) +
  geom_point() +
  scale_x_log10()

Adding size and color to a plot

In the last exercise, you created a scatter plot communicating information about each country’s population, life expectancy, and continent. Now you’ll use the size of the points to communicate even more.

EXERCISE:

Modify the scatter plot so that the size of the points represents each country’s GDP per capita (gdpPercap).

library(gapminder)
library(dplyr)
library(ggplot2)

gapminder_1952 <- gapminder %>%
  filter(year == 1952)

# Add the size aesthetic to represent a country's gdpPercap
ggplot(gapminder_1952, aes(x = pop, y = lifeExp, color = continent, size = gdpPercap)) +
  geom_point() +
  scale_x_log10()

Creating a subgraph for each continent

You’ve learned to use faceting to divide a graph into subplots based on one of its variables, such as the continent.

EXERCISE:

Create a scatter plot of gapminder_1952 with the x-axis representing population (pop), the y-axis representing life expectancy (lifeExp), and faceted to have one subplot per continent (continent). Put the x-axis on a log scale.

library(gapminder)
library(dplyr)
library(ggplot2)

gapminder_1952 <- gapminder %>%
  filter(year == 1952)

# Scatter plot comparing pop and lifeExp, faceted by continent
ggplot(gapminder_1952, aes(x = pop, y = lifeExp)) +
  geom_point() +
  scale_x_log10() +
  facet_wrap(~ continent)

Faceting by year

All of the graphs in this chapter have been visualizing statistics within one year. Now that you’re able to use faceting, however, you can create a graph showing all the country-level data from 1952 to 2007, to understand how global statistics have changed over time.

EXERCISE:

Create a scatter plot of the gapminder data: put GDP per capita (gdpPercap) on the x-axis and life expectancy (lifeExp) on the y-axis, with continent (continent) represented by color and population (pop) represented by size. Put the x-axis on a log scale and facet by the year variable.

library(gapminder)
library(dplyr)
library(ggplot2)

# Scatter plot comparing gdpPercap and lifeExp, with color representing continent
# and size representing population, faceted by year
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, color = continent, size = pop)) +
  geom_point() +
  scale_x_log10() +
  facet_wrap(~ year)

Grouping and summarizing

Summarizing the median life expectancy

You’ve seen how to find the mean life expectancy and the total population across a set of observations, but mean() and sum() are only two of the functions R provides for summarizing a collection of numbers. Here, you’ll learn to use the median() function in combination with summarize().

By the way, dplyr displays some messages when it’s loaded that we’ve been hiding so far. They’ll show up in red and start with:

Attaching package: ‘dplyr’

The following objects are masked from 'package:stats':

This will occur in future exercises each time you load dplyr: it's mentioning some built-in functions that are overwritten by dplyr. You won't need to worry about this message within this course.
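
If you ever do need one of the masked base functions, the package prefix still reaches it. A minimal sketch (not part of any exercise):

# dplyr's filter() masks stats::filter() once dplyr is loaded
library(gapminder)
library(dplyr)
filter(gapminder, year == 1957)      # dplyr's filter: subsets rows of a data frame
stats::filter(1:10, rep(1/3, 3))     # the stats filter: a moving-average time-series filter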

EXERCISE:

Use the median() function within a summarize() to find the median life expectancy. Save it into a column called medianLifeExp.

library(gapminder)
library(dplyr)
gapminder
## # A tibble: 1,704 x 6
##    country     continent  year lifeExp      pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.
##  2 Afghanistan Asia       1957    30.3  9240934      821.
##  3 Afghanistan Asia       1962    32.0 10267083      853.
##  4 Afghanistan Asia       1967    34.0 11537966      836.
##  5 Afghanistan Asia       1972    36.1 13079460      740.
##  6 Afghanistan Asia       1977    38.4 14880372      786.
##  7 Afghanistan Asia       1982    39.9 12881816      978.
##  8 Afghanistan Asia       1987    40.8 13867957      852.
##  9 Afghanistan Asia       1992    41.7 16317921      649.
## 10 Afghanistan Asia       1997    41.8 22227415      635.
## # ... with 1,694 more rows
# Summarize to find the median life expectancy
gapminder %>%
  summarize(medianLifeExp = median(lifeExp))
## # A tibble: 1 x 1
##   medianLifeExp
##           <dbl>
## 1          60.7

Summarizing the median life expectancy in 1957

Rather than summarizing the entire dataset, you may want to find the median life expectancy for only one particular year. In this case, you’ll find the median in the year 1957.

EXERCISE:

Filter for the year 1957, then use the median() function within a summarize() to calculate the median life expectancy into a column called medianLifeExp.

library(gapminder)
library(dplyr)

# Filter for 1957 then summarize the median life expectancy
gapminder %>%
  filter(year == 1957) %>%
  summarize(medianLifeExp = median(lifeExp))
## # A tibble: 1 x 1
##   medianLifeExp
##           <dbl>
## 1          48.4

Summarizing multiple variables in 1957

The summarize() verb allows you to summarize multiple variables at once. In this case, you’ll use the median() function to find the median life expectancy and the max() function to find the maximum GDP per capita.

EXERCISE:

Filter for the year 1957, then summarize the median life expectancy (medianLifeExp) and the maximum GDP per capita (maxGdpPercap).

library(gapminder)
library(dplyr)

# Filter for 1957 then summarize the median life expectancy and the maximum GDP per capita
gapminder %>%
  filter(year == 1957) %>%
  summarize(medianLifeExp = median(lifeExp), maxGdpPercap = max(gdpPercap))
## # A tibble: 1 x 2
##   medianLifeExp maxGdpPercap
##           <dbl>        <dbl>
## 1          48.4      113523.

Summarizing by year

In a previous exercise, you found the median life expectancy and the maximum GDP per capita in the year 1957. Now, you’ll perform those two summaries within each year in the dataset, using the group_by verb.

EXERCISE:

Use group_by() with summarize() to find the median life expectancy (medianLifeExp) and the maximum GDP per capita (maxGdpPercap) within each year.

library(gapminder)
library(dplyr)

# Find median life expectancy and maximum GDP per capita in each year
gapminder %>%
  group_by(year) %>%
  summarize(medianLifeExp = median(lifeExp), maxGdpPercap = max(gdpPercap))
## # A tibble: 12 x 3
##     year medianLifeExp maxGdpPercap
##    <int>         <dbl>        <dbl>
##  1  1952          45.1      108382.
##  2  1957          48.4      113523.
##  3  1962          50.9       95458.
##  4  1967          53.8       80895.
##  5  1972          56.5      109348.
##  6  1977          59.7       59265.
##  7  1982          62.4       33693.
##  8  1987          65.8       31541.
##  9  1992          67.7       34933.
## 10  1997          69.4       41283.
## 11  2002          70.8       44684.
## 12  2007          71.9       49357.

Summarizing by continent

You can group by any variable in your dataset to create a summary. Rather than comparing across time, you might be interested in comparing among continents. You’ll want to do that within one year of the dataset: let’s use 1957.

EXERCISE:

Filter the gapminder data for the year 1957. Then find the median life expectancy (lifeExp) and maximum GDP per capita (gdpPercap) within each continent, saving them into medianLifeExp and maxGdpPercap, respectively.

library(gapminder)
library(dplyr)

# Find median life expectancy and maximum GDP per capita in each continent in 1957
gapminder %>%
  filter(year == 1957) %>%
  group_by(continent) %>%
  summarize(medianLifeExp = median(lifeExp), maxGdpPercap = max(gdpPercap))
## # A tibble: 5 x 3
##   continent medianLifeExp maxGdpPercap
##   <fct>             <dbl>        <dbl>
## 1 Africa             40.6        5487.
## 2 Americas           56.1       14847.
## 3 Asia               48.3      113523.
## 4 Europe             67.6       17909.
## 5 Oceania            70.3       12247.

Summarizing by continent and year

Instead of grouping just by year, or just by continent, you’ll now group by both continent and year to summarize within each.

EXERCISE

Find the median life expectancy (lifeExp) and maximum GDP per capita (gdpPercap) within each combination of continent and year, saving them into medianLifeExp and maxGdpPercap, respectively.

library(gapminder)
library(dplyr)

# Find median life expectancy and maximum GDP per capita in each continent/year combination

gapminder %>%
  group_by(continent, year) %>%
  summarize(medianLifeExp = median(lifeExp), maxGdpPercap = max(gdpPercap))
## # A tibble: 60 x 4
## # Groups:   continent [5]
##    continent  year medianLifeExp maxGdpPercap
##    <fct>     <int>         <dbl>        <dbl>
##  1 Africa     1952          38.8        4725.
##  2 Africa     1957          40.6        5487.
##  3 Africa     1962          42.6        6757.
##  4 Africa     1967          44.7       18773.
##  5 Africa     1972          47.0       21011.
##  6 Africa     1977          49.3       21951.
##  7 Africa     1982          50.8       17364.
##  8 Africa     1987          51.6       11864.
##  9 Africa     1992          52.4       13522.
## 10 Africa     1997          52.8       14723.
## # ... with 50 more rows

Visualizing median life expectancy over time

In the last chapter, you summarized the gapminder data to calculate the median life expectancy within each year. This code is provided for you, and is saved (with <-) as the by_year dataset.

Now you can use the ggplot2 package to turn this into a visualization of changing life expectancy over time.

EXERCISE:

Use the by_year dataset to create a scatter plot showing the change of median life expectancy over time, with year on the x-axis and medianLifeExp on the y-axis. Be sure to add expand_limits(y = 0) to make sure the plot’s y-axis includes zero.

library(gapminder)
library(dplyr)
library(ggplot2)

by_year <- gapminder %>%
  group_by(year) %>%
  summarize(medianLifeExp = median(lifeExp),
            maxGdpPercap = max(gdpPercap))

# Create a scatter plot showing the change in medianLifeExp over time
ggplot(by_year, aes(x = year, y = medianLifeExp)) +
  geom_point() +
  expand_limits(y = 0)

Visualizing median GDP per capita per continent over time

In the last exercise you were able to see how the median life expectancy of countries changed over time. Now you’ll examine the median GDP per capita instead, and see how the trend differs among continents.

EXERCISE:

Summarize the gapminder dataset by continent and year, finding the median GDP per capita (gdpPercap) within each and putting it into a column called medianGdpPercap. Use the assignment operator <- to save this summarized data as by_year_continent. Create a scatter plot showing the change in medianGdpPercap by continent over time. Use color to distinguish between continents, and be sure to add expand_limits(y = 0) so that the y-axis starts at zero.

library(gapminder)
library(dplyr)
library(ggplot2)
# Summarize medianGdpPercap within each continent within each year: by_year_continent
by_year_continent <- gapminder %>%
  group_by(continent, year) %>%
  summarize(medianGdpPercap = median(gdpPercap))

# Plot the change in medianGdpPercap in each continent over time
ggplot(by_year_continent, aes(x = year, y = medianGdpPercap, color = continent)) +
  geom_point() +
  expand_limits(y = 0)

Comparing median life expectancy and median GDP per continent in 2007

In these exercises you’ve generally created plots that show change over time. But as another way of exploring your data visually, you can also use ggplot2 to plot summarized data to compare continents within a single year.

EXERCISE:

Filter the gapminder dataset for the year 2007, then summarize the median GDP per capita and the median life expectancy within each continent, into columns called medianLifeExp and medianGdpPercap. Save this as by_continent_2007. Use the by_continent_2007 data to create a scatter plot comparing these summary statistics for continents in 2007, putting the median GDP per capita on the x-axis and the median life expectancy on the y-axis. Color the scatter plot by continent. You don't need to add expand_limits(y = 0) for this plot.

library(gapminder)
library(dplyr)
library(ggplot2)
gapminder
## # A tibble: 1,704 x 6
##    country     continent  year lifeExp      pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.
##  2 Afghanistan Asia       1957    30.3  9240934      821.
##  3 Afghanistan Asia       1962    32.0 10267083      853.
##  4 Afghanistan Asia       1967    34.0 11537966      836.
##  5 Afghanistan Asia       1972    36.1 13079460      740.
##  6 Afghanistan Asia       1977    38.4 14880372      786.
##  7 Afghanistan Asia       1982    39.9 12881816      978.
##  8 Afghanistan Asia       1987    40.8 13867957      852.
##  9 Afghanistan Asia       1992    41.7 16317921      649.
## 10 Afghanistan Asia       1997    41.8 22227415      635.
## # ... with 1,694 more rows
# Summarize the median GDP and median life expectancy per continent in 2007
by_continent_2007 <- gapminder %>%
  filter(year == 2007) %>%
  group_by(continent) %>%
  summarize(medianLifeExp = median(lifeExp), medianGdpPercap = median(gdpPercap))

# Use a scatter plot to compare the median GDP and median life expectancy
ggplot(by_continent_2007, aes(x = medianGdpPercap, y = medianLifeExp, color = continent)) +
  geom_point()

Types of visualizations

Visualizing median GDP per capita over time

A line plot is useful for visualizing trends over time. In this exercise, you’ll examine how the median GDP per capita has changed over time.

EXERCISE:

Use group_by() and summarize() to find the median GDP per capita within each year, calling the output column medianGdpPercap. Use the assignment operator <- to save it to a dataset called by_year. Use the by_year dataset to create a line plot showing the change in median GDP per capita over time. Be sure to use expand_limits(y = 0) to include 0 on the y-axis.

library(gapminder)
library(dplyr)
library(ggplot2)

# Summarize the median gdpPercap by year, then save it as by_year
by_year <- gapminder %>%
  group_by(year) %>%
  summarize(medianGdpPercap = median(gdpPercap))

# Create a line plot showing the change in medianGdpPercap over time
ggplot(by_year, aes(x = year, y = medianGdpPercap)) + geom_line() + expand_limits(y = 0)

Visualizing median GDP per capita by continent over time

In the last exercise you used a line plot to visualize the increase in median GDP per capita over time. Now you’ll examine the change within each continent.

EXERCISE:

Use group_by() and summarize() to find the median GDP per capita within each year and continent, calling the output column medianGdpPercap. Use the assignment operator <- to save it to a dataset called by_year_continent. Use the by_year_continent dataset to create a line plot showing the change in median GDP per capita over time, with color representing continent. Be sure to use expand_limits(y = 0) to include 0 on the y-axis.

library(gapminder)
library(dplyr)
library(ggplot2)

# Summarize the median gdpPercap by year & continent, save as by_year_continent
by_year_continent <- gapminder %>%
  group_by(year, continent) %>%
  summarize(medianGdpPercap = median(gdpPercap))

# Create a line plot showing the change in medianGdpPercap by continent over time
ggplot(by_year_continent, aes(x = year, y = medianGdpPercap, color = continent)) + geom_line() + expand_limits(y = 0)

Visualizing median GDP per capita by continent

A bar plot is useful for visualizing summary statistics, such as the median GDP in each continent.

EXERCISE:

Use group_by() and summarize() to find the median GDP per capita within each continent in the year 1952, calling the output column medianGdpPercap. Use the assignment operator <- to save it to a dataset called by_continent. Use the by_continent dataset to create a bar plot showing the median GDP per capita in each continent.

library(gapminder)
library(dplyr)
library(ggplot2)

# Summarize the median gdpPercap by continent in 1952
by_continent <- gapminder %>%
  group_by(continent) %>%
  filter(year == 1952) %>%
  summarize(medianGdpPercap = median(gdpPercap))

# Create a bar plot showing medianGdp by continent

ggplot(by_continent, aes(x = continent, y = medianGdpPercap)) + geom_col()
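
A note on the geom choice: geom_col() plots the values already present in the data (here, the summarized medians), while geom_bar() counts rows by default. For pre-summarized data like by_continent, geom_col() or the equivalent geom_bar(stat = "identity") is the right choice:

ggplot(by_continent, aes(x = continent, y = medianGdpPercap)) + geom_bar(stat = "identity")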

Visualizing GDP per capita by country in Oceania

You’ve created a plot where each bar represents one continent, showing the median GDP per capita for each. But the x-axis of the bar plot doesn’t have to be the continent: you can instead create a bar plot where each bar represents a country.

In this exercise, you’ll create a bar plot comparing the GDP per capita between the two countries in the Oceania continent (Australia and New Zealand).

EXERCISE:

Filter for observations in the Oceania continent in the year 1952. Save this as oceania_1952. Use the oceania_1952 dataset to create a bar plot, with country on the x-axis and gdpPercap on the y-axis.

library(gapminder)
library(dplyr)
library(ggplot2)

# Filter for observations in the Oceania continent in 1952
oceania_1952 <- gapminder %>%
  filter(year == 1952, continent == "Oceania")

# Create a bar plot of gdpPercap by country
ggplot(oceania_1952, aes(x = country, y = gdpPercap)) + geom_col()

Visualizing population

A histogram is useful for examining the distribution of a numeric variable. In this exercise, you’ll create a histogram showing the distribution of country populations (in millions) in the year 1952.

Code for generating this dataset, gapminder_1952, is provided.

EXERCISE

Use the gapminder_1952 dataset to create a histogram of country population (pop_by_mil) in the year 1952. Inside the histogram geom, set the number of bins to 50.

library(gapminder)
library(dplyr)
library(ggplot2)

gapminder_1952 <- gapminder %>%
  filter(year == 1952) %>%
  mutate(pop_by_mil = pop / 1000000)

# Create a histogram of population (pop_by_mil)
ggplot(gapminder_1952, aes(x = pop_by_mil)) +
  geom_histogram(bins = 50)
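
If you leave bins out, geom_histogram() falls back to 30 bins and prints a message suggesting you pick a better value. An alternative (a sketch, with an arbitrary width) is to fix the bin width instead of the bin count:

# One bar per 10 million people
ggplot(gapminder_1952, aes(x = pop_by_mil)) +
  geom_histogram(binwidth = 10)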

Visualizing population with x-axis on a log scale

In the last exercise you created a histogram of populations across countries. You might have noticed that there were several countries with a much higher population than others, which causes the distribution to be very skewed, with most of the distribution crammed into a small part of the graph. (Consider that it’s hard to tell the median or the minimum population from that histogram).

To make the histogram more informative, you can try putting the x-axis on a log scale.

EXERCISE:

Use the gapminder_1952 dataset (code is provided) to create a histogram of country population (pop) in the year 1952, putting the x-axis on a log scale with scale_x_log10().

library(gapminder)
library(dplyr)
library(ggplot2)

gapminder_1952 <- gapminder %>%
  filter(year == 1952)

# Create a histogram of population (pop), with x on a log scale
ggplot(gapminder_1952, aes(x = pop)) + geom_histogram(bins=30) + scale_x_log10()

Comparing GDP per capita across continents

A boxplot is useful for comparing a distribution of values across several groups. In this exercise, you’ll examine the distribution of GDP per capita by continent. Since GDP per capita varies across several orders of magnitude, you’ll need to put the y-axis on a log scale.

EXERCISE:

Use the gapminder_1952 dataset (code is provided) to create a boxplot comparing GDP per capita (gdpPercap) among continents. Put the y-axis on a log scale with scale_y_log10().

library(gapminder)
library(dplyr)
library(ggplot2)

gapminder_1952 <- gapminder %>%
  filter(year == 1952)

# Create a boxplot comparing gdpPercap among continents
ggplot(gapminder_1952, aes(x = continent, y = gdpPercap)) + geom_boxplot() + scale_y_log10()

Adding a title to your graph

There are many other options for customizing a ggplot2 graph, which you can learn about in other DataCamp courses. You can also learn about them from online resources, which is an important skill to develop.

As the final exercise in this course, you’ll practice looking up ggplot2 instructions by completing a task we haven’t shown you how to do.

EXERCISE:

Add a title to the graph: "Comparing GDP per capita across continents". Use a search engine, such as Google or Bing, to learn how to do so. After this exercise you are almost done with the course. If you enjoyed the material, feel free to send Dave a thank-you via Twitter. He’ll appreciate it.

library(gapminder)
library(dplyr)
library(ggplot2)

gapminder_1952 <- gapminder %>%
  filter(year == 1952)

# Add a title to this graph: "Comparing GDP per capita across continents"
ggplot(gapminder_1952, aes(x = continent, y = gdpPercap)) +
  geom_boxplot() +
  scale_y_log10() + ggtitle("Comparing GDP per capita across continents")
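
An equivalent approach you’ll find in the same documentation is labs(), which can set the title along with axis labels:

ggplot(gapminder_1952, aes(x = continent, y = gdpPercap)) +
  geom_boxplot() +
  scale_y_log10() +
  labs(title = "Comparing GDP per capita across continents")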

4.- Data Manipulation with dplyr

Transforming Data with dplyr

Selecting columns

Select the following four columns from the counties variable:

state, county, population, and poverty. You don’t need to save the result to a variable.

EXERCISES:

Select the columns listed from the counties variable.

# Select the columns
counties <- read.csv("./Data/counties.csv")
counties %>%
  select(state, county, population, poverty)

The filter and arrange verbs

Arranging observations

Here you see the counties_selected dataset with a few interesting variables selected. The variables private_work, public_work, and self_employed describe whether people work for private companies, for the government, or for themselves.

In these exercises, you’ll sort these observations to find the most interesting cases.

EXERCISES:

Add a verb to sort the observations of the public_work variable in descending order.

counties_selected <- counties %>%
  select(state, county, population, private_work, public_work, self_employed)
  

# Add a verb to sort in descending order of public_work
counties_selected %>% arrange(desc(public_work)) %>% top_n(n = 10)
## Selecting by self_employed
##           state    county population private_work public_work self_employed
## 1       Montana    McCone       1728         48.8        22.2          27.8
## 2       Montana  Garfield       1047         40.2        20.9          36.6
## 3      Nebraska      Loup        548         51.2        19.5          29.3
## 4      Nebraska Keya Paha        711         48.0        16.6          34.3
## 5       Montana    Carter       1227         50.0        16.0          30.9
## 6  North Dakota    Divide       2314         54.9        15.3          29.8
## 7  North Dakota     Grant       2362         52.4        13.3          32.8
## 8  North Dakota     Logan       1945         58.4        11.8          27.8
## 9  South Dakota    Haakon       2083         59.3        11.7          28.5
## 10      Montana Wheatland       2115         51.4         9.5          33.0
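
The "Selecting by self_employed" message appears because top_n() ranks by the last column of the table when you don’t name one, so the ten rows printed here are chosen by self_employed rather than by public_work. To truncate by the sorted column instead, pass it explicitly:

counties_selected %>% arrange(desc(public_work)) %>% top_n(10, public_work)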

Filtering for conditions

You use the filter() verb to get only observations that match a particular condition, or match multiple conditions.

EXERCISES:

Find only the counties that have a population above one million (1000000).

# Filter for counties with a population above 1000000
counties_selected %>% filter(population > 1000000) %>% top_n(n=10)
## Selecting by self_employed
##         state       county population private_work public_work self_employed
## 1  California      Alameda    1584983         78.7        13.8           7.4
## 2  California Contra Costa    1096068         77.5        13.6           8.7
## 3  California  Los Angeles   10038388         79.0        11.5           9.4
## 4  California       Orange    3116069         81.8        10.2           7.8
## 5  California    Riverside    2298032         77.1        14.9           7.8
## 6  California   Sacramento    1465832         70.8        21.8           7.3
## 7  California    San Diego    3223096         77.3        14.8           7.7
## 8     Florida   Miami-Dade    2639042         81.9        10.2           7.7
## 9    New York     New York    1629507         83.6         8.5           7.8
## 10      Texas       Travis    1121645         76.5        16.0           7.4

Find only the counties in the state of California that also have a population above one million (1000000).

# Filter for counties in the state of California that have a population above 1000000
counties_selected %>% filter(state == "California", population > 1000000) %>% top_n(n=10)
## Selecting by self_employed
##        state         county population private_work public_work self_employed
## 1 California        Alameda    1584983         78.7        13.8           7.4
## 2 California   Contra Costa    1096068         77.5        13.6           8.7
## 3 California    Los Angeles   10038388         79.0        11.5           9.4
## 4 California         Orange    3116069         81.8        10.2           7.8
## 5 California      Riverside    2298032         77.1        14.9           7.8
## 6 California     Sacramento    1465832         70.8        21.8           7.3
## 7 California San Bernardino    2094769         76.4        16.7           6.7
## 8 California      San Diego    3223096         77.3        14.8           7.7
## 9 California    Santa Clara    1868149         84.3         9.3           6.4

Filtering and arranging

You’re often interested in both filtering and sorting a dataset, to focus on observations of particular interest. Here, you’ll find counties that are extreme examples of what fraction of the population works in the private sector.

EXERCISES:

Filter for counties in the state of Texas that have more than ten thousand people (10000), and sort them in descending order of the percentage of people employed in private work.

counties_selected %>% top_n(n = 10)
## Selecting by self_employed
##           state    county population private_work public_work self_employed
## 1       Montana    Carter       1227         50.0        16.0          30.9
## 2       Montana  Garfield       1047         40.2        20.9          36.6
## 3       Montana    McCone       1728         48.8        22.2          27.8
## 4       Montana Wheatland       2115         51.4         9.5          33.0
## 5      Nebraska Keya Paha        711         48.0        16.6          34.3
## 6      Nebraska      Loup        548         51.2        19.5          29.3
## 7  North Dakota    Divide       2314         54.9        15.3          29.8
## 8  North Dakota     Grant       2362         52.4        13.3          32.8
## 9  North Dakota     Logan       1945         58.4        11.8          27.8
## 10 South Dakota    Haakon       2083         59.3        11.7          28.5
# Filter for Texas and more than 10000 people; sort in descending order of private_work
counties_selected %>%
  filter(state == "Texas", population > 10000) %>%
  arrange(desc(private_work)) %>%
  top_n(n = 10)
## Selecting by self_employed
##    state    county population private_work public_work self_employed
## 1  Texas   Aransas      24292         75.4        12.4          12.2
## 2  Texas Gillespie      25398         74.8        10.5          14.3
## 3  Texas  Live Oak      11873         73.5        14.2          12.1
## 4  Texas    Lavaca      19549         73.4        14.4          12.1
## 5  Texas  Franklin      10599         72.3        13.7          13.3
## 6  Texas  Comanche      13623         72.2        15.0          12.3
## 7  Texas     Llano      19323         72.0        11.4          16.4
## 8  Texas      Leon      16819         69.4        14.5          15.7
## 9  Texas   Trinity      14405         68.8        17.8          12.1
## 10 Texas Red River      12567         68.5        17.4          13.8
## 11 Texas   Runnels      10445         67.7        19.9          12.1

Mutate

Calculating the number of government employees

In the video, you used the unemployment variable, which is a percentage, to calculate the number of unemployed people in each county. In this exercise, you’ll do the same with another percentage variable: public_work.

The code provided already selects the state, county, population, and public_work columns.

EXERCISES:

Use mutate() to add a column called public_workers to the dataset, with the number of people employed in public (government) work.

counties_selected <- counties %>%
  select(state, county, population, public_work)
counties_selected %>% top_n(n=10)
## Selecting by public_work
##           state                     county population public_work
## 1        Alaska Lake and Peninsula Borough       1474        51.6
## 2        Alaska  Yukon-Koyukuk Census Area       5644        61.7
## 3    California                     Lassen      32645        50.5
## 4        Hawaii                    Kalawao         85        64.1
## 5  North Dakota                      Sioux       4380        56.8
## 6  South Dakota                    Buffalo       2038        49.5
## 7  South Dakota                      Dewey       5579        49.2
## 8  South Dakota                       Todd       9942        55.0
## 9         Texas                     Kenedy        565        48.1
## 10    Wisconsin                  Menominee       4451        59.1
# Add a new column public_workers with the number of people employed in public work
counties_selected %>% mutate(public_workers = public_work * population / 100) %>% top_n(n=10)
## Selecting by public_workers
##         state         county population public_work public_workers
## 1     Arizona       Maricopa    4018143        11.7       470122.7
## 2  California    Los Angeles   10038388        11.5      1154414.6
## 3  California         Orange    3116069        10.2       317839.0
## 4  California      Riverside    2298032        14.9       342406.8
## 5  California     Sacramento    1465832        21.8       319551.4
## 6  California San Bernardino    2094769        16.7       349826.4
## 7  California      San Diego    3223096        14.8       477018.2
## 8    Illinois           Cook    5236393        11.5       602185.2
## 9    New York          Kings    2595259        14.4       373717.3
## 10      Texas         Harris    4356362        10.1       439992.6

Sort the new column in descending order.

# Sort in descending order of the public_workers column
counties_selected %>%
  mutate(public_workers = public_work * population / 100) %>% arrange(desc(public_workers)) %>% top_n(n=10)
## Selecting by public_workers
##         state         county population public_work public_workers
## 1  California    Los Angeles   10038388        11.5      1154414.6
## 2    Illinois           Cook    5236393        11.5       602185.2
## 3  California      San Diego    3223096        14.8       477018.2
## 4     Arizona       Maricopa    4018143        11.7       470122.7
## 5       Texas         Harris    4356362        10.1       439992.6
## 6    New York          Kings    2595259        14.4       373717.3
## 7  California San Bernardino    2094769        16.7       349826.4
## 8  California      Riverside    2298032        14.9       342406.8
## 9  California     Sacramento    1465832        21.8       319551.4
## 10 California         Orange    3116069        10.2       317839.0

Calculating the percentage of women in a county

The dataset includes columns for the total number (not percentage) of men and women in each county. You could use this, along with the population variable, to compute the fraction of men (or women) within each county.

In this exercise, you’ll select the relevant columns yourself.

EXERCISE:

Select the columns state, county, population, men, and women.

# Select the columns state, county, population, men, and women
counties_selected <- counties %>%
  select(state, county, population, men, women)

Select, mutate, filter, and arrange

In this exercise, you’ll put together everything you’ve learned in this chapter (select(), mutate(), filter() and arrange()), to find the counties with the highest proportion of men.

EXERCISE:

Select only the columns state, county, population, men, and women. Add a variable proportion_men with the fraction of the county’s population made up of men. Filter for counties with a population of at least ten thousand (10000). Arrange counties in descending order of their proportion of men.

counties %>%
  # Select the five columns 
  select(state, county, population, men, women) %>%
  # Add the proportion_men variable
  mutate(proportion_men = men/population) %>%
  # Filter for population of at least 10,000
  filter(population > 10000) %>%
  # Arrange proportion of men in descending order 
  arrange(desc(proportion_men)) %>% top_n(n = 10)
## Selecting by proportion_men
##         state         county population   men women proportion_men
## 1    Virginia         Sussex      11864  8130  3734      0.6852664
## 2  California         Lassen      32645 21818 10827      0.6683412
## 3     Georgia  Chattahoochee      11914  7940  3974      0.6664428
## 4   Louisiana West Feliciana      15415 10228  5187      0.6635096
## 5     Florida          Union      15191  9830  5361      0.6470937
## 6       Texas          Jones      19978 12652  7326      0.6332966
## 7    Missouri         DeKalb      12782  8080  4702      0.6321389
## 8       Texas        Madison      13838  8648  5190      0.6249458
## 9    Virginia    Greensville      11760  7303  4457      0.6210034
## 10      Texas       Anderson      57915 35469 22446      0.6124320

Aggregating Data

Count

Counting by region

The counties dataset contains columns for region, state, population, and the number of citizens, which we selected and saved as the counties_selected table. In this exercise, you’ll focus on the region column.

counties_selected <- counties %>%
  select(region, state, population, citizens)

EXERCISE:

Use count() to find the number of counties in each region, using a second argument to sort in descending order.

# Use count to find the number of counties in each region

counties_selected %>% count(region, sort = TRUE)
##          region    n
## 1         South 1420
## 2 North Central 1054
## 3          West  447
## 4     Northeast  217

Counting citizens by state

You can weigh your count by particular variables rather than finding the number of counties. In this case, you’ll find the number of citizens in each state.

EXERCISE:

Count the number of counties in each state, weighted based on the citizens column, and sorted in descending order.

# Find number of counties per state, weighted by citizens
counties_selected %>% count(state, wt = citizens, sort = TRUE) %>% top_n(n = 10)
## Selecting by n
##             state        n
## 1      California 24280349
## 2           Texas 16864864
## 3         Florida 13933052
## 4        New York 13531404
## 5    Pennsylvania  9710416
## 6        Illinois  8979999
## 7            Ohio  8709050
## 8        Michigan  7380136
## 9  North Carolina  7107998
## 10        Georgia  6978660

Mutating and counting

You can combine multiple verbs together to answer increasingly complicated questions of your data. For example: “What are the US states where the most people walk to work?”

You’ll use the walk column, which offers a percentage of people in each county that walk to work, to add a new column and count based on it.

EXERCISE:

Use mutate() to calculate and add a column called population_walk, containing the total number of people who walk to work in a county.

Use a (weighted and sorted) count() to find the total number of people who walk to work in each state.

# Use the full counties dataset here
counties <- read.csv("./Data/counties.csv")
counties_selected <- counties
counties_selected %>%
  # Add population_walk containing the total number of people who walk to work 
  mutate(population_walk = population * walk / 100) %>%
  # Count weighted by the new column
  count(state, wt = population_walk, sort = TRUE) %>% top_n(n=10)
## Selecting by n
##            state         n
## 1       New York 1237938.2
## 2     California 1017963.7
## 3   Pennsylvania  505397.2
## 4          Texas  430783.4
## 5       Illinois  400345.6
## 6  Massachusetts  316765.0
## 7        Florida  284722.9
## 8     New Jersey  273047.2
## 9           Ohio  266911.0
## 10    Washington  239764.3

The group by, summarize and ungroup verbs

Summarizing

The summarize() verb is very useful for collapsing a large dataset into a single observation.

counties_selected <- counties %>%
  select(county, population, income, unemployment)

EXERCISE:

Summarize the counties dataset to find the following columns: min_population (with the smallest population), max_unemployment (with the maximum unemployment), and average_income (with the mean of the income variable).

# Summarize to find minimum population, maximum unemployment, and average income

counties_selected %>%
  summarize(min_population = min(population),
            max_unemployment = max(unemployment),
            average_income = mean(income))
##   min_population max_unemployment average_income
## 1             85             29.4          46832

Summarizing by state

Another interesting column is land_area, which shows the land area in square miles. Here, you’ll summarize both population and land area by state, with the purpose of finding the density (in people per square mile).

counties_selected <- counties %>%
  select(state, county, population, land_area)

EXERCISE:

Group the data by state, and summarize to create the columns total_area (with total area in square miles) and total_population (with total population).

# Group by state and find the total area and population
counties_selected %>%
  group_by(state) %>%
  summarize(total_area = sum(land_area), total_population = sum(population)) %>%
  top_n(n = 10)
## `summarise()` ungrouping output (override with `.groups` argument)
## Selecting by total_population
## # A tibble: 10 x 3
##    state          total_area total_population
##    <chr>               <dbl>            <int>
##  1 California        155779.         38421464
##  2 Florida            53625.         19645772
##  3 Georgia            57514.         10006693
##  4 Illinois           55519.         12873761
##  5 Michigan           56539.          9900571
##  6 New York           47126.         19673174
##  7 North Carolina     48618.          9845333
##  8 Ohio               40861.         11575977
##  9 Pennsylvania       44743.         12779559
## 10 Texas             260563.         26538497

Add a density column with the people per square mile, then arrange in descending order.

counties_selected %>%
  group_by(state) %>%
  summarize(total_area = sum(land_area),
            total_population = sum(population)) %>%
  mutate(density = total_population / total_area) %>%
  arrange(desc(density)) %>% top_n(n=10)
## `summarise()` ungrouping output (override with `.groups` argument)
## Selecting by density
## # A tibble: 10 x 4
##    state         total_area total_population density
##    <chr>              <dbl>            <int>   <dbl>
##  1 New Jersey         7354.          8904413   1211.
##  2 Rhode Island       1034.          1053661   1019.
##  3 Massachusetts      7800.          6705586    860.
##  4 Connecticut        4842.          3593222    742.
##  5 Maryland           9707.          5930538    611.
##  6 Delaware           1949.           926454    475.
##  7 New York          47126.         19673174    417.
##  8 Florida           53625.         19645772    366.
##  9 Pennsylvania      44743.         12779559    286.
## 10 Ohio              40861.         11575977    283.

Summarizing by state and region

You can group by multiple columns instead of grouping by one. Here, you’ll practice aggregating by state and region, and notice how useful it is for performing multiple aggregations in a row.

counties_selected <- counties %>%
  select(region, state, county, population)

EXERCISE:

Summarize to find the total population, as a column called total_pop, in each combination of region and state.

# Summarize to find the total population
counties_selected %>%
  group_by(region, state) %>%
  summarize(total_pop = sum(population)) %>%
  top_n(n = 10)
## `summarise()` regrouping output by 'region' (override with `.groups` argument)
## Selecting by total_pop
## # A tibble: 39 x 3
## # Groups:   region [4]
##    region        state     total_pop
##    <chr>         <chr>         <int>
##  1 North Central Illinois   12873761
##  2 North Central Indiana     6568645
##  3 North Central Iowa        3093526
##  4 North Central Kansas      2892987
##  5 North Central Michigan    9900571
##  6 North Central Minnesota   5419171
##  7 North Central Missouri    6045448
##  8 North Central Nebraska    1869365
##  9 North Central Ohio       11575977
## 10 North Central Wisconsin   5742117
## # ... with 29 more rows

Notice the tibble is still grouped by region; use another summarize() step to calculate two new columns: the average state population in each region (average_pop) and the median state population in each region (median_pop).

# Calculate the average_pop and median_pop columns 
counties_selected %>%
  group_by(region, state) %>%
  summarize(total_pop = sum(population)) %>%
  summarize(average_pop = mean(total_pop),
            median_pop = median(total_pop))
## `summarise()` regrouping output by 'region' (override with `.groups` argument)
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 4 x 3
##   region        average_pop median_pop
##   <chr>               <dbl>      <dbl>
## 1 North Central    5627687.    5580644
## 2 Northeast        6221058.    3593222
## 3 South            7370486     4804098
## 4 West             5722755.    2798636

The top_n verb

Selecting a county from each region

Previously, you used the walk column, which gives the percentage of people in each county who walk to work, to add a new column and count the total number of people who walk to work in each county.

Now, you’re interested in finding the county within each region with the highest percentage of citizens who walk to work.

EXERCISE:

counties <- read.csv("./Data/counties.csv")
counties_selected <- counties
# Group by region and find the greatest number of citizens who walk to work
counties_selected %>%
  group_by(region) %>%
  top_n(1, walk)
## # A tibble: 4 x 41
## # Groups:   region [4]
##       X census_id state county region metro population    men  women hispanic
##   <int>     <int> <chr> <chr>  <chr>  <chr>      <int>  <int>  <int>    <dbl>
## 1    68      2013 Alas~ Aleut~ West   Nonm~       3304   2198   1106     12  
## 2  1857     36061 New ~ New Y~ North~ Metro    1629507 769434 860073     25.8
## 3  2014     38051 Nort~ McInt~ North~ Nonm~       2759   1341   1418      0.9
## 4  2929     51678 Virg~ Lexin~ South  Nonm~       7071   4372   2699      3.9
## # ... with 31 more variables: white <dbl>, black <dbl>, native <dbl>,
## #   asian <dbl>, pacific <dbl>, citizens <int>, income <int>, income_err <int>,
## #   income_per_cap <int>, income_per_cap_err <int>, poverty <dbl>,
## #   child_poverty <dbl>, professional <dbl>, service <dbl>, office <dbl>,
## #   construction <dbl>, production <dbl>, drive <dbl>, carpool <dbl>,
## #   transit <dbl>, walk <dbl>, other_transp <dbl>, work_at_home <dbl>,
## #   mean_commute <dbl>, employed <int>, private_work <dbl>, public_work <dbl>,
## #   self_employed <dbl>, family_work <dbl>, unemployment <dbl>, land_area <dbl>

Finding the highest-income state in each region

You’ve been learning to combine multiple dplyr verbs together. Here, you’ll combine group_by(), summarize(), and top_n() to find the state in each region with the highest income.

When you group by multiple columns and then summarize, it’s important to remember that the summarize “peels off” one of the groups, but leaves the rest on. For example, if you group_by(X, Y) then summarize, the result will still be grouped by X.
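
For instance, here is a minimal sketch of that peeling-off behaviour (the n_counties column is only for illustration); group_vars() reports which grouping remains after one summarize():

# Group by two columns, summarize once, then inspect the remaining grouping
counties %>%
  group_by(region, state) %>%
  summarize(n_counties = n()) %>%
  group_vars()
# With the default .groups behaviour this returns "region":
# summarize() peeled off the last grouping variable (state)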

EXERCISE:

Calculate the average income (as average_income) of counties within each region and state (notice the group_by() has already been done for you). Find the highest income state in each region.

counties_selected %>%
  group_by(region, state) %>%
  # Calculate average income
  summarize(average_income = mean(income)) %>%
  # Find the highest income state in each region
  top_n(1, average_income)
## `summarise()` regrouping output by 'region' (override with `.groups` argument)
## # A tibble: 4 x 3
## # Groups:   region [4]
##   region        state        average_income
##   <chr>         <chr>                 <dbl>
## 1 North Central North Dakota         55575.
## 2 Northeast     New Jersey           73014.
## 3 South         Maryland             69200.
## 4 West          Alaska               65125.

Using summarize, top_n, and count together

In this chapter, you’ve learned to use five dplyr verbs related to aggregation: count(), group_by(), summarize(), ungroup(), and top_n(). In this exercise, you’ll use all of them to answer a question: In how many states do more people live in metro areas than non-metro areas?

Recall that the metro column has one of the two values “Metro” (for high-density city areas) or “Nonmetro” (for suburban and country areas).

counties_selected <- counties %>%
  select(state, metro, population)

EXERCISE:

For each combination of state and metro, find the total population as total_pop.

# Find the total population for each combination of state and metro
counties_selected %>%
  group_by(state, metro) %>%
  summarize(total_pop = sum(population)) %>%
  top_n(n = 10)
## `summarise()` regrouping output by 'state' (override with `.groups` argument)
## Selecting by total_pop
## # A tibble: 97 x 3
## # Groups:   state [50]
##    state      metro    total_pop
##    <chr>      <chr>        <int>
##  1 Alabama    Metro      3671377
##  2 Alabama    Nonmetro   1159243
##  3 Alaska     Metro       494990
##  4 Alaska     Nonmetro    230471
##  5 Arizona    Metro      6295145
##  6 Arizona    Nonmetro    346783
##  7 Arkansas   Metro      1806867
##  8 Arkansas   Nonmetro   1151341
##  9 California Metro     37587429
## 10 California Nonmetro    834035
## # ... with 87 more rows

Extract the most populated row from each state, which will be either Metro or Nonmetro.

# Extract the most populated row for each state
counties_selected %>%
  group_by(state, metro) %>%
  summarize(total_pop = sum(population)) %>%
  top_n(n = 10) 
## `summarise()` regrouping output by 'state' (override with `.groups` argument)
## Selecting by total_pop
## # A tibble: 97 x 3
## # Groups:   state [50]
##    state      metro    total_pop
##    <chr>      <chr>        <int>
##  1 Alabama    Metro      3671377
##  2 Alabama    Nonmetro   1159243
##  3 Alaska     Metro       494990
##  4 Alaska     Nonmetro    230471
##  5 Arizona    Metro      6295145
##  6 Arizona    Nonmetro    346783
##  7 Arkansas   Metro      1806867
##  8 Arkansas   Nonmetro   1151341
##  9 California Metro     37587429
## 10 California Nonmetro    834035
## # ... with 87 more rows
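
Note that the pipeline above is the same top-ten preview as the previous step, which is why it still returns all 97 state/metro rows. A minimal sketch of what this step describes, keeping only the most populated row within each state (the same top_n(1, total_pop) call is used in the next step):

# Keep only the most populated row (Metro or Nonmetro) for each state
counties_selected %>%
  group_by(state, metro) %>%
  summarize(total_pop = sum(population)) %>%
  top_n(1, total_pop)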

Ungroup, then count how often Metro or Nonmetro appears to see how many states have more people living in those areas.

# Count the states with more people in Metro or Nonmetro areas
counties_selected %>%
  group_by(state, metro) %>%
  summarize(total_pop = sum(population)) %>%
  top_n(1, total_pop) %>%
  ungroup(metro) %>%
  count(metro) %>% top_n(n=10)
## `summarise()` regrouping output by 'state' (override with `.groups` argument)
## Selecting by n
## # A tibble: 50 x 3
## # Groups:   state [50]
##    state       metro     n
##    <chr>       <chr> <int>
##  1 Alabama     Metro     1
##  2 Alaska      Metro     1
##  3 Arizona     Metro     1
##  4 Arkansas    Metro     1
##  5 California  Metro     1
##  6 Colorado    Metro     1
##  7 Connecticut Metro     1
##  8 Delaware    Metro     1
##  9 Florida     Metro     1
## 10 Georgia     Metro     1
## # ... with 40 more rows
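
Because metro is not a grouping variable at that point (the grouping is by state), ungroup(metro) leaves the table grouped by state, so the count above produces one row per state instead of a tally of Metro versus Nonmetro. A minimal sketch of the intended final step, dropping all grouping before counting:

# Count in how many states more people live in Metro vs. Nonmetro areas
counties_selected %>%
  group_by(state, metro) %>%
  summarize(total_pop = sum(population)) %>%
  top_n(1, total_pop) %>%
  ungroup() %>%
  count(metro)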

Selecting and Transforming Data

Selecting

Selecting columns

Using the select verb, we can answer interesting questions about our dataset by focusing in on related groups of variables. The colon (:) is useful for getting many columns at a time.

EXERCISE:

Use glimpse() to examine all the variables in the counties table. Select the columns for state, county, population, and, using a colon, all five of the industry-related variables: there are five consecutive columns in the table related to the industry of people’s work (professional, service, office, construction, and production). Arrange the table in descending order of service to find which counties have the highest rates of working in the service industry.

# Glimpse the counties table
counties %>%
  glimpse() %>%
  # Arrange service in descending order
  arrange(desc(service)) %>%
  top_n(n = 10)
## Rows: 3,138
## Columns: 41
## $ X                  <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1...
## $ census_id          <int> 1001, 1003, 1005, 1007, 1009, 1011, 1013, 1015, ...
## $ state              <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Ala...
## $ county             <chr> "Autauga", "Baldwin", "Barbour", "Bibb", "Blount...
## $ region             <chr> "South", "South", "South", "South", "South", "So...
## $ metro              <chr> "Metro", "Metro", "Nonmetro", "Metro", "Metro", ...
## $ population         <int> 55221, 195121, 26932, 22604, 57710, 10678, 20354...
## $ men                <int> 26745, 95314, 14497, 12073, 28512, 5660, 9502, 5...
## $ women              <int> 28476, 99807, 12435, 10531, 29198, 5018, 10852, ...
## $ hispanic           <dbl> 2.6, 4.5, 4.6, 2.2, 8.6, 4.4, 1.2, 3.5, 0.4, 1.5...
## $ white              <dbl> 75.8, 83.1, 46.2, 74.5, 87.9, 22.2, 53.3, 73.0, ...
## $ black              <dbl> 18.5, 9.5, 46.7, 21.4, 1.5, 70.7, 43.8, 20.3, 40...
## $ native             <dbl> 0.4, 0.6, 0.2, 0.4, 0.3, 1.2, 0.1, 0.2, 0.2, 0.6...
## $ asian              <dbl> 1.0, 0.7, 0.4, 0.1, 0.1, 0.2, 0.4, 0.9, 0.8, 0.3...
## $ pacific            <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0...
## $ citizens           <int> 40725, 147695, 20714, 17495, 42345, 8057, 15581,...
## $ income             <int> 51281, 50254, 32964, 38678, 45813, 31938, 32229,...
## $ income_err         <int> 2391, 1263, 2973, 3995, 3141, 5884, 1793, 925, 2...
## $ income_per_cap     <int> 24974, 27317, 16824, 18431, 20532, 17580, 18390,...
## $ income_per_cap_err <int> 1080, 711, 798, 1618, 708, 2055, 714, 489, 1366,...
## $ poverty            <dbl> 12.9, 13.4, 26.7, 16.8, 16.7, 24.6, 25.4, 20.5, ...
## $ child_poverty      <dbl> 18.6, 19.2, 45.3, 27.9, 27.2, 38.4, 39.2, 31.6, ...
## $ professional       <dbl> 33.2, 33.1, 26.8, 21.5, 28.5, 18.8, 27.5, 27.3, ...
## $ service            <dbl> 17.0, 17.7, 16.1, 17.9, 14.1, 15.0, 16.6, 17.7, ...
## $ office             <dbl> 24.2, 27.1, 23.1, 17.8, 23.9, 19.7, 21.9, 24.2, ...
## $ construction       <dbl> 8.6, 10.8, 10.8, 19.0, 13.5, 20.1, 10.3, 10.5, 1...
## $ production         <dbl> 17.1, 11.2, 23.1, 23.7, 19.9, 26.4, 23.7, 20.4, ...
## $ drive              <dbl> 87.5, 84.7, 83.8, 83.2, 84.9, 74.9, 84.5, 85.3, ...
## $ carpool            <dbl> 8.8, 8.8, 10.9, 13.5, 11.2, 14.9, 12.4, 9.4, 11....
## $ transit            <dbl> 0.1, 0.1, 0.4, 0.5, 0.4, 0.7, 0.0, 0.2, 0.2, 0.2...
## $ walk               <dbl> 0.5, 1.0, 1.8, 0.6, 0.9, 5.0, 0.8, 1.2, 0.3, 0.6...
## $ other_transp       <dbl> 1.3, 1.4, 1.5, 1.5, 0.4, 1.7, 0.6, 1.2, 0.4, 0.7...
## $ work_at_home       <dbl> 1.8, 3.9, 1.6, 0.7, 2.3, 2.8, 1.7, 2.7, 2.1, 2.5...
## $ mean_commute       <dbl> 26.5, 26.4, 24.1, 28.8, 34.9, 27.5, 24.6, 24.1, ...
## $ employed           <int> 23986, 85953, 8597, 8294, 22189, 3865, 7813, 474...
## $ private_work       <dbl> 73.6, 81.5, 71.8, 76.8, 82.0, 79.5, 77.4, 74.1, ...
## $ public_work        <dbl> 20.9, 12.3, 20.8, 16.1, 13.5, 15.1, 16.2, 20.8, ...
## $ self_employed      <dbl> 5.5, 5.8, 7.3, 6.7, 4.2, 5.4, 6.2, 5.0, 2.8, 7.9...
## $ family_work        <dbl> 0.0, 0.4, 0.1, 0.4, 0.4, 0.0, 0.2, 0.1, 0.0, 0.5...
## $ unemployment       <dbl> 7.6, 7.5, 17.6, 8.3, 7.7, 18.0, 10.9, 12.3, 8.9,...
## $ land_area          <dbl> 594.44, 1589.78, 884.88, 622.58, 644.78, 622.81,...
## Selecting by land_area
##      X census_id      state                          county region    metro
## 1   84      2180     Alaska                Nome Census Area   West Nonmetro
## 2   91      2240     Alaska Southeast Fairbanks Census Area   West Nonmetro
## 3   82      2164     Alaska      Lake and Peninsula Borough   West Nonmetro
## 4   71      2050     Alaska              Bethel Census Area   West Nonmetro
## 5   92      2261     Alaska      Valdez-Cordova Census Area   West Nonmetro
## 6  221      6071 California                  San Bernardino   West    Metro
## 7   95      2290     Alaska       Yukon-Koyukuk Census Area   West Nonmetro
## 8   86      2188     Alaska        Northwest Arctic Borough   West Nonmetro
## 9   83      2170     Alaska       Matanuska-Susitna Borough   West    Metro
## 10  85      2185     Alaska             North Slope Borough   West Nonmetro
##    population     men   women hispanic white black native asian pacific
## 1        9854    5272    4582      2.0  16.5   0.7   69.8   1.2     0.2
## 2        7029    3909    3120      4.8  76.7   1.1   10.5   1.4     0.2
## 3        1474     731     743      2.2  21.0   0.5   67.6   3.2     0.7
## 4       17776    9351    8425      1.8  11.3   0.8   80.6   1.2     0.3
## 5        9617    5137    4480      4.4  70.1   0.0   14.8   3.2     0.4
## 6     2094769 1042053 1052716     51.1  31.2   8.1    0.4   6.5     0.3
## 7        5644    3038    2606      1.9  21.8   0.3   69.9   0.5     0.0
## 8        7732    4165    3567      1.8  11.8   0.3   81.5   0.8     0.0
## 9       96178   50205   45973      4.5  80.8   1.0    5.1   1.4     0.3
## 10       9667    6172    3495      3.6  32.2   0.5   47.4   5.3     5.6
##    citizens income income_err income_per_cap income_per_cap_err poverty
## 1      6377  48868       1988          19804                768    27.0
## 2      4807  62670       8682          30079               2460    14.5
## 3       974  50781       9403          23041               2272    16.8
## 4     11295  51012       2526          17746                768    25.2
## 5      7127  78810       5996          33293               3006    10.9
## 6   1292879  53433        571          21352                182    19.5
## 7      4028  38491       1654          20698                760    22.4
## 8      4953  63648       2525          21101               1158    23.8
## 9     68242  72983       1577          29913                679    10.0
## 10     6892  72576       8046          51085               3137    10.2
##    child_poverty professional service office construction production drive
## 1           32.5         34.7    22.4   22.1         11.4        9.4  25.8
## 2           21.1         30.4    21.2   15.6         21.5       11.2  52.3
## 3           23.5         34.1    20.7   13.4         16.3       15.5  21.2
## 4           31.0         35.2    19.3   24.4          8.1       13.0  26.5
## 5            8.9         35.4    18.9   19.0         20.5        6.2  63.2
## 6           27.1         28.7    18.9   25.4         10.2       16.7  77.8
## 7           30.0         35.7    18.7   19.5         15.6       10.5  28.7
## 8           30.2         33.8    18.2   22.4         14.9       10.8  16.5
## 9           12.5         32.8    16.6   22.3         17.9       10.5  72.0
## 10          14.7         24.0    15.0   18.4         25.0       17.5  20.1
##    carpool transit walk other_transp work_at_home mean_commute employed
## 1     10.0     0.3 36.9         22.7          4.3          7.0     3654
## 2     17.0     0.0 14.5          1.1         15.1         13.0     3038
## 3      6.8     1.1 36.2         32.4          2.4          6.0      657
## 4     12.7     0.5 33.0         22.6          4.8          7.4     6232
## 5      9.9     0.6 16.5          3.0          6.7          9.6     4415
## 6     13.3     1.7  1.8          1.2          4.2         30.4   829145
## 7      8.1     0.2 38.1         20.1          4.9          8.6     2077
## 8     10.4     0.4 46.9         21.2          4.6          7.4     2593
## 9     12.5     1.1  1.9          6.8          5.7         34.0    40854
## 10    17.0     2.8 37.9          7.9         14.3          8.4     5499
##    private_work public_work self_employed family_work unemployment land_area
## 1          50.6        44.0           5.2         0.2         16.5  22961.76
## 2          64.0        27.8           7.8         0.4          9.4  24768.81
## 3          42.2        51.6           6.1         0.2          9.8  23652.01
## 4          53.2        44.8           2.0         0.0         17.6  40570.00
## 5          56.6        32.9          10.1         0.4         10.7  34239.88
## 6          76.4        16.7           6.7         0.2         12.6  20056.94
## 7          33.3        61.7           5.1         0.0         18.2 145504.79
## 8          54.0        43.7           2.1         0.2         21.9  35572.58
## 9          70.6        21.5           7.8         0.2          9.8  24607.90
## 10         73.8        25.2           0.9         0.0          9.3  88695.41
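
The pipeline above glimpses and arranges but omits the select() step the exercise asks for, which is why the preview shows all 41 columns and top_n() falls back to the last column (land_area). A minimal sketch of the selection described in the exercise, using a colon to grab the five consecutive industry columns:

# Select the requested columns, then sort by service in descending order
counties %>%
  select(state, county, population, professional:production) %>%
  arrange(desc(service))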

Select helpers

In the video you learned about the select helper starts_with(). Another select helper is ends_with(), which finds the columns that end with a particular string.

EXERCISE:

Select the columns state, county, population, and all those that end with work. Filter just for the counties where at least 50% of the population is engaged in public work.

counties %>%
  # Filter for counties that have at least 50% of people engaged in public work
  filter(public_work >= 50)
##      X census_id        state                     county        region    metro
## 1   82      2164       Alaska Lake and Peninsula Borough          West Nonmetro
## 2   95      2290       Alaska  Yukon-Koyukuk Census Area          West Nonmetro
## 3  203      6035   California                     Lassen          West Nonmetro
## 4  547     15005       Hawaii                    Kalawao          West    Metro
## 5 2031     38085 North Dakota                      Sioux North Central    Metro
## 6 2419     46121 South Dakota                       Todd North Central Nonmetro
## 7 3083     55078    Wisconsin                  Menominee North Central Nonmetro
##   population   men women hispanic white black native asian pacific citizens
## 1       1474   731   743      2.2  21.0   0.5   67.6   3.2     0.7      974
## 2       5644  3038  2606      1.9  21.8   0.3   69.9   0.5     0.0     4028
## 3      32645 21818 10827     18.3  65.9   8.4    2.8   1.4     0.7    26125
## 4         85    42    43      4.7  37.6   0.0    0.0  21.2    35.3       80
## 5       4380  2215  2165      3.6  13.9   0.0   79.4   0.2     0.1     2735
## 6       9942  4862  5080      3.7  10.0   0.3   74.4   0.6     0.0     5915
## 7       4451  2185  2266      5.5  10.5   0.4   80.0   2.4     0.0     2978
##   income income_err income_per_cap income_per_cap_err poverty child_poverty
## 1  50781       9403          23041               2272    16.8          23.5
## 2  38491       1654          20698                760    22.4          30.0
## 3  51555       2789          19274               1210    16.4          23.6
## 4  66250       9909          46769               8448    15.2            NA
## 5  38068       4250          15557               1594    35.7          44.3
## 6  31128       4700          11616                931    46.4          56.0
## 7  35343       3363          14482               1173    35.2          53.0
##   professional service office construction production drive carpool transit
## 1         34.1    20.7   13.4         16.3       15.5  21.2     6.8     1.1
## 2         35.7    18.7   19.5         15.6       10.5  28.7     8.1     0.2
## 3         29.1    29.1   19.4         10.1       12.3  77.2    10.8     1.2
## 4         31.3    23.4   21.9          4.7       18.8  34.3     7.5     0.0
## 5         40.8    26.7   19.3          5.6        7.6  66.3    14.8     0.6
## 6         35.5    25.0   22.0         10.5        7.1  58.3    18.0     1.0
## 7         25.5    30.8   20.0          9.8       13.9  73.6    15.7     2.9
##   walk other_transp work_at_home mean_commute employed private_work public_work
## 1 36.2         32.4          2.4          6.0      657         42.2        51.6
## 2 38.1         20.1          4.9          8.6     2077         33.3        61.7
## 3  4.3          1.6          4.8         20.6     9082         42.6        50.5
## 4 40.3         14.9          3.0          9.2       64         25.0        64.1
## 5  5.5          0.8         12.0         17.1     1291         32.9        56.8
## 6  6.0          5.0         11.7         16.5     2901         34.4        55.0
## 7  2.8          0.1          4.9         17.3     1403         36.8        59.1
##   self_employed family_work unemployment land_area
## 1           6.1         0.2          9.8  23652.01
## 2           5.1         0.0         18.2 145504.79
## 3           6.8         0.1         10.9   4541.18
## 4          10.9         0.0          0.0     11.99
## 5          10.2         0.1         20.3   1094.09
## 6           9.8         0.8         25.4   1388.56
## 7           3.7         0.4         14.8    357.61
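
The code above applies only the filter; the exercise also asks to narrow the columns first. A minimal sketch combining ends_with() with the filter:

# Select state, county, population, and every column ending in "work",
# then keep counties where at least half the workforce is in public work
counties %>%
  select(state, county, population, ends_with("work")) %>%
  filter(public_work >= 50)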

The rename verb

Renaming a column after count

The rename() verb is often useful for changing the name of a column that comes out of another verb, such as count(). In this exercise, you’ll rename the n column from count() (which you learned about in Chapter 2) to something more descriptive.

EXERCISE:

Use count() to determine how many counties are in each state.

# Count the number of counties in each state
counties %>% count(state)
##             state   n
## 1         Alabama  67
## 2          Alaska  28
## 3         Arizona  15
## 4        Arkansas  75
## 5      California  58
## 6        Colorado  64
## 7     Connecticut   8
## 8        Delaware   3
## 9         Florida  67
## 10        Georgia 159
## 11         Hawaii   5
## 12          Idaho  44
## 13       Illinois 102
## 14        Indiana  92
## 15           Iowa  99
## 16         Kansas 105
## 17       Kentucky 120
## 18      Louisiana  64
## 19          Maine  16
## 20       Maryland  24
## 21  Massachusetts  14
## 22       Michigan  83
## 23      Minnesota  87
## 24    Mississippi  82
## 25       Missouri 115
## 26        Montana  56
## 27       Nebraska  93
## 28         Nevada  17
## 29  New Hampshire  10
## 30     New Jersey  21
## 31     New Mexico  33
## 32       New York  62
## 33 North Carolina 100
## 34   North Dakota  53
## 35           Ohio  88
## 36       Oklahoma  77
## 37         Oregon  36
## 38   Pennsylvania  67
## 39   Rhode Island   5
## 40 South Carolina  46
## 41   South Dakota  65
## 42      Tennessee  95
## 43          Texas 253
## 44           Utah  29
## 45        Vermont  14
## 46       Virginia 133
## 47     Washington  39
## 48  West Virginia  55
## 49      Wisconsin  72
## 50        Wyoming  23

Notice the n column in the output; use rename() to rename that to num_counties.

# Rename the n column to num_counties
counties %>%
  count(state) %>%
  rename(num_counties = n)
##             state num_counties
## 1         Alabama           67
## 2          Alaska           28
## 3         Arizona           15
## 4        Arkansas           75
## 5      California           58
## 6        Colorado           64
## 7     Connecticut            8
## 8        Delaware            3
## 9         Florida           67
## 10        Georgia          159
## 11         Hawaii            5
## 12          Idaho           44
## 13       Illinois          102
## 14        Indiana           92
## 15           Iowa           99
## 16         Kansas          105
## 17       Kentucky          120
## 18      Louisiana           64
## 19          Maine           16
## 20       Maryland           24
## 21  Massachusetts           14
## 22       Michigan           83
## 23      Minnesota           87
## 24    Mississippi           82
## 25       Missouri          115
## 26        Montana           56
## 27       Nebraska           93
## 28         Nevada           17
## 29  New Hampshire           10
## 30     New Jersey           21
## 31     New Mexico           33
## 32       New York           62
## 33 North Carolina          100
## 34   North Dakota           53
## 35           Ohio           88
## 36       Oklahoma           77
## 37         Oregon           36
## 38   Pennsylvania           67
## 39   Rhode Island            5
## 40 South Carolina           46
## 41   South Dakota           65
## 42      Tennessee           95
## 43          Texas          253
## 44           Utah           29
## 45        Vermont           14
## 46       Virginia          133
## 47     Washington           39
## 48  West Virginia           55
## 49      Wisconsin           72
## 50        Wyoming           23

Renaming a column as part of a select

rename() isn’t the only way you can choose a new name for a column: you can also choose a name as part of a select().

EXERCISE:

Select the columns state, county, and poverty from the counties dataset; in the same step, rename the poverty column to poverty_rate.

# Select state, county, and poverty as poverty_rate
counties %>% 
    select(state, county, poverty_rate = poverty) %>% top_n(n=10)
## Selecting by poverty_rate
##           state       county poverty_rate
## 1       Alabama       Sumter         42.7
## 2       Georgia         Clay         42.2
## 3      Kentucky        Wolfe         43.0
## 4     Louisiana East Carroll         48.0
## 5   Mississippi    Claiborne         42.0
## 6   Mississippi       Holmes         43.4
## 7   Mississippi    Jefferson         48.7
## 8  South Dakota       Corson         45.6
## 9  South Dakota     Mellette         45.3
## 10 South Dakota         Todd         46.4

The transmute verb

Using transmute

As you learned in the video, the transmute verb allows you to control which variables you keep, which variables you calculate, and which variables you drop.

EXERCISE:

Keep only the state, county, and population columns, and add a new column, density, that contains the population per land_area. Filter for only counties with a population greater than one million. Sort the table in ascending order of density.

counties %>%
  # Keep the state, county, and population columns, and add a density column
  transmute(state, county, population, density = population / land_area) %>%
  # Filter for counties with a population greater than one million
  filter(population > 1000000) %>%
  # Sort density in ascending order
  arrange(density)
##            state         county population    density
## 1     California San Bernardino    2094769   104.4411
## 2         Nevada          Clark    2035572   257.9472
## 3     California      Riverside    2298032   318.8841
## 4        Arizona       Maricopa    4018143   436.7480
## 5        Florida     Palm Beach    1378806   699.9868
## 6     California      San Diego    3223096   766.1943
## 7     Washington           King    2045756   966.9999
## 8          Texas         Travis    1121645  1132.7459
## 9        Florida   Hillsborough    1302884  1277.0743
## 10       Florida         Orange    1229039  1360.4142
## 11       Florida     Miami-Dade    2639042  1390.6382
## 12      Michigan        Oakland    1229503  1417.0332
## 13    California    Santa Clara    1868149  1448.0653
## 14          Utah      Salt Lake    1078958  1453.5728
## 15         Texas          Bexar    1825502  1472.3928
## 16    California     Sacramento    1465832  1519.5638
## 17       Florida        Broward    1843152  1523.5305
## 18    California   Contra Costa    1096068  1530.9495
## 19      New York        Suffolk    1501373  1646.1521
## 20  Pennsylvania      Allegheny    1231145  1686.3152
## 21 Massachusetts      Middlesex    1556116  1902.7610
## 22      Missouri      St. Louis    1001327  1971.8925
## 23      Maryland     Montgomery    1017859  2071.9776
## 24    California        Alameda    1584983  2144.7092
## 25     Minnesota       Hennepin    1197776  2163.6518
## 26         Texas        Tarrant    1914526  2216.8873
## 27          Ohio       Franklin    1215761  2284.4492
## 28    California    Los Angeles   10038388  2473.8011
## 29         Texas         Harris    4356362  2557.3309
## 30          Ohio       Cuyahoga    1263189  2762.9410
## 31         Texas         Dallas    2485003  2852.1291
## 32      Virginia        Fairfax    1128722  2886.9785
## 33      Michigan          Wayne    1778969  2906.4322
## 34    California         Orange    3116069  3941.5472
## 35      New York         Nassau    1354612  4757.6988
## 36      Illinois           Cook    5236393  5539.2223
## 37  Pennsylvania   Philadelphia    1555072 11596.3609
## 38      New York         Queens    2301139 21202.7919
## 39      New York          Bronx    1428357 33927.7197
## 40      New York          Kings    2595259 36645.8486
## 41      New York       New York    1629507 71375.6899

Choosing among the four verbs

In this chapter you’ve learned about the four verbs: select, mutate, transmute, and rename. Here, you’ll choose the appropriate verb for each situation. You won’t need to change anything inside the parentheses.

EXERCISES:

Choose the right verb for changing the name of the unemployment column to unemployment_rate

# Change the name of the unemployment column
CAMTFV <- counties %>%
  rename(unemployment_rate = unemployment) %>% top_n(n=10)
## Selecting by land_area
glimpse(CAMTFV)
## Rows: 10
## Columns: 41
## $ X                  <int> 71, 82, 83, 84, 85, 86, 91, 92, 95, 221
## $ census_id          <int> 2050, 2164, 2170, 2180, 2185, 2188, 2240, 2261, ...
## $ state              <chr> "Alaska", "Alaska", "Alaska", "Alaska", "Alaska"...
## $ county             <chr> "Bethel Census Area", "Lake and Peninsula Boroug...
## $ region             <chr> "West", "West", "West", "West", "West", "West", ...
## $ metro              <chr> "Nonmetro", "Nonmetro", "Metro", "Nonmetro", "No...
## $ population         <int> 17776, 1474, 96178, 9854, 9667, 7732, 7029, 9617...
## $ men                <int> 9351, 731, 50205, 5272, 6172, 4165, 3909, 5137, ...
## $ women              <int> 8425, 743, 45973, 4582, 3495, 3567, 3120, 4480, ...
## $ hispanic           <dbl> 1.8, 2.2, 4.5, 2.0, 3.6, 1.8, 4.8, 4.4, 1.9, 51.1
## $ white              <dbl> 11.3, 21.0, 80.8, 16.5, 32.2, 11.8, 76.7, 70.1, ...
## $ black              <dbl> 0.8, 0.5, 1.0, 0.7, 0.5, 0.3, 1.1, 0.0, 0.3, 8.1
## $ native             <dbl> 80.6, 67.6, 5.1, 69.8, 47.4, 81.5, 10.5, 14.8, 6...
## $ asian              <dbl> 1.2, 3.2, 1.4, 1.2, 5.3, 0.8, 1.4, 3.2, 0.5, 6.5
## $ pacific            <dbl> 0.3, 0.7, 0.3, 0.2, 5.6, 0.0, 0.2, 0.4, 0.0, 0.3
## $ citizens           <int> 11295, 974, 68242, 6377, 6892, 4953, 4807, 7127,...
## $ income             <int> 51012, 50781, 72983, 48868, 72576, 63648, 62670,...
## $ income_err         <int> 2526, 9403, 1577, 1988, 8046, 2525, 8682, 5996, ...
## $ income_per_cap     <int> 17746, 23041, 29913, 19804, 51085, 21101, 30079,...
## $ income_per_cap_err <int> 768, 2272, 679, 768, 3137, 1158, 2460, 3006, 760...
## $ poverty            <dbl> 25.2, 16.8, 10.0, 27.0, 10.2, 23.8, 14.5, 10.9, ...
## $ child_poverty      <dbl> 31.0, 23.5, 12.5, 32.5, 14.7, 30.2, 21.1, 8.9, 3...
## $ professional       <dbl> 35.2, 34.1, 32.8, 34.7, 24.0, 33.8, 30.4, 35.4, ...
## $ service            <dbl> 19.3, 20.7, 16.6, 22.4, 15.0, 18.2, 21.2, 18.9, ...
## $ office             <dbl> 24.4, 13.4, 22.3, 22.1, 18.4, 22.4, 15.6, 19.0, ...
## $ construction       <dbl> 8.1, 16.3, 17.9, 11.4, 25.0, 14.9, 21.5, 20.5, 1...
## $ production         <dbl> 13.0, 15.5, 10.5, 9.4, 17.5, 10.8, 11.2, 6.2, 10...
## $ drive              <dbl> 26.5, 21.2, 72.0, 25.8, 20.1, 16.5, 52.3, 63.2, ...
## $ carpool            <dbl> 12.7, 6.8, 12.5, 10.0, 17.0, 10.4, 17.0, 9.9, 8....
## $ transit            <dbl> 0.5, 1.1, 1.1, 0.3, 2.8, 0.4, 0.0, 0.6, 0.2, 1.7
## $ walk               <dbl> 33.0, 36.2, 1.9, 36.9, 37.9, 46.9, 14.5, 16.5, 3...
## $ other_transp       <dbl> 22.6, 32.4, 6.8, 22.7, 7.9, 21.2, 1.1, 3.0, 20.1...
## $ work_at_home       <dbl> 4.8, 2.4, 5.7, 4.3, 14.3, 4.6, 15.1, 6.7, 4.9, 4.2
## $ mean_commute       <dbl> 7.4, 6.0, 34.0, 7.0, 8.4, 7.4, 13.0, 9.6, 8.6, 30.4
## $ employed           <int> 6232, 657, 40854, 3654, 5499, 2593, 3038, 4415, ...
## $ private_work       <dbl> 53.2, 42.2, 70.6, 50.6, 73.8, 54.0, 64.0, 56.6, ...
## $ public_work        <dbl> 44.8, 51.6, 21.5, 44.0, 25.2, 43.7, 27.8, 32.9, ...
## $ self_employed      <dbl> 2.0, 6.1, 7.8, 5.2, 0.9, 2.1, 7.8, 10.1, 5.1, 6.7
## $ family_work        <dbl> 0.0, 0.2, 0.2, 0.2, 0.0, 0.2, 0.4, 0.4, 0.0, 0.2
## $ unemployment_rate  <dbl> 17.6, 9.8, 9.8, 16.5, 9.3, 21.9, 9.4, 10.7, 18.2...
## $ land_area          <dbl> 40570.00, 23652.01, 24607.90, 22961.76, 88695.41...

Choose the right verb for keeping only the columns state, county, and the ones containing poverty.

# Keep the state and county columns, and the columns containing poverty
counties %>%
  select(state, county, contains("poverty")) %>% top_n(n=10)
## Selecting by child_poverty
##           state        county poverty child_poverty
## 1       Alabama        Greene    40.2          67.0
## 2       Georgia       Stewart    38.4          61.6
## 3     Louisiana  East Carroll    48.0          66.9
## 4   Mississippi        Holmes    43.4          60.9
## 5   Mississippi     Issaquena    40.3          63.7
## 6   Mississippi     Jefferson    48.7          63.2
## 7   Mississippi       Quitman    39.5          60.8
## 8       Montana Golden Valley    26.6          60.1
## 9  South Dakota      Mellette    45.3          72.3
## 10        Texas      Hudspeth    40.3          63.8

Calculate a new column called fraction_women with the fraction of the population made up of women, without dropping any columns.

# Calculate the fraction_women column without dropping the other columns
c1 <- counties %>%
  mutate(fraction_women = women / population) %>%
  top_n(n = 10)
## Selecting by fraction_women
glimpse(c1)
## Rows: 10
## Columns: 42
## $ X                  <int> 60, 501, 1462, 1540, 1800, 2589, 2861, 2923, 293...
## $ census_id          <int> 1119, 13235, 28125, 29117, 35011, 48137, 51091, ...
## $ state              <chr> "Alabama", "Georgia", "Mississippi", "Missouri",...
## $ county             <chr> "Sumter", "Pulaski", "Sharkey", "Livingston", "D...
## $ region             <chr> "South", "South", "South", "North Central", "Wes...
## $ metro              <chr> "Nonmetro", "Metro", "Nonmetro", "Nonmetro", "No...
## $ population         <int> 13341, 11590, 4805, 15042, 2020, 1906, 2244, 845...
## $ men                <int> 5905, 4866, 2139, 6787, 907, 854, 1003, 3766, 16...
## $ women              <int> 7436, 6724, 2666, 8255, 1113, 1052, 1241, 4691, ...
## $ hispanic           <dbl> 0.4, 4.7, 0.6, 1.5, 45.1, 51.0, 0.0, 1.0, 2.4, 2.6
## $ white              <dbl> 24.8, 59.1, 27.7, 93.4, 50.2, 48.8, 99.5, 38.2, ...
## $ black              <dbl> 72.4, 33.6, 71.3, 2.6, 0.0, 0.2, 0.0, 58.6, 3.4,...
## $ native             <dbl> 0.4, 0.0, 0.0, 0.7, 0.0, 0.1, 0.4, 0.0, 0.1, 0.2
## $ asian              <dbl> 1.7, 0.1, 0.4, 0.5, 0.0, 0.0, 0.0, 0.6, 5.8, 1.0
## $ pacific            <dbl> 0.0, 0.0, 0.0, 0.3, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0
## $ citizens           <int> 10471, 9226, 3566, 11773, 1509, 1480, 1920, 6272...
## $ income             <int> 19501, 38750, 30525, 43350, 32500, 41813, 43914,...
## $ income_err         <int> 2420, 6998, 4079, 3719, 7616, 16850, 5532, 4457,...
## $ income_per_cap     <int> 12887, 17656, 15993, 22557, 24370, 27893, 26809,...
## $ income_per_cap_err <int> 809, 1823, 1369, 1584, 9885, 9683, 2995, 1970, 2...
## $ poverty            <dbl> 42.7, 19.2, 29.3, 16.6, 22.8, 10.8, 12.6, 21.2, ...
## $ child_poverty      <dbl> 57.1, 34.6, 33.9, 23.0, 14.4, 20.8, 21.5, 32.4, ...
## $ professional       <dbl> 23.8, 32.6, 36.4, 30.4, 27.5, 22.7, 28.7, 30.1, ...
## $ service            <dbl> 29.9, 22.8, 27.4, 18.9, 27.9, 20.9, 12.5, 29.9, ...
## $ office             <dbl> 19.7, 17.1, 21.3, 22.9, 15.3, 26.9, 28.7, 17.3, ...
## $ construction       <dbl> 7.1, 13.0, 11.6, 9.0, 22.0, 22.0, 26.0, 9.6, 6.5...
## $ production         <dbl> 19.6, 14.4, 3.3, 18.7, 7.3, 7.5, 4.0, 13.1, 10.9...
## $ drive              <dbl> 86.2, 85.4, 82.3, 79.7, 75.7, 56.2, 72.1, 81.7, ...
## $ carpool            <dbl> 8.4, 7.4, 13.6, 8.8, 12.1, 14.4, 8.5, 10.2, 2.8,...
## $ transit            <dbl> 0.3, 0.0, 1.7, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5
## $ walk               <dbl> 2.8, 5.1, 1.7, 3.1, 2.8, 5.7, 13.4, 4.1, 2.7, 5.1
## $ other_transp       <dbl> 0.3, 0.2, 0.3, 1.6, 2.1, 7.2, 1.0, 0.4, 0.0, 1.9
## $ work_at_home       <dbl> 2.0, 1.9, 0.4, 6.3, 7.4, 16.5, 5.0, 3.5, 1.5, 3.1
## $ mean_commute       <dbl> 24.3, 21.3, 20.0, 18.1, 12.7, 23.2, 21.3, 24.2, ...
## $ employed           <int> 4416, 3949, 1678, 6641, 619, 912, 1006, 3429, 18...
## $ private_work       <dbl> 71.1, 69.3, 71.7, 72.6, 66.2, 57.2, 61.8, 63.9, ...
## $ public_work        <dbl> 23.8, 24.1, 23.5, 17.2, 25.5, 23.0, 21.3, 28.9, ...
## $ self_employed      <dbl> 5.1, 6.5, 4.7, 9.7, 8.2, 19.7, 16.9, 6.3, 2.4, 5.9
## $ family_work        <dbl> 0.0, 0.1, 0.1, 0.5, 0.0, 0.0, 0.0, 0.9, 0.0, 0.2
## $ unemployment       <dbl> 16.8, 7.4, 22.1, 5.1, 5.2, 4.9, 1.1, 13.7, 11.1,...
## $ land_area          <dbl> 903.89, 249.03, 431.72, 532.33, 2322.62, 2117.86...
## $ fraction_women     <dbl> 0.5573795, 0.5801553, 0.5548387, 0.5487967, 0.55...

Keep only three columns: the state, county, and employed / population, which you’ll call employment_rate.

# Keep only the state, county, and employment_rate columns
c2 <- counties %>%
  transmute(state, county, employment_rate = employed / population) %>% top_n(n=10)
## Selecting by employment_rate
glimpse(c2)
## Rows: 10
## Columns: 3
## $ state           <chr> "Alaska", "Alaska", "Alaska", "Colorado", "Hawaii",...
## $ county          <chr> "Aleutians East Borough", "Aleutians West Census Ar...
## $ employment_rate <dbl> 0.7624092, 0.6662562, 0.6766767, 0.6278853, 0.75294...

Case Study: The babynames Dataset

The babynames data

Filtering and arranging for one year

The dplyr verbs you’ve learned are useful for exploring data. For instance, you could find out the most common names in a particular year.

EXERCISE:

Filter for only the year 1990. Sort the table in descending order of the number of babies born.

babynames <- read.csv("./Data/babynames.csv", sep = ",")
babynames %>%
  # Filter for the year 1990
  filter(year == 1990) %>%
  # Sort the number column in descending order
  arrange(desc(number)) %>%
  top_n(n = 10)
## Selecting by number
##         X year        name number
## 1  194359 1990     Michael  65560
## 2  184122 1990 Christopher  52520
## 3  189459 1990     Jessica  46615
## 4  181986 1990      Ashley  45797
## 5  194021 1990     Matthew  44925
## 6  189892 1990      Joshua  43382
## 7  182924 1990    Brittany  36650
## 8  181103 1990      Amanda  34504
## 9  184851 1990      Daniel  33963
## 10 185118 1990       David  33862

Using top_n with babynames

You saw that you could use filter() and arrange() to find the most common names in one year. However, you could also use group_by and top_n to find the most common name in every year.

EXERCISE:

Use group_by and top_n to find the most common name for US babies in each year.

# Find the most common name in each year
babynames %>%
  group_by(year) %>%
  top_n(1, number)
## # A tibble: 28 x 4
## # Groups:   year [28]
##        X  year name  number
##    <int> <int> <chr>  <int>
##  1   956  1880 John    9701
##  2  3311  1885 Mary    9166
##  3  5643  1890 Mary   12113
##  4  8350  1895 Mary   13493
##  5 11506  1900 Mary   16781
##  6 14852  1905 Mary   16135
##  7 18717  1910 Mary   22947
##  8 25369  1915 Mary   58346
##  9 34119  1920 Mary   71175
## 10 43239  1925 Mary   70857
## # ... with 18 more rows

Visualizing names with ggplot2

The dplyr package is very useful for exploring data, but it’s especially useful when combined with other tidyverse packages like ggplot2.

EXERCISE:

Filter for only the names Steven, Thomas, and Matthew, and assign it to an object called selected_names.

# Filter for the names Steven, Thomas, and Matthew 
selected_names <- babynames %>%
  filter(name %in% c("Steven", "Thomas", "Matthew"))

Visualize those three names as a line plot over time, with each name represented by a different color.

# Filter for the names Steven, Thomas, and Matthew 
selected_names <- babynames %>%
  filter(name %in% c("Steven", "Thomas", "Matthew"))
head(selected_names)
##      X year    name number
## 1 1271 1880 Matthew    113
## 2 1687 1880  Steven     17
## 3 1731 1880  Thomas   2542
## 4 3325 1885 Matthew    111
## 5 3802 1885  Steven     21
## 6 3848 1885  Thomas   2275
# Plot the names using a different color for each name
ggplot(selected_names, aes(x = year, y = number, color = name)) +
  geom_line()

Grouped mutates

Finding the year each name is most common

In an earlier video, you learned how to filter for a particular name to determine the frequency of that name over time. Now, you’re going to explore which year each name was the most common.

To do this, you’ll be combining the grouped mutate approach with a top_n.

EXERCISES:

First, calculate the total number of people born in that year in this dataset as year_total. Next, use year_total to calculate the fraction of people born in each year that have each name.

# Calculate the fraction of people born each year with the same name
babynames %>%
  group_by(year) %>%
  mutate(year_total = sum(number)) %>%
  ungroup() %>%
  mutate(fraction = number/year_total)
## # A tibble: 332,595 x 6
##        X  year name    number year_total  fraction
##    <int> <int> <chr>    <int>      <int>     <dbl>
##  1     1  1880 Aaron      102     201478 0.000506 
##  2     2  1880 Ab           5     201478 0.0000248
##  3     3  1880 Abbie       71     201478 0.000352 
##  4     4  1880 Abbott       5     201478 0.0000248
##  5     5  1880 Abby         6     201478 0.0000298
##  6     6  1880 Abe         50     201478 0.000248 
##  7     7  1880 Abel         9     201478 0.0000447
##  8     8  1880 Abigail     12     201478 0.0000596
##  9     9  1880 Abner       27     201478 0.000134 
## 10    10  1880 Abraham     81     201478 0.000402 
## # ... with 332,585 more rows

Now use your newly calculated fraction column, in combination with top_n(), to identify the year each name is most common.

# Calculate the fraction of people born each year with the same name
babynames %>%
  group_by(year) %>%
  mutate(year_total = sum(number)) %>%
  ungroup() %>%
  mutate(fraction = number / year_total) %>%
# Find the year each name is most common
  group_by(name) %>%
  top_n(1, fraction)
## # A tibble: 48,040 x 6
## # Groups:   name [48,040]
##        X  year name      number year_total  fraction
##    <int> <int> <chr>      <int>      <int>     <dbl>
##  1     4  1880 Abbott         5     201478 0.0000248
##  2     6  1880 Abe           50     201478 0.000248 
##  3     9  1880 Abner         27     201478 0.000134 
##  4    21  1880 Adelbert      28     201478 0.000139 
##  5    25  1880 Adella        26     201478 0.000129 
##  6    30  1880 Adolf          6     201478 0.0000298
##  7    31  1880 Adolph        93     201478 0.000462 
##  8    37  1880 Agustus        5     201478 0.0000248
##  9    42  1880 Albert      1493     201478 0.00741  
## 10    44  1880 Albertina      7     201478 0.0000347
## # ... with 48,030 more rows

Adding the total and maximum for each name

In the video, you learned how you could group by the year and use mutate() to add a total for that year.

In these exercises, you’ll learn to normalize by a different, but also interesting metric: you’ll divide each name by the maximum for that name. This means that every name will peak at 1.

Once you add new columns, the result will still be grouped by name. This splits it into 48,000 groups, which actually makes later steps like mutates slower.

EXERCISES:

Use a grouped mutate to add two columns: name_total, with the total number of babies born with that name in the entire dataset, and name_max, with the highest number of babies born with that name in any single year.

# Add columns name_total and name_max for each name
head(babynames)
##   X year   name number
## 1 1 1880  Aaron    102
## 2 2 1880     Ab      5
## 3 3 1880  Abbie     71
## 4 4 1880 Abbott      5
## 5 5 1880   Abby      6
## 6 6 1880    Abe     50
babynames %>%
  group_by(name) %>%
  mutate(name_total = sum(number), name_max = max(number))
## # A tibble: 332,595 x 6
## # Groups:   name [48,040]
##        X  year name    number name_total name_max
##    <int> <int> <chr>    <int>      <int>    <int>
##  1     1  1880 Aaron      102     114739    14635
##  2     2  1880 Ab           5         77       31
##  3     3  1880 Abbie       71       4330      445
##  4     4  1880 Abbott       5        217       51
##  5     5  1880 Abby         6      11272     1753
##  6     6  1880 Abe         50       1832      271
##  7     7  1880 Abel         9      10565     3245
##  8     8  1880 Abigail     12      72600    15762
##  9     9  1880 Abner       27       1552      199
## 10    10  1880 Abraham     81      17882     2449
## # ... with 332,585 more rows

Add another step to ungroup the table. Add a column called fraction_max, with the number in the year divided by the maximum for that name.

babynames %>%
  group_by(name) %>%
  mutate(name_total = sum(number),
         name_max = max(number)) %>%
  # Ungroup the table
  ungroup() %>%
  # Add the fraction_max column: the number divided by that name's maximum
  mutate(fraction_max = number / name_max)
## # A tibble: 332,595 x 7
##        X  year name    number name_total name_max fraction_max
##    <int> <int> <chr>    <int>      <int>    <int>        <dbl>
##  1     1  1880 Aaron      102     114739    14635     0.00697 
##  2     2  1880 Ab           5         77       31     0.161   
##  3     3  1880 Abbie       71       4330      445     0.160   
##  4     4  1880 Abbott       5        217       51     0.0980  
##  5     5  1880 Abby         6      11272     1753     0.00342 
##  6     6  1880 Abe         50       1832      271     0.185   
##  7     7  1880 Abel         9      10565     3245     0.00277 
##  8     8  1880 Abigail     12      72600    15762     0.000761
##  9     9  1880 Abner       27       1552      199     0.136   
## 10    10  1880 Abraham     81      17882     2449     0.0331  
## # ... with 332,585 more rows

Visualizing the normalized change in popularity

You picked a few names and calculated each of them as a fraction of their peak. This is a type of “normalizing” a name, where you’re focused on the relative change within each name rather than the overall popularity of the name.

In this exercise, you’ll visualize the normalized popularity of each name. Your work from the previous exercise, names_normalized, has been provided for you.

names_normalized <- babynames %>%
                     group_by(name) %>%
                     mutate(name_total = sum(number),
                            name_max = max(number)) %>%
                     ungroup() %>%
                     mutate(fraction_max = number / name_max)

EXERCISES:

Filter the names_normalized table to limit it to the three names Steven, Thomas, and Matthew.

# Filter for the names Steven, Thomas, and Matthew
names_filtered <- names_normalized %>%
  filter(name %in% c("Steven", "Thomas", "Matthew"))

Visualize fraction_max for those names over time.

# Visualize these names over time
ggplot(names_filtered, aes(x = year, y = fraction_max, color = name)) +
  geom_line()

5.- Joining Data with dplyr

Joining Tables

inner_join(): Joining parts and part categories

The inner_join verb is the key to bringing tables together. To use it, you need to provide the two tables that must be joined and the columns on which they should be joined.

In this exercise, you’ll join a list of LEGO parts, available as parts, with these parts’ corresponding categories, available as part_categories. For example, the part Sticker Sheet 1 for Set 1650-1 is from the Stickers part category. You can join these tables to see all parts’ categories!

EXERCISE:

Add the correct joining verb, the name of the second table, and the joining column for the second table.

# Add the correct verb, table, and joining column
parts <- readRDS("./Data/parts.rds")
part_categories<- readRDS("./Data/part_categories.rds")

parts %>% inner_join(part_categories, by = c("part_cat_id" = "id"))
## # A tibble: 17,501 x 4
##    part_num   name.x                                  part_cat_id name.y        
##    <chr>      <chr>                                         <dbl> <chr>         
##  1 0901       Baseplate 16 x 30 with Set 080 Yellow ~           1 Baseplates    
##  2 0902       Baseplate 16 x 24 with Set 080 Small W~           1 Baseplates    
##  3 0903       Baseplate 16 x 24 with Set 080 Red Hou~           1 Baseplates    
##  4 0904       Baseplate 16 x 24 with Set 080 Large W~           1 Baseplates    
##  5 1          Homemaker Bookcase 2 x 4 x 4                      7 Containers    
##  6 10016414   Sticker Sheet #1 for 41055-1                     58 Stickers      
##  7 10026stk01 Sticker for Set 10026 - (44942/4184185)          58 Stickers      
##  8 10039      Pullback Motor 8 x 4 x 2/3                       44 Mechanical    
##  9 10048      Minifig Hair Tousled                             65 Minifig Headw~
## 10 10049      Minifig Shield Broad with Spiked Botto~          27 Minifig Acces~
## # ... with 17,491 more rows

EXERCISE:

Now, use the suffix argument to add "_part" and "_category" suffixes to replace the name.x and name.y fields.

# Use the suffix argument to replace .x and .y suffixes
parts %>%
  inner_join(part_categories,
             by = c("part_cat_id" = "id"),
             suffix = c("_part", "_category"))
## # A tibble: 17,501 x 4
##    part_num   name_part                             part_cat_id name_category   
##    <chr>      <chr>                                       <dbl> <chr>           
##  1 0901       Baseplate 16 x 30 with Set 080 Yello~           1 Baseplates      
##  2 0902       Baseplate 16 x 24 with Set 080 Small~           1 Baseplates      
##  3 0903       Baseplate 16 x 24 with Set 080 Red H~           1 Baseplates      
##  4 0904       Baseplate 16 x 24 with Set 080 Large~           1 Baseplates      
##  5 1          Homemaker Bookcase 2 x 4 x 4                    7 Containers      
##  6 10016414   Sticker Sheet #1 for 41055-1                   58 Stickers        
##  7 10026stk01 Sticker for Set 10026 - (44942/41841~          58 Stickers        
##  8 10039      Pullback Motor 8 x 4 x 2/3                     44 Mechanical      
##  9 10048      Minifig Hair Tousled                           65 Minifig Headwear
## 10 10049      Minifig Shield Broad with Spiked Bot~          27 Minifig Accesso~
## # ... with 17,491 more rows

Joining with a one-to-many relationship: Joining parts and inventories

The LEGO data has many tables that can be joined together. Oftentimes, some of the things you care about may be a few tables away (we’ll get to that later in the course). For now, we know that parts is a list of all LEGO parts, and a new table, inventory_parts, has some additional information about those parts, such as the color_id of each part you would find in a specific LEGO kit.

Let’s join these two tables together to observe how joining parts with inventory_parts increases the size of your table because of the one-to-many relationship that exists between these two tables.

EXERCISE:

Connect the parts and inventory_parts tables by their part numbers using an inner join.

# Combine the parts and inventory_parts tables
parts
## # A tibble: 17,501 x 3
##    part_num   name                                                   part_cat_id
##    <chr>      <chr>                                                        <dbl>
##  1 0901       Baseplate 16 x 30 with Set 080 Yellow House Print                1
##  2 0902       Baseplate 16 x 24 with Set 080 Small White House Print           1
##  3 0903       Baseplate 16 x 24 with Set 080 Red House Print                   1
##  4 0904       Baseplate 16 x 24 with Set 080 Large White House Print           1
##  5 1          Homemaker Bookcase 2 x 4 x 4                                     7
##  6 10016414   Sticker Sheet #1 for 41055-1                                    58
##  7 10026stk01 Sticker for Set 10026 - (44942/4184185)                         58
##  8 10039      Pullback Motor 8 x 4 x 2/3                                      44
##  9 10048      Minifig Hair Tousled                                            65
## 10 10049      Minifig Shield Broad with Spiked Bottom and Cutout Co~          27
## # ... with 17,491 more rows
inventory_parts <- readRDS("./Data/inventory_parts.rds")
parts %>% inner_join(inventory_parts, by = "part_num")
## # A tibble: 258,958 x 6
##    part_num name                      part_cat_id inventory_id color_id quantity
##    <chr>    <chr>                           <dbl>        <dbl>    <dbl>    <dbl>
##  1 0901     Baseplate 16 x 30 with S~           1         1973        2        1
##  2 0902     Baseplate 16 x 24 with S~           1         1973        2        1
##  3 0903     Baseplate 16 x 24 with S~           1         1973        2        1
##  4 0904     Baseplate 16 x 24 with S~           1         1973        2        1
##  5 1        Homemaker Bookcase 2 x 4~           7          508       15        1
##  6 1        Homemaker Bookcase 2 x 4~           7         1158       15        2
##  7 1        Homemaker Bookcase 2 x 4~           7         6590       15        2
##  8 1        Homemaker Bookcase 2 x 4~           7         9679       15        2
##  9 1        Homemaker Bookcase 2 x 4~           7        12256        1        2
## 10 1        Homemaker Bookcase 2 x 4~           7        13356       15        1
## # ... with 258,948 more rows

Joining in either direction

An inner_join works the same way with either table in either position. Which table you specify first is arbitrary, since you will end up with the same information in the resulting table either way.

Let’s prove this by joining the same two tables from the last exercise in the opposite order!

EXERCISE:

Connect the inventory_parts table with the parts tables.

# Combine the parts and inventory_parts tables
inventory_parts
## # A tibble: 258,958 x 4
##    inventory_id part_num             color_id quantity
##           <dbl> <chr>                   <dbl>    <dbl>
##  1           21 3009                        7       50
##  2           25 21019c00pat004pr1033       15        1
##  3           25 24629pr0002                78        1
##  4           25 24634pr0001                 5        1
##  5           25 24782pr0001                 5        1
##  6           25 88646                       0        1
##  7           25 973pr3314c01                5        1
##  8           26 14226c11                    0        3
##  9           26 2340px2                    15        1
## 10           26 2340px3                    15        1
## # ... with 258,948 more rows
parts
## # A tibble: 17,501 x 3
##    part_num   name                                                   part_cat_id
##    <chr>      <chr>                                                        <dbl>
##  1 0901       Baseplate 16 x 30 with Set 080 Yellow House Print                1
##  2 0902       Baseplate 16 x 24 with Set 080 Small White House Print           1
##  3 0903       Baseplate 16 x 24 with Set 080 Red House Print                   1
##  4 0904       Baseplate 16 x 24 with Set 080 Large White House Print           1
##  5 1          Homemaker Bookcase 2 x 4 x 4                                     7
##  6 10016414   Sticker Sheet #1 for 41055-1                                    58
##  7 10026stk01 Sticker for Set 10026 - (44942/4184185)                         58
##  8 10039      Pullback Motor 8 x 4 x 2/3                                      44
##  9 10048      Minifig Hair Tousled                                            65
## 10 10049      Minifig Shield Broad with Spiked Bottom and Cutout Co~          27
## # ... with 17,491 more rows
inventory_parts %>% inner_join(parts, by = "part_num")
## # A tibble: 258,958 x 6
##    inventory_id part_num    color_id quantity name                   part_cat_id
##           <dbl> <chr>          <dbl>    <dbl> <chr>                        <dbl>
##  1           21 3009               7       50 Brick 1 x 6                     11
##  2           25 21019c00pa~       15        1 Legs and Hips with Bl~          61
##  3           25 24629pr0002       78        1 Minifig Head Special ~          59
##  4           25 24634pr0001        5        1 Headwear Accessory Bo~          27
##  5           25 24782pr0001        5        1 Minifig Hipwear Skirt~          27
##  6           25 88646              0        1 Tile Special 4 x 3 wi~          15
##  7           25 973pr3314c~        5        1 Torso with 1 White Bu~          60
##  8           26 14226c11           0        3 String with End Studs~          31
##  9           26 2340px2           15        1 Tail 4 x 1 x 3 with '~          35
## 10           26 2340px3           15        1 Tail 4 x 1 x 3 with '~          35
## # ... with 258,948 more rows

Joining three tables

You can string together multiple joins with inner_join and the pipe (%>%), both of which you are already very familiar with!

We’ll now connect sets, a table that tells us about each LEGO kit, with inventories, a table that tells us the specific version of a given set, and finally to inventory_parts, a table which tells us how many of each part is available in each LEGO kit.

So if you were building a Batman LEGO set, sets would tell you the name of the set, inventories would give you IDs for each of the versions of the set, and inventory_parts would tell you how many of each part would be in each version.

EXERCISE:

Combine the inventories table with the sets table. Next, join the inventory_parts table to the table you created in the previous join by the inventory IDs.

sets <- readRDS("./Data/sets.rds")
inventories <- readRDS("./Data/inventories.rds")
sets %>%
    # Add inventories using an inner join 
    inner_join(inventories, by = "set_num") %>%
    # Add inventory_parts using an inner join 
    inner_join(inventory_parts, by = c("id" = "inventory_id"))
## # A tibble: 258,958 x 9
##    set_num name           year theme_id    id version part_num color_id quantity
##    <chr>   <chr>         <dbl>    <dbl> <dbl>   <dbl> <chr>       <dbl>    <dbl>
##  1 700.3-1 Medium Gift ~  1949      365 24197       1 bdoor01         2        2
##  2 700.3-1 Medium Gift ~  1949      365 24197       1 bdoor01        15        1
##  3 700.3-1 Medium Gift ~  1949      365 24197       1 bdoor01         4        1
##  4 700.3-1 Medium Gift ~  1949      365 24197       1 bslot02        15        6
##  5 700.3-1 Medium Gift ~  1949      365 24197       1 bslot02         2        6
##  6 700.3-1 Medium Gift ~  1949      365 24197       1 bslot02         4        6
##  7 700.3-1 Medium Gift ~  1949      365 24197       1 bslot02         1        6
##  8 700.3-1 Medium Gift ~  1949      365 24197       1 bslot02        14        6
##  9 700.3-1 Medium Gift ~  1949      365 24197       1 bslot02a       15        6
## 10 700.3-1 Medium Gift ~  1949      365 24197       1 bslot02a        2        6
## # ... with 258,948 more rows

What’s the most common color?

Now let’s join an additional table, colors, which will tell us the color of each part in each set, so that we can answer the question, “what is the most common color of a LEGO piece?”

EXERCISE:

Inner join the colors table using the color_id column from the previous join and the id column from colors; use the suffixes "_set" and "_color".
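No solution code is included for this step, so here is a minimal sketch of one possible answer. It assumes a colors table with id and name columns, loaded from ./Data/colors.rds like the other tables; that path and the final count() step are assumptions, not part of the original exercise.

colors <- readRDS("./Data/colors.rds")

sets %>%
    inner_join(inventories, by = "set_num") %>%
    inner_join(inventory_parts, by = c("id" = "inventory_id")) %>%
    # Both sets and colors have a name column, so add the suffixes
    inner_join(colors, by = c("color_id" = "id"), suffix = c("_set", "_color")) %>%
    # Count the color names; add wt = quantity to weight by number of pieces
    count(name_color, sort = TRUE)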

Left and Right Joins

The left_join verb: Left joining two sets by part and color

In the video, you learned how to left join two LEGO sets. Now you'll practice this skill on two new sets: the Millennium Falcon and Star Destroyer sets. We've created these tables and preloaded them for you:

EXERCISE:

Combine the star_destroyer and millennium_falcon tables with the suffixes _falcon and _star_destroyer.

inventory_parts_joined <- inventories %>%
  inner_join(inventory_parts, by = c("id" = "inventory_id")) %>%
  select(-id, -version) %>%
  arrange(desc(quantity))

millennium_falcon <- inventory_parts_joined %>%
  filter(set_num == "7965-1")

star_destroyer <- inventory_parts_joined %>%
  filter(set_num == "75190-1")
# Combine the star_destroyer and millennium_falcon tables
millennium_falcon %>%
  left_join(star_destroyer, by = c("part_num", "color_id"), suffix = c("_falcon", "_star_destroyer"))
## # A tibble: 263 x 6
##    set_num_falcon part_num color_id quantity_falcon set_num_star_de~
##    <chr>          <chr>       <dbl>           <dbl> <chr>           
##  1 7965-1         63868          71              62 <NA>            
##  2 7965-1         3023            0              60 <NA>            
##  3 7965-1         3021           72              46 75190-1         
##  4 7965-1         2780            0              37 75190-1         
##  5 7965-1         60478          72              36 <NA>            
##  6 7965-1         6636           71              34 75190-1         
##  7 7965-1         3009           71              28 75190-1         
##  8 7965-1         3665           71              22 <NA>            
##  9 7965-1         2412b          72              20 75190-1         
## 10 7965-1         3010           71              19 <NA>            
## # ... with 253 more rows, and 1 more variable: quantity_star_destroyer <dbl>

Sum the quantity column by color_id in the Millennium Falcon dataset.

# Aggregate the Millennium Falcon set for the total quantity of each color
millennium_falcon_colors <- millennium_falcon %>%
  group_by(color_id) %>%
  summarize(total_quantity = sum(quantity))

millennium_falcon_colors
## # A tibble: 21 x 2
##    color_id total_quantity
##       <dbl>          <dbl>
##  1        0            201
##  2        1             15
##  3        4             17
##  4       14              3
##  5       15             15
##  6       19             95
##  7       28              3
##  8       33              5
##  9       36              1
## 10       41              6
## # ... with 11 more rows

Use a left_join to join together sets and inventory_version_1 using their common column. Filter for rows where the version column is NA using is.na().

inventory_version_1 <- inventories %>%
    filter(version == 1)

# Join versions to sets
sets %>%
  left_join(inventory_version_1, by = "set_num") %>%
  # Filter for where version is NA
  filter(is.na(version))
## # A tibble: 1 x 6
##   set_num name       year theme_id    id version
##   <chr>   <chr>     <dbl>    <dbl> <dbl>   <dbl>
## 1 40198-1 Ludo game  2018      598    NA      NA

The right-join verb

Counting part colors

Sometimes you’ll want to do some processing before you do a join, and prioritize keeping the second (right) table’s rows instead. In this case, a right join is for you.

In the example below, we'll count the part_cat_id values from parts before using a right_join to join with part_categories. We do this because we want not only the count of each part_cat_id in parts, but also to know whether any part_cat_ids are not present in parts at all.

EXERCISE:

Use the count verb to count each part_cat_id in the parts table. Use a right_join to join part_categories. You’ll need to use the part_cat_id from the count and the id column from part_categories.

parts %>%
    # Count the part_cat_id
    count(part_cat_id) %>%
    # Right join part_categories
    right_join(part_categories, by = c("part_cat_id" = "id"))
## # A tibble: 64 x 3
##    part_cat_id     n name                   
##          <dbl> <int> <chr>                  
##  1           1   135 Baseplates             
##  2           3   303 Bricks Sloped          
##  3           4  1900 Duplo, Quatro and Primo
##  4           5   107 Bricks Special         
##  5           6   128 Bricks Wedged          
##  6           7    97 Containers             
##  7           8    24 Technic Bricks         
##  8           9   167 Plates Special         
##  9          11   490 Bricks                 
## 10          12    85 Technic Connectors     
## # ... with 54 more rows

Cleaning up your count

Both left and right joins can leave NA values in the resulting table. Fortunately, the replace_na function can turn those NAs into meaningful values.

In the last exercise, we saw that the n column had NAs after the right_join. Let's use the replace_na function, which takes a list of column names and the values with which their NAs should be replaced, to clean up our table.

EXERCISE:

Use replace_na to replace NAs in the n column with the value 0.

library(dplyr)
library(magrittr)
library(tidyr)
parts %>%
    count(part_cat_id) %>%
    right_join(part_categories, by = c("part_cat_id" = "id")) %>%
    # Use replace_na to replace missing values in the n column
    replace_na(list(n = 0))
## # A tibble: 64 x 3
##    part_cat_id     n name                   
##          <dbl> <dbl> <chr>                  
##  1           1   135 Baseplates             
##  2           3   303 Bricks Sloped          
##  3           4  1900 Duplo, Quatro and Primo
##  4           5   107 Bricks Special         
##  5           6   128 Bricks Wedged          
##  6           7    97 Containers             
##  7           8    24 Technic Bricks         
##  8           9   167 Plates Special         
##  9          11   490 Bricks                 
## 10          12    85 Technic Connectors     
## # ... with 54 more rows

Joining tables to themselves

Joining themes to their children

Tables can be joined to themselves!

In the themes table, which is available for you to inspect in the console, you’ll notice there is both an id column and a parent_id column. Keeping that in mind, you can join the themes table to itself to determine the parent-child relationships that exist for different themes.

In the videos, you saw themes joined to their own parents. In this exercise, you'll reverse that approach and join themes to their own children. Let's try this out to discover what children the theme "Harry Potter" has.

EXERCISE:

Inner join themes to their own children, using the suffixes "_parent" and "_child", respectively. Filter this table to find the children of the "Harry Potter" theme.

themes <- readRDS("./Data/themes.rds")
themes %>% 
    # Inner join the themes table
    inner_join(themes, by = c("id" = "parent_id"), suffix = c("_parent", "_child")) %>% 
    # Filter for the "Harry Potter" parent name 
    filter(name_parent == "Harry Potter")
## # A tibble: 6 x 5
##      id name_parent  parent_id id_child name_child          
##   <dbl> <chr>            <dbl>    <dbl> <chr>               
## 1   246 Harry Potter        NA      247 Chamber of Secrets  
## 2   246 Harry Potter        NA      248 Goblet of Fire      
## 3   246 Harry Potter        NA      249 Order of the Phoenix
## 4   246 Harry Potter        NA      250 Prisoner of Azkaban 
## 5   246 Harry Potter        NA      251 Sorcerer's Stone    
## 6   246 Harry Potter        NA      667 Fantastic Beasts

Joining themes to their grandchildren

We can go a step further than looking at themes and their children. Some themes actually have grandchildren: their children’s children.

Here, we can inner join themes to a filtered version of itself again to establish a connection between our last join’s children and their children.

EXERCISE:

Use another inner join to combine themes again with itself. Be sure to use the suffixes "_parent" and "_grandchild" so the columns in the resulting table are clear. Update the by argument to specify the correct columns to join on. If you’re unsure of what columns to join on, it might help to look at the result of the first join to get a feel for it.

themes %>% 
  inner_join(themes, by = c("id" = "parent_id"), suffix = c("_parent", "_child")) %>%
  inner_join(themes, by = c("id_child" = "parent_id"), suffix = c("_parent", "_grandchild"))
## # A tibble: 158 x 7
##    id_parent name_parent parent_id id_child name_child id_grandchild name       
##        <dbl> <chr>           <dbl>    <dbl> <chr>              <dbl> <chr>      
##  1         1 Technic            NA        5 Model                  6 Airport    
##  2         1 Technic            NA        5 Model                  7 Constructi~
##  3         1 Technic            NA        5 Model                  8 Farm       
##  4         1 Technic            NA        5 Model                  9 Fire       
##  5         1 Technic            NA        5 Model                 10 Harbor     
##  6         1 Technic            NA        5 Model                 11 Off-Road   
##  7         1 Technic            NA        5 Model                 12 Race       
##  8         1 Technic            NA        5 Model                 13 Riding Cyc~
##  9         1 Technic            NA        5 Model                 14 Robot      
## 10         1 Technic            NA        5 Model                 15 Traffic    
## # ... with 148 more rows

Left-joining a table to itself

So far, you’ve been inner joining a table to itself in order to find the children of themes like “Harry Potter” or “The Lord of the Rings”.

But some themes might not have any children at all, which means they won’t be included in the inner join. As you’ve learned in this chapter, you can identify those with a left_join and a filter().

EXERCISE:

Left join the themes table to its own children, with the suffixes _parent and _child respectively. Filter the result of the join to find themes that have no children.

themes %>% 
  # Left join the themes table to its own children
  left_join(themes, by = c("id" = "parent_id"), suffix = c("_parent", "_child")) %>%
  # Filter for themes that have no child themes
  filter(is.na(id_child))
## # A tibble: 586 x 5
##       id name_parent    parent_id id_child name_child
##    <dbl> <chr>              <dbl>    <dbl> <chr>     
##  1     2 Arctic Technic         1       NA <NA>      
##  2     3 Competition            1       NA <NA>      
##  3     4 Expert Builder         1       NA <NA>      
##  4     6 Airport                5       NA <NA>      
##  5     7 Construction           5       NA <NA>      
##  6     8 Farm                   5       NA <NA>      
##  7     9 Fire                   5       NA <NA>      
##  8    10 Harbor                 5       NA <NA>      
##  9    11 Off-Road               5       NA <NA>      
## 10    12 Race                   5       NA <NA>      
## # ... with 576 more rows

Full, Semi, and Anti Joins

The full_join verb

Differences between Batman and Star Wars

In the video, you compared two sets. Now, you’ll compare two themes, each of which is made up of many sets.

First, you’ll need to join in the themes. Recall that doing so requires going through the sets first. You’ll use the inventory_parts_joined table from the video, which is already available to you in the console.

EXERCISE:

In order to join in the themes, you’ll first need to combine the inventory_parts_joined and sets tables. Keep only the observations that have a match in both tables. Then, combine the first join with the themes table, using the suffix argument to clarify which table each name came from ("_set" or "_theme"). Keep only the observations that have a match in both tables. If your exercise is timing out, you are likely joining the wrong tables, causing the join result to be too big! Check your code carefully!

inventory_parts_joined <- inventories %>%
  inner_join(inventory_parts, by = c("id" = "inventory_id")) %>%
  arrange(desc(quantity)) %>%
  select(-id, -version)
# Start with inventory_parts_joined table

inventory_parts_joined %>%
  # Combine with the sets table 
   inner_join(sets, by = c("set_num")) %>%
  # Combine with the themes table 
    inner_join(themes, by = c("theme_id" = "id"), suffix = c("_set", "_theme"))
## # A tibble: 258,958 x 9
##    set_num part_num color_id quantity name_set  year theme_id name_theme
##    <chr>   <chr>       <dbl>    <dbl> <chr>    <dbl>    <dbl> <chr>     
##  1 40179-1 3024           72      900 Persona~  2016      277 Mosaic    
##  2 40179-1 3024           15      900 Persona~  2016      277 Mosaic    
##  3 40179-1 3024            0      900 Persona~  2016      277 Mosaic    
##  4 40179-1 3024           71      900 Persona~  2016      277 Mosaic    
##  5 40179-1 3024           14      900 Persona~  2016      277 Mosaic    
##  6 k34434~ 3024           15      810 Lego Mo~  2003      277 Mosaic    
##  7 21010-1 3023          320      771 Robie H~  2011      252 Architect~
##  8 k34431~ 3024            0      720 Lego Mo~  2003      277 Mosaic    
##  9 42083-1 2780            0      684 Bugatti~  2018        5 Model     
## 10 k34434~ 3024            0      540 Lego Mo~  2003      277 Mosaic    
## # ... with 258,948 more rows, and 1 more variable: parent_id <dbl>

Aggregating each theme

Previously, you combined tables to compare themes. Before doing this comparison, you’ll want to aggregate the data to learn more about the pieces that are a part of each theme, as well as the colors of those pieces.

The table you created previously has been preloaded for you as inventory_sets_themes. It was filtered for each theme, and the objects have been saved as batman and star_wars.

inventory_sets_themes <- inventory_parts_joined %>%
  inner_join(sets, by = "set_num") %>%
  inner_join(themes, by = c("theme_id" = "id"), suffix = c("_set", "_theme"))

batman <- inventory_sets_themes %>%
  filter(name_theme == "Batman")

star_wars <- inventory_sets_themes %>%
  filter(name_theme == "Star Wars")

EXERCISE:

Count the part number and color id for the parts in Batman and Star Wars, weighted by quantity.

# Count the part number and color id, weight by quantity
batman %>% 
count(part_num, color_id, wt = quantity)
## # A tibble: 2,071 x 3
##    part_num color_id     n
##    <chr>       <dbl> <dbl>
##  1 10113           0    11
##  2 10113         272     1
##  3 10113         320     1
##  4 10183          57     1
##  5 10190           0     2
##  6 10201           0     1
##  7 10201           4     3
##  8 10201          14     1
##  9 10201          15     6
## 10 10201          71     4
## # ... with 2,061 more rows
star_wars %>%
  count(part_num, color_id, wt = quantity)
## # A tibble: 2,413 x 3
##    part_num color_id     n
##    <chr>       <dbl> <dbl>
##  1 10169           4     1
##  2 10197           0     2
##  3 10197          72     3
##  4 10201           0    21
##  5 10201          71     5
##  6 10247           0     9
##  7 10247          71    16
##  8 10247          72    12
##  9 10884          28     1
## 10 10928          72     6
## # ... with 2,403 more rows

Full-joining Batman and Star Wars LEGO parts

Now that you've got separate tables for the pieces in the batman and star_wars themes, you'll want to combine them to see any similarities or differences between the two themes. The aggregated counts from the last exercise have been saved as batman_parts and star_wars_parts and are preloaded for you.

batman_parts <- batman %>%
  count(part_num, color_id, wt = quantity)

star_wars_parts <- star_wars %>%
  count(part_num, color_id, wt = quantity)

EXERCISE:

Combine the star_wars_parts table with the batman_parts table; use the suffix argument to include the "_batman" and "_star_wars" suffixes. Replace all the NA values in the n_batman and n_star_wars columns with 0s.

batman_parts %>%
  # Combine the star_wars_parts table 
  full_join(star_wars_parts, by = c("part_num", "color_id"), suffix = c("_batman", "_star_wars")) %>%
  # Replace NAs with 0s in the n_batman and n_star_wars columns 
  replace_na(list(n_batman = 0, n_star_wars = 0))
## # A tibble: 3,628 x 4
##    part_num color_id n_batman n_star_wars
##    <chr>       <dbl>    <dbl>       <dbl>
##  1 10113           0       11           0
##  2 10113         272        1           0
##  3 10113         320        1           0
##  4 10183          57        1           0
##  5 10190           0        2           0
##  6 10201           0        1          21
##  7 10201           4        3           0
##  8 10201          14        1           0
##  9 10201          15        6           0
## 10 10201          71        4           5
## # ... with 3,618 more rows

Comparing Batman and Star Wars LEGO parts

The table you created in the last exercise includes the part number of each piece, the color id, and the number of each piece in the Star Wars and Batman themes. However, we have more information about each of these parts that we can gain by combining this table with some of the information we have in other tables. Before we compare the themes, let’s ensure that we have enough information to make our findings more interpretable. The table from the last exercise has been saved as parts_joined and is preloaded for you.

parts_joined <- batman_parts %>%
  full_join(star_wars_parts, by = c("part_num", "color_id"), suffix = c("_batman", "_star_wars")) %>%
  replace_na(list(n_batman = 0, n_star_wars = 0))

EXERCISE:

Sort the number of Star Wars pieces in the parts_joined table in descending order. Join the colors table to the parts_joined table. Combine the parts table with the previous join; add "_color" and "_part" suffixes to specify whether the information came from the colors table or the parts table.
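The solution code for this step is not shown; below is a sketch of one way to do it, again assuming a colors table (with id and name columns) has been loaded.

parts_joined %>%
    # Sort the Star Wars piece counts in descending order
    arrange(desc(n_star_wars)) %>%
    # Join the colors table to the parts_joined table
    inner_join(colors, by = c("color_id" = "id")) %>%
    # Join the parts table; add "_color" and "_part" suffixes to the name columns
    inner_join(parts, by = "part_num", suffix = c("_color", "_part"))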

The semi- and anti-join verbs

In the videos, you learned how to filter with the semi- and anti-join verbs to answer questions about your data. Let's focus on the batwing dataset, and use our skills to determine which parts are in both the batwing and batmobile sets, and which parts are in one set but not the other. While answering these questions, we'll also determine whether the parts we're looking at in both sets have the same color in common.

The batmobile and batwing datasets have been preloaded for you.

batmobile <- inventory_parts_joined %>%
  filter(set_num == "7784-1") %>%
  select(-set_num)

batwing <- inventory_parts_joined %>%
  filter(set_num == "70916-1") %>%
  select(-set_num)

EXERCISE:

Filter the batwing set for parts that are also in the batmobile, whether or not they have the same color.

# Filter the batwing set for parts that are also in the batmobile set
batwing %>%
  semi_join(batmobile, by = c("part_num"))
## # A tibble: 126 x 3
##    part_num color_id quantity
##    <chr>       <dbl>    <dbl>
##  1 3023            0       22
##  2 3024            0       22
##  3 3623            0       20
##  4 2780            0       17
##  5 3666            0       16
##  6 3710            0       14
##  7 6141            4       12
##  8 2412b          71       10
##  9 6141           72       10
## 10 6558            1        9
## # ... with 116 more rows

Filter the batwing set for parts that aren’t also in the batmobile, whether or not they have the same color.

# Filter the batwing set for parts that aren't in the batmobile set
batwing %>%
  anti_join(batmobile, by = c("part_num"))
## # A tibble: 183 x 3
##    part_num color_id quantity
##    <chr>       <dbl>    <dbl>
##  1 11477           0       18
##  2 99207          71       18
##  3 22385           0       14
##  4 99563           0       13
##  5 10247          72       12
##  6 2877           72       12
##  7 61409          72       12
##  8 11153           0       10
##  9 98138          46       10
## 10 2419           72        9
## # ... with 173 more rows

What colors are included in at least one set?

Besides comparing two sets directly, you could also use a filtering join like semi_join to find out which colors ever appear in any inventory part. Some of the colors could be optional, meaning they aren’t included in any sets.

The inventory_parts and colors tables have been preloaded for you.

EXERCISE:

Use the inventory_parts table to find the colors that are included in at least one set.
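No code is shown for this exercise; a minimal sketch, assuming the colors table has an id column matching color_id, is:

# Keep only the colors that appear in at least one inventory part
colors %>%
    semi_join(inventory_parts, by = c("id" = "color_id"))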

Which set is missing version 1?

Each set included in the LEGO data has an associated version number. We want to understand the version we are looking at to learn more about the parts that are included. Before doing that, we should confirm that there aren’t any sets that are missing a particular version.

Let’s start by looking at the first version of each set to see if there are any sets that don’t include a first version.

EXERCISE:

Use filter() to extract version 1 from the inventories table; save the filter to version_1_inventories.

# Use filter() to extract version 1 
version_1_inventories <- inventories %>%
  filter(version == 1)

Use anti_join to combine version_1_inventories with sets to determine which set is missing a version 1.

# Use anti_join() to find which set is missing a version 1
sets %>%
  anti_join(version_1_inventories, by = c("set_num"))
## # A tibble: 1 x 4
##   set_num name       year theme_id
##   <chr>   <chr>     <dbl>    <dbl>
## 1 40198-1 Ludo game  2018      598

Visualizing set differences

Aggregating sets to look at their differences

To compare two individual sets, and the kinds of LEGO pieces that comprise them, we’ll need to aggregate the data into separate themes. Additionally, as we saw in the video, we’ll want to add a column so that we can understand the percentages of specific pieces that are part of each set, rather than looking at the numbers of pieces alone.

The inventory_parts_themes table has been preloaded for you.

inventory_parts_themes <- inventories %>%
  inner_join(inventory_parts, by = c("id" = "inventory_id")) %>%
  arrange(desc(quantity)) %>%
  select(-id, -version) %>%
  inner_join(sets, by = "set_num") %>%
  inner_join(themes, by = c("theme_id" = "id"), suffix = c("_set", "_theme"))

Add a filter for the "Batman" theme to create the batman_colors object. Add a percent column to batman_colors that displays the total divided by the sum of the total.

batman_colors <- inventory_parts_themes %>%
  # Filter the inventory_parts_themes table for the Batman theme
  filter(name_theme == "Batman") %>%
  group_by(color_id) %>%
  summarize(total = sum(quantity)) %>%
  # Add a percent column of the total divided by the sum of the total 
  mutate(percent = total / sum(total))

Repeat the steps to filter and aggregate the "Star Wars" theme data to create the star_wars_colors object. Again, add a percent column to display the percent of the total.

# Filter and aggregate the Star Wars set data; add a percent column
star_wars_colors <- inventory_parts_themes %>%
  filter(name_theme == "Star Wars") %>%
  group_by(color_id) %>%
  summarize(total = sum(quantity)) %>%
  mutate(percent = total/ sum(total))

Combining sets

The data you aggregated in the last exercise has been preloaded for you as batman_colors and star_wars_colors. Prior to visualizing the data, you’ll want to combine these tables to be able to directly compare the themes’ colors.

batman_colors <- inventory_parts_themes %>%
  filter(name_theme == "Batman") %>%
  group_by(color_id) %>%
  summarize(total = sum(quantity)) %>%
  mutate(percent = total / sum(total))
star_wars_colors <- inventory_parts_themes %>%
  filter(name_theme == "Star Wars") %>%
  group_by(color_id) %>%
  summarize(total = sum(quantity)) %>%
  mutate(percent = total / sum(total))

EXERCISE:

Join the batman_colors and star_wars_colors tables; be sure to include all observations from both tables. Replace the NAs in the total_batman and total_star_wars columns.
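The solution code is not reproduced here; a sketch of what it might look like follows (the "_batman" and "_star_wars" suffixes are an assumption, chosen for consistency with the earlier exercises).

batman_colors %>%
    # Join the Star Wars colors, keeping all observations from both tables
    full_join(star_wars_colors, by = "color_id", suffix = c("_batman", "_star_wars")) %>%
    # Replace the NAs in the total columns with 0
    replace_na(list(total_batman = 0, total_star_wars = 0))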

Visualizing the difference: Batman and Star Wars

In the last exercise, you created colors_joined. Now you’ll create a bar plot with one bar for each color (name), showing the difference in percentages.

Because factors and visualization are beyond the scope of this course, we’ve done some processing for you: here is the code that created the colors_joined table that will be used in the video.
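That processing code is not reproduced in these notes; a hedged sketch of what it might have looked like is below. The join to the colors table, the total >= 200 cutoff, and the use of forcats::fct_reorder() are assumptions about the course code, not something stated in this document.

library(forcats)  # assumed, for fct_reorder()

colors_joined <- batman_colors %>%
    full_join(star_wars_colors, by = "color_id", suffix = c("_batman", "_star_wars")) %>%
    replace_na(list(total_batman = 0, total_star_wars = 0)) %>%
    # Add the color names so each bar can be labelled (assumes the colors table)
    inner_join(colors, by = c("color_id" = "id")) %>%
    # Difference in percentage between the two themes
    mutate(difference = percent_batman - percent_star_wars,
           total = total_batman + total_star_wars) %>%
    # Keep reasonably common colors and order them by the difference
    filter(total >= 200) %>%
    mutate(name = fct_reorder(name, difference))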

6. Introduction to Data Visualization with ggplot2

Introduction

Drawing your first plot

To get a first feel for ggplot2, let’s try to run some basic ggplot2 commands. The mtcars dataset contains information on 32 cars from a 1973 issue of Motor Trend magazine. This dataset is small, intuitive, and contains a variety of continuous and categorical variables.

EXERCISE:

Load the ggplot2 package using library().

# Load the ggplot2 package
library(ggplot2)

Use str() to explore the structure of the mtcars dataset.

# Explore the mtcars data frame with str()
str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

Hit Submit Answer. This will execute the example code on the right. See if you can understand what ggplot does with the data.

# Execute the following command
ggplot(mtcars, aes(cyl, mpg)) +
  geom_point()

Data column types affect plot types

The plot from the previous exercise wasn’t really satisfying. Although cyl (the number of cylinders) is categorical, you probably noticed that it is classified as numeric in mtcars. This is really misleading because the representation in the plot doesn’t match the actual data type. You’ll have to explicitly tell ggplot2 that cyl is a categorical variable.

EXERCISES:

Change the ggplot() command by wrapping factor() around cyl. Hit Submit Answer and see if the resulting plot is better this time.

# Change the command below so that cyl is treated as factor
ggplot(mtcars, aes(factor(cyl), mpg)) +
  geom_point()

The grammar of graphics

Mapping data columns to aesthetics

Let’s dive a little deeper into the three main topics in this course: The data, aesthetics, and geom layers. We’ll get to making pretty plots in the last chapter with the themes layer.

We’ll continue working on the 32 cars in the mtcars data frame.

Consider how the examples and concepts we discuss throughout these courses apply to your own data-sets!

EXERCISES:

Add a color aesthetic mapped to the displacement of the car engine: inside aes(), add a color argument equal to disp.

# Edit to add a color aesthetic mapped to disp
ggplot(mtcars, aes(wt, mpg, color = disp)) +
  geom_point() 

This time, map disp to the size aesthetic.

# Change the color aesthetic to a size aesthetic
ggplot(mtcars, aes(wt, mpg, size = disp)) +
  geom_point()

ggplot2 layers

Adding geometries

The diamonds dataset contains details of 1,000 diamonds. Among the variables included are carat (a measurement of the diamond’s size) and price.

You’ll use two common geom layer functions:

geom_point() adds points (as in a scatter plot). geom_smooth() adds a smooth trend curve. As you saw previously, these are added using the + operator.

ggplot(data, aes(x, y)) + geom_*(), where * is the specific geometry needed.

diamonds <- read.csv("./Data/diamonds.csv")

EXERCISES:

Explore the diamonds data frame with the str() function.

# Explore the diamonds data frame with str()
str(diamonds)
## 'data.frame':    1000 obs. of  11 variables:
##  $ X      : int  32672 19562 5337 18246 40025 16977 13192 30838 41853 17437 ...
##  $ carat  : num  0.31 1.5 0.9 1.01 0.33 1.08 1.07 0.33 0.44 1 ...
##  $ cut    : chr  "Ideal" "Good" "Premium" "Ideal" ...
##  $ color  : chr  "G" "G" "H" "F" ...
##  $ clarity: chr  "VS1" "SI2" "VS2" "VS2" ...
##  $ depth  : num  62.4 64.3 62.8 60.9 63.2 62 61.6 59.5 62 58.6 ...
##  $ table  : num  55 57 58 58 56 55 58 59 57 61 ...
##  $ price  : int  802 8190 3810 7411 1109 6779 5453 743 1255 6989 ...
##  $ x      : num  4.35 7.29 6.17 6.43 4.45 6.62 6.6 4.53 4.87 6.57 ...
##  $ y      : num  4.33 7.2 6.13 6.47 4.44 6.57 6.56 4.48 4.91 6.5 ...
##  $ z      : num  2.71 4.66 3.86 3.93 2.81 4.09 4.05 2.68 3.02 3.83 ...

Edit the plot code to add a point geom. Use the + operator to add geom_point() to the ggplot() command.

# Add geom_point() with +
ggplot2::ggplot(diamonds, aes(carat, price)) +
  geom_point()

Add a smooth geom to the plot. Use the + operator to add geom_smooth().

# Add geom_smooth() with +
ggplot(diamonds, aes(carat, price)) +
  geom_point() +
  geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Changing one geom or every geom

If you have multiple geoms, then mapping an aesthetic to a data variable inside the call to ggplot() will change all the geoms. It is also possible to make changes to individual geoms by passing arguments to the geom_*() functions.

geom_point() has an alpha argument that controls the opacity of the points. A value of 1 (the default) means that the points are totally opaque; a value of 0 means the points are totally transparent (and therefore invisible). Values in between specify transparency.

The plot you drew last time is provided in the script.

EXERCISES:

# Map the color aesthetic to clarity
ggplot(diamonds, aes(carat, price, color = clarity)) +
  geom_point() +
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Make the points translucent by setting the alpha argument to 0.4.

# Make the points 40% opaque
ggplot(diamonds, aes(carat, price, color = clarity)) +
  geom_point(alpha = 0.4) +
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Saving plots as variables

Plots can be saved as variables, which can be added to later on using the + operator. This is really useful if you want to make multiple related plots from a common base.

EXERCISES:

Using the diamonds dataset, plot the price (y-axis) versus the carat (x-axis), assigning to plt_price_vs_carat.

# Draw a ggplot
plt_price_vs_carat <- ggplot(
  # Use the diamonds dataset
  data = diamonds,
  # For the aesthetics, map x to carat and y to price
  aes(x = carat, y = price)
) 

Using geom_point(), add a point layer to plt_price_vs_carat.

# Add a point layer to plt_price_vs_carat
plt_price_vs_carat + geom_point()

Add an alpha argument to the point layer to make the points 20% opaque, assigning to plt_price_vs_carat_transparent.

Type the plot’s variable name (plt_price_vs_carat_transparent) to display it.

# From previous step
plt_price_vs_carat <- ggplot(diamonds, aes(carat, price))

# Edit this to make points 20% opaque: plt_price_vs_carat_transparent
plt_price_vs_carat_transparent <- plt_price_vs_carat + geom_point(alpha = 0.2)

# See the plot
plt_price_vs_carat_transparent

Inside geom_point(), call aes() and map color to clarity, assigning to plt_price_vs_carat_by_clarity. Type the plot’s variable name (plt_price_vs_carat_by_clarity) to display it.

# From previous step
plt_price_vs_carat <- ggplot(diamonds, aes(carat, price))

# Edit this to map color to clarity,
# Assign the updated plot to a new object
plt_price_vs_carat_by_clarity<- plt_price_vs_carat + geom_point(aes(color = clarity))

# See the plot
plt_price_vs_carat_by_clarity

Aesthetics

Visible aesthetics

All about aesthetics: color, shape and size

In the video you saw 9 visible aesthetics. Let’s apply them to a categorical variable — the cylinders in mtcars, cyl.

These are the aesthetics you can consider within aes() in this chapter: x, y, color, fill, size, alpha, labels and shape.

One common convention is that you don’t name the x and y arguments to aes(), since they almost always come first, but you do name other arguments.

In the following exercise the fcyl column is categorical. It is cyl transformed into a factor.

EXERCISES:

Map gear onto the x aesthetic, and mpg onto the y.

# Map gear onto x and mpg onto y
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
ggplot(mtcars, aes(x = gear , y = mpg)) +
  geom_point()

Map wt onto x, mpg onto y, and gear onto color.

# Map x to wt, y to mpg and color to fcyl
ggplot(mtcars, aes(x = wt, y = mpg, color = gear)) +
  geom_point()

Modify the point layer of the previous plot by changing the shape argument to 1 and increasing the size to 4.

ggplot(mtcars, aes(wt, mpg, color = gear)) +
  # Set the shape and size of the points
  geom_point(shape = 1, size = 4)

Using attributes

All about attributes: color, shape, size and alpha

This time you’ll use these arguments to set attributes of the plot, not map variables onto aesthetics.

You can specify colors in R using hex codes: a hash followed by two hexadecimal numbers each for red, green, and blue ("#RRGGBB"). Hexadecimal is base-16 counting. You have 0 to 9, and A representing 10 up to F representing 15. Pairs of hexadecimal numbers give you a range from 0 to 255. "#000000" is "black" (no color), "#FFFFFF" means "white", and "#00FFFF" is cyan (mixed green and blue).

A hexadecimal color variable, my_blue has been defined for you.

EXERCISES:

Set the point color to my_blue and the alpha to 0.6.

# A hexadecimal color
my_blue <- "#4ABEFF"

ggplot(mtcars, aes(wt, mpg)) +
  # Set the point color and alpha
  geom_point(color = my_blue, alpha = 0.6)

Change the color mapping to a fill mapping. That is, gear should be mapped onto fill. Set the color of the points to my_blue, the point size to 10 and the point shape to 1.

# A hexadecimal color
my_blue <- "#4ABEFF"

# Change the color mapping to a fill mapping
ggplot(mtcars, aes(wt, mpg, fill = gear)) +
  # Set point size and shape
  geom_point(color = my_blue, size = 10, shape = 1)

All about attributes: conflicts with aesthetics

In the videos you saw that you can use all the aesthetics as attributes. Let’s see how this works with the aesthetics you used in the previous exercises: x, y, color, fill, size, alpha, label and shape.

In this exercise you will set all kinds of attributes of the points!

You will continue to work with mtcars.

EXERCISES:

Add a point layer, setting alpha, the transparency, to 0.5.

ggplot(mtcars, aes(wt, mpg, color = gear)) +
  # Add point layer with alpha 0.5
  geom_point(alpha = 0.5)

Add a text layer, setting the label to the rownames of the dataset mtcars, and the color to “red”.

ggplot(mtcars, aes(wt, mpg, color = gear)) +
  # Add text layer with label rownames(mtcars) and color red
  geom_text(label = rownames(mtcars), color = 'red')

Add a point layer, setting the shape to 24 and the color to “yellow”.

ggplot(mtcars, aes(wt, mpg, color = gear)) +
  # Add points layer with shape 24 and color yellow
  geom_point(shape = 24, color = 'yellow')

Going all out

In this exercise, you will gradually add more aesthetics layers to the plot. You’re still working with the mtcars dataset, but this time you’re using more features of the cars. Each of the columns is described on the mtcars help page. Columns fcyl and fam have been added (as before) as categorical versions of cyl and am respectively.

Notice that adding more aesthetic mappings to your plot is not always a good idea! You may just increase complexity and decrease readability.

EXERCISES:

Use mtcars to draw a plot of qsec vs. mpg, colored by gear. Add a point layer.

# 3 aesthetics: qsec vs. mpg, colored by gear
ggplot(mtcars, aes(x = mpg, y = qsec, color= gear)) +
  geom_point()

Add another aesthetic: map gear onto cyl
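This prompt appears garbled and no solution follows. One plausible reading, sketched below purely as an assumption, is to add a second aesthetic by mapping cyl onto size.

# One possible reading of the prompt (an assumption): map cyl onto size
ggplot(mtcars, aes(x = mpg, y = qsec, color = gear, size = cyl)) +
  geom_point()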

Modifying aesthetics

Updating aesthetic labels

In this exercise, you’ll modify some aesthetics to make a bar plot of the number of cylinders for cars with different types of transmission.

You’ll also make use of some functions for improving the appearance of the plot.

labs() sets the x- and y-axis labels; it takes strings for each argument. scale_fill_manual() defines properties of the fill scale (i.e. the legend): its first argument sets the legend title, and values is a named vector of colors to use.

EXERCISES:

Set the x-axis label to “Number of Cylinders”, and the y-axis label to “Count” using the x and y arguments of labs(), respectively.

ggplot(mtcars, aes(cyl, fill = gear)) +
  geom_bar() +
  # Set the axis labels
  labs(x = "Number of Cylinders", y ="Count")

Implement a custom fill color scale using scale_fill_manual(). Set the first argument to “Transmission”, and values to palette.

palette <- c(automatic = "#377EB8", manual = "#E41A1C")

ggplot(mtcars, aes(cyl, fill = gear)) +
  geom_bar() +
  labs(x = "Number of Cylinders", y = "Count") +
  # Set the fill color scale
  scale_fill_manual("Transmission", values = palette)

Modify the code to set the position to dodge so that the bars for transmissions are displayed side by side.

palette <- c(automatic = "#377EB8", manual = "#E41A1C")

# Set the position
ggplot(mtcars, aes(cyl, fill = gear)) +
  geom_bar(position = "dodge") +
  labs(x = "Number of Cylinders", y = "Count") +
  # Set the fill color scale
  scale_fill_manual("Transmission", values = palette)

Setting a dummy aesthetic

In the last chapter you saw that all the visible aesthetics can serve as attributes and aesthetics, but I very conveniently left out x and y. That’s because although you can make univariate plots (such as histograms, which you’ll get to in the next chapter), a y-axis will always be provided, even if you didn’t ask for it.

You can make univariate plots in ggplot2, but you will need to add a fake y axis by mapping y to zero.

When setting y-axis limits, you can specify the limits as separate arguments, or as a single numeric vector. That is, ylim(lo, hi) or ylim(c(lo, hi)).

EXERCISES:

Using mtcars, plot 0 vs. mpg. Make a scatter plot and add “jitter” to it.

# Plot 0 vs. mpg
ggplot(mtcars, aes(x = mpg, y =0)) +
  # Add jitter 
  geom_point(position = "jitter")

Use ylim() to set the limits on the y-axis from -2 to 2.

ggplot(mtcars, aes(mpg, 0)) +
  geom_jitter() +
  # Set the y-axis limits
  ylim(-2,2)

Geometries

Scatter plots

Overplotting 1: large datasets

Scatter plots (using geom_point()) are intuitive, easily understood, and very common, but we must always consider overplotting, particularly in the following four situations:

Large datasets, aligned values on a single axis, low-precision data, and integer data. Typically, alpha blending (i.e. adding transparency) is recommended when using solid shapes. Alternatively, you can use opaque, hollow shapes.

Small points are suitable for large datasets with regions of high density (lots of overlapping).

Let’s use the diamonds dataset to practice dealing with the large dataset case.

diamonds <- read.csv("./Data/diamond.csv")

EXERCISES:

Add a points layer to the base plot.

Set the point transparency to 0.5. Set shape = “.”, the point size of 1 pixel.

# Plot price vs. carat, colored by clarity
plt_price_vs_carat_by_clarity <- ggplot(diamonds, aes(carat, price, color = clarity))

# Add a point layer with tiny points
plt_price_vs_carat_by_clarity + geom_point(alpha = 0.5, shape = ".")

Update the point shape to remove the line outlines by setting shape to 16.

# Plot price vs. carat, colored by clarity
plt_price_vs_carat_by_clarity <- ggplot(diamonds, aes(carat, price, color = clarity))

# Set transparency to 0.5
plt_price_vs_carat_by_clarity + geom_point(alpha = 0.5, shape = 16)

Overplotting 2: Aligned values

Let’s take a look at another case where we should be aware of overplotting: Aligning values on a single axis.

This occurs when one axis is continuous and the other is categorical, which can be overcome with some form of jittering.

In the mtcars data set, fcyl and fam are categorical variants of cyl and am.

EXERCISES:

Create a base plot plt_mpg_vs_fcyl_by_fam of fcyl by mpg, colored by fam. Add a points layer to the base plot.

# Plot base
plt_mpg_vs_fcyl_by_fam <- ggplot(mtcars, aes(x = gear, y = mpg, color = gear))

# Default points are shown for comparison
plt_mpg_vs_fcyl_by_fam + geom_point()

Add some jittering by using position_jitter(), setting the width to 0.3.

# Plot base
plt_mpg_vs_fcyl_by_fam <- ggplot(mtcars, aes(x = gear, y = mpg, color = gear))

# Default points are shown for comparison
plt_mpg_vs_fcyl_by_fam + geom_point()

# Alter the point positions by jittering, width 0.3
plt_mpg_vs_fcyl_by_fam + geom_point(position = position_jitter(width = 0.3))

Alternatively, use position_jitterdodge(). Set jitter.width and dodge.width to 0.3 to separate subgroups further.

# Plot base
plt_mpg_vs_fcyl_by_fam <- ggplot(mtcars, aes(x = gear, y = mpg, color = gear))

# Default points are shown for comparison
plt_mpg_vs_fcyl_by_fam + geom_point()

# Now jitter and dodge the point positions
plt_mpg_vs_fcyl_by_fam + geom_point(position = position_jitterdodge(jitter.width = 0.3, dodge.width = 0.3))

Overplotting 3: Low-precision data

You already saw how to deal with overplotting when using geom_point() in two cases:

Large datasets and aligned values on a single axis. We used position = "jitter" inside geom_point() or geom_jitter().

Let’s take a look at another case:

Low-precision data. This results from low-resolution measurements, like in the iris dataset, which is measured to 1 mm precision. It's similar to case 2, but in this case we can jitter on both the x and y axes.

iris <- read.csv("./Data/iris.csv")

EXERCISES:

Change the points layer into a jitter layer. Reduce the jitter layer’s width by setting the width argument to 0.1.

ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  # Swap for jitter layer with width 0.1
  geom_jitter(width = 0.1, alpha = 0.5)


Let’s use a different approach:

Within geom_point(), set position to “jitter”.

ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  # Set the position to jitter
  geom_point(alpha = 0.5, position = "jitter")

Provide an alternative specification:

Have the position argument call position_jitter() with a width of 0.1.

ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  # Use a jitter position function with width 0.1
  geom_point(alpha = 0.5, position = position_jitter(width = 0.1))

Overplotting 4: Integer data

Let’s take a look at the last case of dealing with overplotting:

Integer data. This can be integer variables (i.e. 1, 2, 3, ...) or categorical variables (i.e. class factor). A factor is just a special class of type integer.

You’ll typically have a small, defined number of intersections between two variables, which is similar to case 3, but you may miss it if you don’t realize that integer and factor data are the same as low precision data.

The Vocab dataset provided contains the years of education and vocabulary test scores from respondents to US General Social Surveys from 1972-2004.

Vocab <- read.csv("./Data/Vocab.csv", sep = ";")
colnames(Vocab)[1] <- "year"

EXERCISES:

Examine the Vocab dataset using str().

# Examine the structure of Vocab

str(Vocab)
## 'data.frame':    6 obs. of  4 variables:
##  $ year      : int  1974 1974 1974 1974 1974 1974
##  $ sex       : chr  "Male" "Male" "Female" "Female" ...
##  $ education : int  14 16 10 10 12 16
##  $ vocabulary: int  9 9 9 5 8 8

Using Vocab, draw a plot of vocabulary vs education. Add a point layer.

# Plot vocabulary vs. education
ggplot(Vocab, aes(x = education, y = vocabulary)) +
  # Add a point layer
  geom_point()

Replace the point layer with a jitter layer.

ggplot(Vocab, aes(education, vocabulary)) +
  # Change to a jitter layer
  geom_jitter()

Set the jitter transparency to 0.2.

ggplot(Vocab, aes(education, vocabulary)) +
  # Set the transparency to 0.2
  geom_jitter(alpha = 0.2)

Set the shape of the jittered points to hollow circles, (shape 1).

ggplot(Vocab, aes(education, vocabulary)) +
  # Set the shape to 1
  geom_jitter(alpha = 0.2, shape = 1)

Histograms

Drawing histograms

Recall that histograms cut up a continuous variable into discrete bins and, by default, map the internally calculated count variable (the number of observations in each bin) onto the y aesthetic. An internal variable called density can be accessed by using the .. notation, i.e. ..density.. Plotting this variable will show the relative frequency, which is the height times the width of each bin.

EXERCISES.

Using mtcars, map mpg onto the x aesthetic. Add a histogram layer using geom_histogram().

# Plot mpg
ggplot(mtcars, aes(x = mpg)) +
  # Add a histogram layer
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Set the histogram binwidth to 1.

ggplot(mtcars, aes(mpg)) +
  # Set the binwidth to 1
  geom_histogram(binwidth = 1)

Map y to the internal variable ..density.. to show frequency densities.

# Map y to ..density..
ggplot(mtcars, aes(mpg, ..density..)) +
  geom_histogram(binwidth = 1)

Set the fill color of the histogram bars to datacamp_light_blue.

datacamp_light_blue <- "#51A8C9"

ggplot(mtcars, aes(mpg, ..density..)) +
  # Set the fill color to datacamp_light_blue
  geom_histogram(binwidth = 1, fill = datacamp_light_blue)

Positions in histograms

Here, we’ll examine the various ways of applying positions to histograms. geom_histogram(), a special case of geom_bar(), has a position argument that can take on the following values:

stack (the default): Bars for different groups are stacked on top of each other. dodge: Bars for different groups are placed side by side. fill: Bars for different groups are shown as proportions. identity: Plot the values as they appear in the dataset.

EXERCISES:

Update the aesthetics so that the fill color of the bars is determined by fam.

# Update the aesthetics so the fill color is by fam
ggplot(mtcars, aes(mpg, fill = gear)) +
  geom_histogram(binwidth = 1)

Update the histogram layer to position the bars side-by-side, that is, “dodge”.

ggplot(mtcars, aes(mpg, fill = gear)) +
  geom_histogram(binwidth = 1, position  = "dodge")

Update the histogram layer so the bars’ positions “fill” the y-axis.

ggplot(mtcars, aes(mpg, fill = gear)) +
# Change the position to fill
  geom_histogram(binwidth = 1, position = "fill")
## Warning: Removed 8 rows containing missing values (geom_bar).

Update the histogram layer so bars are on top of each other, using the "identity" position. So that each bar can be seen, set alpha to 0.4.

ggplot(mtcars, aes(mpg, fill = gear)) +
# Change the position to identity, with transparency 0.4
  geom_histogram(binwidth = 1, position = "identity", alpha = 0.4 )

Bar plots

Position in bar and col plots

Let’s see how the position argument changes geom_bar().

We have three position options:

stack (the default), dodge (preferred), and fill (to show proportions). While we will be using geom_bar() here, note that the function geom_col() is just geom_bar() with both the position and stat arguments set to "identity". It is used when we want the heights of the bars to represent the exact values in the data.
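As a quick aside (not part of the exercise), the equivalence between geom_col() and geom_bar(stat = "identity") can be illustrated with a small sketch that uses pre-computed counts; the cyl_counts helper below is only for illustration.

# Pre-compute the counts per number of cylinders
cyl_counts <- mtcars %>%
  count(cyl)

# geom_col() plots the n values directly...
ggplot(cyl_counts, aes(factor(cyl), n)) +
  geom_col()

# ...which is equivalent to geom_bar() with stat = "identity"
ggplot(cyl_counts, aes(factor(cyl), n)) +
  geom_bar(stat = "identity")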

In this exercise, you’ll draw the total count of cars having a given number of cylinders (fcyl), according to manual or automatic transmission type (fam).

EXERCISES:

Using mtcars, plot fcyl, filled by fam. Add a bar layer using geom_bar().

# Plot fcyl, filled by fam
ggplot(mtcars, aes(cyl,fill = gear)) +
  # Add a bar layer
  geom_bar()

Set the bar position argument to “fill”.

# Plot fcyl, filled by fam
ggplot(mtcars, aes(cyl,fill = gear)) +
 geom_bar(position = "fill")

Change the bar position argument to “dodge”.

ggplot(mtcars, aes(cyl,fill = gear)) +
  # Change the position to "dodge"
  geom_bar(position = "dodge")

Overlapping bar plots

You can customize bar plots further by adjusting the dodging so that your bars partially overlap each other. Instead of using position = "dodge", you're going to use position_dodge(), like you did with position_jitter() in the previous exercises. Here, you'll save this as an object, posn_d, so that you can easily reuse it.

Remember, the reason you want to use position_dodge() (and position_jitter()) is to specify how much dodging (or jittering) you want.

For this example, you’ll use the mtcars dataset.

EXERCISES:

Use the functional form of the bar position: replace “dodge” with a call to position_dodge(). Set its width to 0.2.

ggplot(mtcars, aes(cyl, fill = gear)) +
  # Change position to use the functional form, with width 0.2
  geom_bar(position = position_dodge(width = 0.2))

Set the bar transparency level of the bars to 0.6.

ggplot(mtcars, aes(cyl, fill = gear)) +
# Set the transparency to 0.6
  geom_bar(position = position_dodge(width = 0.2), alpha = 0.6)

Bar plots: sequential color palette

In this bar plot, we’ll fill each segment according to an ordinal variable. The best way to do that is with a sequential color palette.

Here’s an example of using a sequential color palette with the mtcars dataset:

ggplot(mtcars, aes(fcyl, fill = fam)) +
  geom_bar() +
  scale_fill_brewer(palette = "Set1")

In the exercise, you'll use similar code on the Vocab dataset. Both datasets are ordinal.

EXERCISES:

Plot the Vocab dataset, mapping education onto x and vocabulary onto fill.

# Plot education, filled by vocabulary
ggplot(Vocab, aes(education, fill = vocabulary))

Add a bar layer, setting position to “fill”.

# Plot education, filled by vocabulary
ggplot(Vocab, aes(education, fill = vocabulary)) +
  # Add a bar layer with position "fill"
  geom_bar(position = "fill")

Add a brewer fill scale, using the default palette (don’t pass any arguments). Notice how this generates a warning message and an incomplete plot.

# Plot education, filled by vocabulary
ggplot(Vocab, aes(education, fill = vocabulary)) +
  # Add a bar layer with position "fill"
  geom_bar(position = "fill") +
  # Add a brewer fill scale with default palette
  scale_fill_brewer()

Multiple time series

We already saw how the form of your data affects how you can plot it. Let’s explore that further with multiple time series. Here, it’s important that all lines are on the same scale, and if possible, on the same plot.

fish.species contains the global capture rates of seven salmon species from 1950–2010. Each variable (column) is a Salmon species and each observation (row) is one year. fish.tidy contains the same data, but in three columns: Species, Year, and Capture (i.e. one variable per column).

EXERCISES:

Use str() in the console to examine the structure of both fish.species and fish.tidy. Plot only the Rainbow salmon time series with geom_line().

fish.species <- read.csv("./Data/fish.species.csv")
fish.tidy <- read.csv("./Data/fish.species.csv")
str(fish.species)
## 'data.frame':    427 obs. of  4 variables:
##  $ X      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Species: chr  "Pink" "Pink" "Pink" "Pink" ...
##  $ Year   : int  1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 ...
##  $ Capture: int  100600 259000 132600 235900 123400 244400 203400 270119 200798 200085 ...
str(fish.tidy)
## 'data.frame':    427 obs. of  4 variables:
##  $ X      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Species: chr  "Pink" "Pink" "Pink" "Pink" ...
##  $ Year   : int  1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 ...
##  $ Capture: int  100600 259000 132600 235900 123400 244400 203400 270119 200798 200085 ...
# Plot the Rainbow Salmon time series
ggplot(filter(fish.tidy, Species == "Rainbow"), aes(x = Year, y = Capture)) +
  geom_line()

Plot only the Pink salmon time series with geom_line().

# Plot the Rainbow Salmon time series
ggplot(subset(fish.species, Species == "Rainbow"), aes(x = Year, y = Capture)) +
  geom_line()

# Plot the Pink Salmon time series
ggplot(subset(fish.species, Species == "Pink"), aes(x = Year, y = Capture)) +
  geom_line()

Now try and plot all time series in a single plot.

Plot the fish.tidy dataset, mapping Year to x and Capture to y. group by fish species within the aesthetics of geom_line().

# Plot multiple time-series by grouping by species
ggplot(fish.tidy, aes(Year, Capture)) +
  geom_line(aes(group = Species))

Let’s add color to the previous plot to distinguish between the different time series.

Plot the fish.tidy dataset again, this time making sure to color by Species.

# Plot multiple time-series by coloring by species
ggplot(fish.tidy, aes(x = Year, y = Capture, color = Species)) +
  geom_line()

Themes

Themes from scratch

Moving the legend

Let’s wrap up this course by making a publication-ready plot communicating a clear message.

To change stylistic elements of a plot, call theme() and set plot properties to a new value. For example, the following changes the legend position.

p + theme(legend.position = new_value)

Here, the new value can be:

"top", "bottom", "left", or "right": place the legend at that side of the plot.
"none": don't draw it.
c(x, y): c(0, 0) means the bottom-left and c(1, 1) means the top-right.

Let's revisit the recession period line plot (assigned to plt_prop_unemployed_over_time).

EXERCISES:

Update the plot to remove the legend. Look at the changes in the plot.

# Plot multiple time-series by coloring by species
ggplot(fish.tidy, aes(x = Year, y = Capture, color = Species)) +
  geom_line() +
  theme(legend.position = "none")

Update the plot to position the legend at the bottom of the plot. Look at the changes in the plot.

# Plot multiple time-series by coloring by species
ggplot(fish.tidy, aes(x = Year, y = Capture, color = Species)) +
  geom_line() +
  theme(legend.position = "bottom")

Position the legend inside the plot, with x-position 0.6 and y-position 0.1. Look at the changes in the plot.

# Plot multiple time-series by coloring by species
ggplot(fish.tidy, aes(x = Year, y = Capture, color = Species)) +
  geom_line() +
  theme(legend.position = c(0.6, 0.1))

Modifying theme elements

Many plot elements have multiple properties that can be set. For example, line elements in the plot such as axes and gridlines have a color, a thickness (size), and a line type (solid line, dashed, or dotted). To set the style of a line, you use element_line(). For example, to make the axis lines into red, dashed lines, you would use the following.

p + theme(axis.line = element_line(color = "red", linetype = "dashed"))

Similarly, element_rect() changes rectangles and element_text() changes text. You can remove a plot element using element_blank().

plt_prop_unemployed_over_time is available.

EXERCISES:

Give all rectangles in the plot (the rect element) a fill color of "grey92" (very pale grey). Remove the legend.key's outline by setting its color to NA (missing). Look at the changes in the plot.

# Plot multiple time-series by coloring by species
ggplot(fish.tidy, aes(x = Year, y = Capture, color = Species)) +
  geom_line() +
  theme(
    # For all rectangles, set the fill color to grey92
    rect = element_rect(fill = "grey92"),
    # For the legend key, turn off the outline
    legend.key = element_rect(color = NA)
  )

Remove the axis ticks, axis.ticks, by making them a blank element. Remove the panel gridlines, panel.grid, in the same way. Look at the changes in the plot.

# Plot multiple time-series by coloring by species
ggplot(fish.tidy, aes(x = Year, y = Capture, color = Species)) +
  geom_line()+
  theme(
    rect = element_rect(fill = "grey92"),
    legend.key = element_rect(color = NA),
    # Turn off axis ticks
    axis.ticks = element_blank(),
    # Turn off the panel grid
    panel.grid = element_blank()
  )

Add the major horizontal grid lines back to the plot using panel.grid.major.y. Set the line color to “white”, size to 0.5, and linetype to “dotted”. Look at the changes in the plot.

# Plot multiple time-series by coloring by species
ggplot(fish.tidy, aes(x = Year, y = Capture, color = Species)) +
  geom_line() +
  theme(
    rect = element_rect(fill = "grey92"),
    legend.key = element_rect(color = NA),
    axis.ticks = element_blank(),
    panel.grid = element_blank(),
    # Add major y-axis panel grid lines back
    panel.grid.major.y = element_line(
      # Set the color to white
      color = "white",
      # Set the size to 0.5
      size = 0.5,
      # Set the line type to dotted
      linetype = "dotted"
    )
  )

Make the axis tick labels' text, axis.text, less prominent by changing the color to "grey25". Increase the plot.title's size to 16 and change its font face to "italic". Look at the changes in the plot.

# Plot multiple time-series by coloring by species
ggplot(fish.tidy, aes(x = Year, y = Capture, color = Species)) +
  geom_line() +
  theme(
    rect = element_rect(fill = "grey92"),
    legend.key = element_rect(color = NA),
    axis.ticks = element_blank(),
    panel.grid = element_blank(),
    panel.grid.major.y = element_line(
      color = "white",
      size = 0.5,
      linetype = "dotted"
    ),
    # Set the axis text color to grey25
    axis.text = element_text(color = "grey25"),
    # Set the plot title font face to italic and font size to 16
    plot.title = element_text(face = "italic", size = 16)
  )

Modifying whitespace

Whitespace means all the non-visible margins and spacing in the plot.

To set a single whitespace value, use unit(x, unit), where x is the amount and unit is the unit of measure.

Borders require you to set 4 positions, so use margin(top, right, bottom, left, unit). To remember the margin order, think TRouBLe.

The default unit is “pt” (points), which scales well with text. Other options include “cm”, “in” (inches) and “lines” (of text).

plt_mpg_vs_wt_by_cyl is available.

EXERCISES

Give the axis tick length, axis.ticks.length, a unit of 2 “lines”.

# Plot multiple time-series by coloring by species
ggplot(fish.tidy, aes(x = Year, y = Capture, color = Species)) +
  geom_line()  +
  theme(
    # Set the axis tick length to 2 lines
    axis.ticks.length = unit(2, "lines")
  )

Give the legend key size, legend.key.size, a unit of 3 centimeters (“cm”).

# Plot multiple time-series by coloring by species
ggplot(fish.tidy, aes(x = Year, y = Capture, color = Species)) +
  geom_line() +
  theme(
    # Set the legend key size to 3 centimeters
    legend.key.size = unit(3, "cm")
  )

Set the legend.margin to 20 points (“pt”) on the top, 30 pts on the right, 40 pts on the bottom, and 50 pts on the left.

# Plot multiple time-series by coloring by species
ggplot(fish.tidy, aes(x = Year, y = Capture, color = Species)) +
  geom_line() +
  theme(
    # Set the legend margin to (20, 30, 40, 50) points
    legend.margin = margin(20, 30, 40, 50, "pt")
  )

Set the plot margin, plot.margin, to 10, 30, 50, and 70 millimeters (“mm”).

# Plot multiple time-series by coloring by species
ggplot(fish.tidy, aes(x = Year, y = Capture, color = Species)) +
  geom_line() +
  theme(
    # Set the plot margin to (10, 30, 50, 70) millimeters
    plot.margin = margin(10, 30, 50, 70, "mm")
  )

Theme flexibility

Built-in themes

In addition to making your own themes, there are several out-of-the-box solutions that may save you lots of time.

theme_gray() is the default.
theme_bw() is useful when you use transparency.
theme_classic() is more traditional.
theme_void() removes everything but the data.

plt_prop_unemployed_over_time is available.

EXERCISES:

Add a black and white theme, theme_bw(), to the plot. What changed in the plot?

# Plot multiple time-series by coloring by species
ggplot(fish.tidy, aes(x = Year, y = Capture, color = Species)) +
  geom_line()  +
  theme_bw()

Add a classic theme, theme_classic(), to the plot. What changed in the plot?

# Plot multiple time-series by coloring by species
ggplot(fish.tidy, aes(x = Year, y = Capture, color = Species)) +
  geom_line() +
  theme_classic()

Add a void theme, theme_void(), to the plot. What changed in the plot?

# Plot multiple time-series by coloring by species
ggplot(fish.tidy, aes(x = Year, y = Capture, color = Species)) +
  geom_line() +
  theme_void()

Exploring ggthemes

Outside of ggplot2, another source of built-in themes is the ggthemes package. The workspace already contains plt_prop_unemployed_over_time, the line plot from before. Let's explore some of the ready-made ggthemes themes.

plt_prop_unemployed_over_time is available.

EXERCISES:

Add a fivethirtyeight.com theme, theme_fivethirtyeight(), to the plot. What changed in the plot?

library(dplyr)
library(ggplot2)
library(ggthemes)
fish.tidy <- read.csv("./Data/fish.species.csv")
# Plot multiple time-series by coloring by species
ggplot(fish.tidy, aes(x = Year, y = Capture, color = Species)) +
  geom_line()  +
  theme_fivethirtyeight()

Add an Edward Tufte theme, theme_tufte(), to the plot. What changed in the plot?

# Plot multiple time-series by coloring by species
ggplot(fish.tidy, aes(x = Year, y = Capture, color = Species)) +
  geom_line() +
  theme_tufte()

Add a Wall Street Journal theme, theme_wsj(), to the plot. What changed in the plot?

# Plot multiple time-series by coloring by species
ggplot(fish.tidy, aes(x = Year, y = Capture, color = Species)) +
  geom_line()+
  theme_wsj()

Setting themes

Reusing a theme across many plots helps to provide a consistent style. You have several options for this.

Assign the theme to a variable, and add it to each plot.
Set your theme as the default using theme_set().

A good strategy that you'll use here is to begin with a built-in theme, then modify it.

plt_prop_unemployed_over_time is available. The theme you made earlier is shown in the sample code.

EXERCISES:

Assign the theme to theme_recession. Add the Tufte theme and theme_recession together. Use the Tufte recession theme by adding it to the plot.

# Theme layer saved as an object, theme_recession
theme_recession <- theme(
  rect = element_rect(fill = "grey92"),
  legend.key = element_rect(color = NA),
  axis.ticks = element_blank(),
  panel.grid = element_blank(),
  panel.grid.major.y = element_line(color = "white", size = 0.5, linetype = "dotted"),
  axis.text = element_text(color = "grey25"),
  plot.title = element_text(face = "italic", size = 16),
  legend.position = c(0.6, 0.1)
)

# Combine the Tufte theme with theme_recession
theme_tufte_recession <- theme_tufte() + theme_recession

# Plot multiple time-series by coloring by species
ggplot(fish.tidy, aes(x = Year, y = Capture, color = Species)) +
  geom_line() + theme_tufte_recession
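The second option mentioned above, theme_set(), was not used in the exercise; a minimal sketch of how it would work with the theme built here:

# Make the combined theme the default for every subsequent plot
old_theme <- theme_set(theme_tufte_recession)

# New plots now pick up theme_tufte_recession automatically
ggplot(fish.tidy, aes(Year, Capture, color = Species)) +
  geom_line()

# Restore the previous default when you are done
theme_set(old_theme)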

Publication-quality plots

We’ve seen many examples of beautiful, publication-quality plots. Let’s take a final look and put all the pieces together.

plt_prop_unemployed_over_time is available.

EXERCISES:

Add Tufte’s theme.

# Plot multiple time-series by coloring by species
ggplot(fish.tidy, aes(x = Year, y = Capture, color = Species)) +
  geom_line() +
  # Add Tufte's theme
  theme_tufte()

Call the function to add individual theme elements. Turn off the legend and the axis ticks.

# Plot multiple time-series by coloring by species
ggplot(fish.tidy, aes(x = Year, y = Capture, color = Species)) +
  geom_line() +
  theme_tufte() +
  # Add individual theme elements
  theme(
    # Turn off the legend
    legend.position = "none",
    # Turn off the axis ticks
    axis.ticks = element_blank()
  )

Set the axis title and axis text’s text color to grey60.

# Plot multiple time-series by coloring by species
ggplot(fish.tidy, aes(x = Year, y = Capture, color = Species)) +
  geom_line()  +
  theme_tufte() +
  theme(
    legend.position = "none",
    axis.ticks = element_blank(),
    # Set the axis title's text color to grey60
    axis.title = element_text(color = "grey60"),
    # Set the axis text's text color to grey60
    axis.text = element_text(color = "grey60")
  )

Set the major y panel gridlines, panel.grid.major.y. Set the color to grey60, the size to 0.25, and the line type to dotted.

# Plot multiple time-series by coloring by species
ggplot(fish.tidy, aes(x = Year, y = Capture, color = Species)) +
  geom_line() +
  theme_tufte() +
  theme(
    legend.position = "none",
    axis.ticks = element_blank(),
    axis.title = element_text(color = "grey60"),
    axis.text = element_text(color = "grey60"),
    # Set the panel gridlines major y values
    panel.grid.major.y = element_line(
      # Set the color to grey60
      color = "grey60",
      # Set the size to 0.25
      size = 0.25,
      # Set the linetype to dotted
      linetype = "dotted"
    )
  )

Effective explanatory plots

Effective explanatory plots

Let’s focus on producing beautiful and effective explanatory plots. In the next couple of exercises, you’ll create a plot that is similar to the one shown in the video using gm2007, a filtered subset of the gapminder dataset.

This type of plot will be in an info-viz style, meaning that it would be similar to something you’d see in a magazine or website for a mostly lay audience.

A scatterplot of lifeExp by country, colored by lifeExp, with points of size 4, is provided.

EXERCISES:

geom_segment() adds line segments and requires two additional aesthetics: xend and yend. To draw a horizontal line for each point, map 30 onto xend and country onto yend.

colnames(gapminder)[4] <- "lifeExp"
# Add a geom_segment() layer
ggplot(gapminder, aes(x = lifeExp, y = country, color = lifeExp)) +
  geom_point(size = 4) +
  geom_segment(aes(xend = 30, yend = country), size = 2)

geom_text also needs an additional aesthetic: label. Map lifeExp onto label, and set the attributes color to “white” and size to 1.5.

# Add a geom_text() layer
ggplot(gapminder, aes(x = lifeExp, y = country, color = lifeExp)) +
  geom_point(size = 4) +
  geom_segment(aes(xend = 30, yend = country), size = 2) +
  geom_text(aes(label = lifeExp), color = "white", size = 1.5)

The color scale has been set for you, but you need to clean up the scales. For the x scale: Set expand to c(0, 0) and limits to c(30, 90). Place the axis on the top of the plot with the position argument.

library(RColorBrewer)
## Warning: package 'RColorBrewer' was built under R version 4.0.3
# Set the color scale
palette <- brewer.pal(5, "RdYlBu")[-(2:4)]

# Modify the scales
ggplot(gapminder, aes(x = lifeExp, y = country, color = lifeExp)) +
  geom_point(size = 4) +
  geom_segment(aes(xend = 30, yend = country), size = 2) +
  geom_text(aes(label = round(lifeExp,1)), color = "white", size = 1.5) +
  scale_x_continuous("", expand = c(0, 0), limits = c(30, 90), position = "top") +
  scale_color_gradientn(colors = palette)
## Warning: Removed 2 rows containing missing values (geom_point).
## Warning: Removed 2 rows containing missing values (geom_segment).
## Warning: Removed 2 rows containing missing values (geom_text).

Make sure to label the plot appropriately using labs():

Make the title “Highest and lowest life expectancies, 2007”. Add a reference by setting caption to “Source: gapminder”.

# Set the color scale
palette <- brewer.pal(5, "RdYlBu")[-(2:4)]

# Add a title and caption
ggplot(gapminder, aes(x = lifeExp, y = country, color = lifeExp)) +
  geom_point(size = 4) +
  geom_segment(aes(xend = 30, yend = country), size = 2) +
  geom_text(aes(label = round(lifeExp,1)), color = "white", size = 1.5) +
  scale_x_continuous("", expand = c(0,0), limits = c(30,90), position = "top") +
  scale_color_gradientn(colors = palette) +
  labs(title = "Highest and lowest life expectancies, 2007", caption = "Source: gapminder")
## Warning: Removed 2 rows containing missing values (geom_point).
## Warning: Removed 2 rows containing missing values (geom_segment).
## Warning: Removed 2 rows containing missing values (geom_text).

8.- Introduction to Importing Data in R

Importing data from flat files with utils

Introduction & read.csv

read.csv

The utils package, which is automatically loaded in your R session on startup, can import CSV files with the read.csv() function.

In this exercise, you’ll be working with swimming_pools.csv; it contains data on swimming pools in Brisbane, Australia (Source: data.gov.au). The file contains the column names in the first row. It uses a comma to separate values within rows.

pools <- read.delim("./Data/swimming_pools.txt", sep = ",")

Type dir() in the console to list the files in your working directory. You’ll see that it contains swimming_pools.csv, so you can start straight away.
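For reference, the call the exercise itself has in mind (assuming the comma-separated swimming_pools.csv with a header row sits in your working directory) is simply:

# Import swimming_pools.csv with read.csv(): pools
pools <- read.csv("swimming_pools.csv")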

EXERCISES:
# Print the structure of pools
str(pools)
## 'data.frame':    20 obs. of  4 variables:
##  $ Name     : chr  "Acacia Ridge Leisure Centre" "Bellbowrie Pool" "Carole Park" "Centenary Pool (inner City)" ...
##  $ Address  : chr  "1391 Beaudesert Road, Acacia Ridge" "Sugarwood Street, Bellbowrie" "Cnr Boundary Road and Waterford Road Wacol" "400 Gregory Terrace, Spring Hill" ...
##  $ Latitude : num  -27.6 -27.6 -27.6 -27.5 -27.4 ...
##  $ Longitude: num  153 153 153 153 153 ...

stringsAsFactors

With stringsAsFactors, you can tell R whether it should convert strings in the flat file to factors.

For all importing functions in the utils package, this argument was TRUE by default in R versions before 4.0.0 (since R 4.0.0 the default is FALSE), which meant that strings were imported as factors. This only makes sense if the strings you import represent categorical variables in R. If you set stringsAsFactors to FALSE, the data frame columns corresponding to strings in your text file will be character.

You’ll again be working with the swimming_pools.csv file. It contains two columns (Name and Address), which shouldn’t be factors.

EXERCISES:

Use read.csv() to import the data in “swimming_pools.csv” as a data frame called pools; make sure that strings are imported as characters, not as factors.

pools <- read.delim("./Data/swimming_pools.txt", sep = ",",stringsAsFactors= FALSE)

Using str(), display the structure of the dataset and check that you indeed get character vectors instead of factors.

# Check the structure of pools
str(pools)
## 'data.frame':    20 obs. of  4 variables:
##  $ Name     : chr  "Acacia Ridge Leisure Centre" "Bellbowrie Pool" "Carole Park" "Centenary Pool (inner City)" ...
##  $ Address  : chr  "1391 Beaudesert Road, Acacia Ridge" "Sugarwood Street, Bellbowrie" "Cnr Boundary Road and Waterford Road Wacol" "400 Gregory Terrace, Spring Hill" ...
##  $ Latitude : num  -27.6 -27.6 -27.6 -27.5 -27.4 ...
##  $ Longitude: num  153 153 153 153 153 ...

read.delim & read.table

read.delim

Aside from .csv files, there are also .txt files, which are basically plain text files. You can import these files with read.delim(). By default, it sets the sep argument to "\t" (fields in a record are delimited by tabs) and the header argument to TRUE (the first row contains the field names).

In this exercise, you will import hotdogs.txt, containing information on sodium and calorie levels in different hotdogs (Source: UCLA). The dataset has 3 variables, but the variable names are not available in the first line of the file. The file uses tabs as field separators.

EXERCISES:

Import the data in “hotdogs.txt” with read.delim(). Call the resulting data frame hotdogs. The variable names are not on the first line, so make sure to set the header argument appropriately.

# Import hotdogs.txt (tab-delimited, no header row): hotdogs
hotdogs <- read.delim("./Data/hotdogs.txt", header = FALSE)

Call summary() on hotdogs. This will print out some summary statistics about all variables in the data frame.

# Summarize hotdogs
summary(hotdogs)

read.table

If you’re dealing with more exotic flat file formats, you’ll want to use read.table(). It’s the most basic importing function; you can specify tons of different arguments in this function. Unlike read.csv() and read.delim(), the header argument defaults to FALSE and the sep argument is "" by default.

Up to you again! The data is still hotdogs.txt. It has no column names in the first row, and the field separators are tabs. This time, though, the file is in the data folder inside your current working directory. A variable path with the location of this file is already coded for you.

EXERCISES:

Finish the read.table() call that’s been prepared for you. Use the path variable, and make sure to set sep correctly.

# Path to the hotdogs.txt file: path
path <- file.path("data", "hotdogs.txt")

# Import the hotdogs.txt file: hotdogs
hotdogs <- read.table(path, 
                      sep = "\t", 
                      col.names = c("type", "calories", "sodium"))

Call head() on hotdogs; this will print the first 6 observations in the data frame.

# Call head() on hotdogs
head(hotdogs)
##   type calories sodium
## 1 Beef      186    495
## 2 Beef      181    477
## 3 Beef      176    425
## 4 Beef      149    322
## 5 Beef      184    482
## 6 Beef      190    587

Arguments

Lily and Tom are having an argument because they want to share a hot dog but they can’t seem to agree on which one to choose. After some time, they simply decide that they will have one each. Lily wants to have the one with the fewest calories while Tom wants to have the one with the most sodium.

Next to calories and sodium, the hotdogs have one more variable: type. This can be one of three things: Beef, Meat, or Poultry, so a categorical variable: a factor is fine.

EXERCISES:

Finish the read.delim() call to import the data in “hotdogs.txt”. It’s a tab-delimited file without names in the first row.

# Finish the read.delim() call
hotdogs <- read.delim("./Data/hotdogs.txt", header = F, col.names = c("type", "calories", "sodium"))

The code that selects the observation with the lowest calorie count and stores it in the variable lily is already available. It uses the function which.min(), which returns the index of the smallest value in a vector.

# Select the hot dog with the least calories: lily
lily <- hotdogs[which.min(hotdogs$calories), ]
# Select the observation with the most sodium: tom
tom <- hotdogs[which.max(hotdogs$sodium), ]

Do a similar thing for Tom: select the observation with the most sodium and store it in tom. Use which.max() this time.

Finally, print both the observations lily and tom.

# Print lily and tom

lily
##       type calories sodium
## 50 Poultry       86    358
tom
##    type calories sodium
## 15 Beef      190    645

Column classes

Next to column names, you can also specify the column types or column classes of the resulting data frame. You can do this by setting the colClasses argument to a vector of strings representing classes:

read.delim("my_file.txt", colClasses = c("character", "numeric", "logical"))

This approach can be useful if you have some columns that should be factors and others that should be characters. You don't have to bother with stringsAsFactors anymore; just state for each column what the class should be.

If a column is set to “NULL” in the colClasses vector, this column will be skipped and will not be loaded into the data frame.

EXERCISES:

The read.delim() call from before is already included and creates the hotdogs data frame. Go ahead and display the structure of hotdogs.

# Previous call to import hotdogs.txt
hotdogs <- read.delim("./Data/hotdogs.txt", header = FALSE, col.names = c("type", "calories", "sodium"))

# Display structure of hotdogs

str(hotdogs)
## 'data.frame':    54 obs. of  3 variables:
##  $ type    : chr  "Beef" "Beef" "Beef" "Beef" ...
##  $ calories: int  186 181 176 149 184 190 158 139 175 148 ...
##  $ sodium  : int  495 477 425 322 482 587 370 322 479 375 ...

Edit the second read.delim() call. Assign the correct vector to the colClasses argument. NA should be replaced with a character vector: c(“factor”, “NULL”, “numeric”).

Display the structure of hotdogs2 and look for the difference.

# Edit the colClasses argument to import the data correctly: hotdogs2
hotdogs2 <- read.delim("./Data/hotdogs.txt", header = FALSE, 
                       col.names = c("type", "calories", "sodium"),
                       colClasses = c("factor", "NULL","numeric"))


# Display structure of hotdogs2
str(hotdogs2)
## 'data.frame':    54 obs. of  2 variables:
##  $ type  : Factor w/ 3 levels "Beef","Meat",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ sodium: num  495 477 425 322 482 587 370 322 479 375 ...

readr & data.table

readr: read_csv & read_tsv

read_csv

CSV files can be imported with read_csv(). It’s a wrapper function around read_delim() that handles all the details for you. For example, it will assume that the first row contains the column names.

The dataset you’ll be working with here is potatoes.csv. It gives information on the impact of storage period and cooking on potatoes’ flavor. It uses commas to delimit fields in a record, and contains column names in the first row. The file is available in your workspace. Remember that you can inspect your workspace with dir().

EXERCISES:

Load the readr package with library(). You do not need to install the package, it is already installed on DataCamp’s servers.

# Load the readr package
library(readr)
## Warning: package 'readr' was built under R version 4.0.3

Import “potatoes.csv” using read_csv(). Assign the resulting data frame to the variable potatoes.

# Import potatoes.csv with read_csv(): potatoes
potatoes <- read_csv("./Data/potatoes.txt")  # the local copy is a comma-separated .txt file

read_tsv

Where you use read_csv() to easily read in CSV files, you use read_tsv() to easily read in TSV files. TSV is short for tab-separated values.

This time, the potatoes data comes in the form of a tab-separated values file; potatoes.txt is available in your workspace. In contrast to potatoes.csv, this file does not contain column names in the first row, though.

There’s a vector properties that you can use to specify these column names manually.

EXERCISES:

Use read_tsv() to import the potatoes data from potatoes.txt and store it in the data frame potatoes. In addition to the path to the file, you’ll also have to specify the col_names argument; you can use the properties vector for this.

# Column names
properties <- c("area", "temp", "size", "storage", "method",
                "texture", "flavor", "moistness")

Call head() on potatoes to show the first observations of your dataset.

# Import potatoes.csv with read_csv(): potatoes
potatoes <- read.delim("./Data/potatoes.txt", sep = ",",  col.names = properties)
head(potatoes)
##   area temp size storage method texture flavor moistness
## 1    1    1    1       1      1     2.9    3.2       3.0
## 2    1    1    1       1      2     2.3    2.5       2.6
## 3    1    1    1       1      3     2.5    2.8       2.8
## 4    1    1    1       1      4     2.1    2.9       2.4
## 5    1    1    1       1      5     1.9    2.8       2.2
## 6    1    1    1       2      1     1.8    3.0       1.7
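The exercise actually asks for read_tsv(); the local potatoes.txt used in these notes is comma-separated, which is why read.delim() was used above. For the tab-separated, headerless file the course describes, the call would look like this:

# Import a tab-separated, headerless potatoes file with read_tsv(): potatoes
potatoes <- read_tsv("potatoes.txt", col_names = properties)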

readr: read_delim

read_delim

Just as read.table() was the main utils function, read_delim() is the main readr function.

read_delim() takes two mandatory arguments:

file: the file that contains the data
delim: the character that separates the values in the data file

You'll again be working with potatoes.txt; the file uses tabs ("\t") to delimit values and does not contain column names in its first line. It's available in your working directory so you can start right away. As before, the vector properties is available to set the col_names.

EXERCISES:

Import all the data in “potatoes.txt” using read_delim(); store the resulting data frame in potatoes.

# Column names
properties <- c("area", "temp", "size", "storage", "method",
                "texture", "flavor", "moistness")

# Import potatoes.txt using read_delim(): potatoes
# (the local potatoes.txt is comma-separated with a header row,
# unlike the tab-separated, headerless file the exercise describes)
potatoes <- read_delim("./Data/potatoes.txt", delim = ",")

Print out potatoes.

# Print out potatoes
head(potatoes)

data.table: fread

fread

You still remember how to use read.table(), right? Well, fread() is a function that does the same job with very similar arguments. It is extremely easy to use and blazingly fast! Often, simply specifying the path to the file is enough to successfully import your data.

Don’t take our word for it, try it yourself! You’ll be working with the potatoes.csv file, that’s available in your workspace. Fields are delimited by commas, and the first line contains the column names.

EXERCISES:

Use library() to load (NOT install) the data.table package. You do not need to install the package, it is already installed on DataCamp’s servers. Import “potatoes.csv” with fread(). Simply pass it the file path and see if it worked. Store the result in a variable potatoes.

# load the data.table package using library()
library(data.table)
## Warning: package 'data.table' was built under R version 4.0.3
## 
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
# Import potatoes.csv with fread(): potatoes
potatoes  <- fread("./Data/potatoes.txt")

Print out potatoes.

# Print out potatoes
head(potatoes)
##    area temp size storage method texture flavor moistness
## 1:    1    1    1       1      1     2.9    3.2       3.0
## 2:    1    1    1       1      2     2.3    2.5       2.6
## 3:    1    1    1       1      3     2.5    2.8       2.8
## 4:    1    1    1       1      4     2.1    2.9       2.4
## 5:    1    1    1       1      5     1.9    2.8       2.2
## 6:    1    1    1       2      1     1.8    3.0       1.7

fread: more advanced use

Now that you know the basics about fread(), you should know about two arguments of the function: drop and select, to drop or select variables of interest.

Suppose you have a dataset that contains 5 variables and you want to keep the first and fifth variable, named “a” and “e”. The following options will all do the trick:

fread("path/to/file.txt", drop = 2:4)
fread("path/to/file.txt", select = c(1, 5))
fread("path/to/file.txt", drop = c("b", "c", "d"))
fread("path/to/file.txt", select = c("a", "e"))

Let's stick with potatoes since we're particularly fond of them here at DataCamp. The data is again available in the file potatoes.csv, containing comma-separated records.

EXERCISES

Using fread() and select or drop as arguments, only import the texture and moistness columns of the flat file. They correspond to the columns 6 and 8 in “potatoes.csv”. Store the result in a variable potatoes.

# fread is already loaded

# Import columns 6 and 8 of potatoes.csv: potatoes
potatoes <- fread("./Data/potatoes.txt", select = c(6,8))

plot() 2 columns of the potatoes data frame: texture on the x-axis, moistness on the y-axis. Use the dollar sign notation twice. Feel free to name your axes and plot.

# Plot texture (x) and moistness (y) of potatoes
plot(potatoes$texture, potatoes$moistness)

Importing Excel data

readxl

List the sheets of an Excel file

Before you can start importing from Excel, you should find out which sheets are available in the workbook. You can use the excel_sheets() function for this.

You will find the Excel file urbanpop.xlsx in your working directory (type dir() to see it). This dataset contains urban population metrics for practically all countries in the world throughout time (Source: Gapminder). It contains three sheets for three different time periods. In each sheet, the first row contains the column names.

EXERCISES:

Load the readxl package using library(). It’s already installed on DataCamp’s servers.

# Load the readxl package
library(readxl)
## Warning: package 'readxl' was built under R version 4.0.3

Use excel_sheets() to print out the names of the sheets in urbanpop.xlsx.

# Print the names of all worksheets
excel_sheets("./Data/urbanpop.xlsx")
## [1] "1960-1966" "1967-1974" "1975-2011"

Import an Excel sheet

Now that you know the names of the sheets in the Excel file you want to import, it is time to import those sheets into R. You can do this with the read_excel() function. Have a look at this recipe:

data <- read_excel("data.xlsx", sheet = "my_sheet")

This call simply imports the sheet with the name "my_sheet" from the "data.xlsx" file. You can also pass a number to the sheet argument; this will cause read_excel() to import the sheet with the given sheet number. sheet = 1 will import the first sheet, sheet = 2 will import the second sheet, and so on.

In this exercise, you’ll continue working with the urbanpop.xlsx file.

EXERCISES:

The code to import the first and second sheets is already included. Can you add a command to also import the third sheet, and store the resulting data frame in pop_3? Store the data frames pop_1, pop_2 and pop_3 in a list, that you call pop_list. Display the structure of pop_list.

# The readxl package is already loaded

# Read the sheets, one by one
pop_1 <- read_excel("./Data/urbanpop.xlsx", sheet = 1)
pop_2 <- read_excel("./Data/urbanpop.xlsx", sheet = 2)
pop_3 <- read_excel("./Data/urbanpop.xlsx", sheet = 3)
# Put pop_1, pop_2 and pop_3 in a list: pop_list
pop_list <- list(pop_1, pop_2, pop_3)

# Display the structure of pop_list
str(pop_list)
## List of 3
##  $ : tibble [209 x 8] (S3: tbl_df/tbl/data.frame)
##   ..$ country: chr [1:209] "Afghanistan" "Albania" "Algeria" "American Samoa" ...
##   ..$ 1960   : num [1:209] 769308 494443 3293999 NA NA ...
##   ..$ 1961   : num [1:209] 814923 511803 3515148 13660 8724 ...
##   ..$ 1962   : num [1:209] 858522 529439 3739963 14166 9700 ...
##   ..$ 1963   : num [1:209] 903914 547377 3973289 14759 10748 ...
##   ..$ 1964   : num [1:209] 951226 565572 4220987 15396 11866 ...
##   ..$ 1965   : num [1:209] 1000582 583983 4488176 16045 13053 ...
##   ..$ 1966   : num [1:209] 1058743 602512 4649105 16693 14217 ...
##  $ : tibble [209 x 9] (S3: tbl_df/tbl/data.frame)
##   ..$ country: chr [1:209] "Afghanistan" "Albania" "Algeria" "American Samoa" ...
##   ..$ 1967   : num [1:209] 1119067 621180 4826104 17349 15440 ...
##   ..$ 1968   : num [1:209] 1182159 639964 5017299 17996 16727 ...
##   ..$ 1969   : num [1:209] 1248901 658853 5219332 18619 18088 ...
##   ..$ 1970   : num [1:209] 1319849 677839 5429743 19206 19529 ...
##   ..$ 1971   : num [1:209] 1409001 698932 5619042 19752 20929 ...
##   ..$ 1972   : num [1:209] 1502402 720207 5815734 20263 22406 ...
##   ..$ 1973   : num [1:209] 1598835 741681 6020647 20742 23937 ...
##   ..$ 1974   : num [1:209] 1696445 763385 6235114 21194 25482 ...
##  $ : tibble [209 x 38] (S3: tbl_df/tbl/data.frame)
##   ..$ country: chr [1:209] "Afghanistan" "Albania" "Algeria" "American Samoa" ...
##   ..$ 1975   : num [1:209] 1793266 785350 6460138 21632 27019 ...
##   ..$ 1976   : num [1:209] 1905033 807990 6774099 22047 28366 ...
##   ..$ 1977   : num [1:209] 2021308 830959 7102902 22452 29677 ...
##   ..$ 1978   : num [1:209] 2142248 854262 7447728 22899 31037 ...
##   ..$ 1979   : num [1:209] 2268015 877898 7810073 23457 32572 ...
##   ..$ 1980   : num [1:209] 2398775 901884 8190772 24177 34366 ...
##   ..$ 1981   : num [1:209] 2493265 927224 8637724 25173 36356 ...
##   ..$ 1982   : num [1:209] 2590846 952447 9105820 26342 38618 ...
##   ..$ 1983   : num [1:209] 2691612 978476 9591900 27655 40983 ...
##   ..$ 1984   : num [1:209] 2795656 1006613 10091289 29062 43207 ...
##   ..$ 1985   : num [1:209] 2903078 1037541 10600112 30524 45119 ...
##   ..$ 1986   : num [1:209] 3006983 1072365 11101757 32014 46254 ...
##   ..$ 1987   : num [1:209] 3113957 1109954 11609104 33548 47019 ...
##   ..$ 1988   : num [1:209] 3224082 1146633 12122941 35095 47669 ...
##   ..$ 1989   : num [1:209] 3337444 1177286 12645263 36618 48577 ...
##   ..$ 1990   : num [1:209] 3454129 1198293 13177079 38088 49982 ...
##   ..$ 1991   : num [1:209] 3617842 1215445 13708813 39600 51972 ...
##   ..$ 1992   : num [1:209] 3788685 1222544 14248297 41049 54469 ...
##   ..$ 1993   : num [1:209] 3966956 1222812 14789176 42443 57079 ...
##   ..$ 1994   : num [1:209] 4152960 1221364 15322651 43798 59243 ...
##   ..$ 1995   : num [1:209] 4347018 1222234 15842442 45129 60598 ...
##   ..$ 1996   : num [1:209] 4531285 1228760 16395553 46343 60927 ...
##   ..$ 1997   : num [1:209] 4722603 1238090 16935451 47527 60462 ...
##   ..$ 1998   : num [1:209] 4921227 1250366 17469200 48705 59685 ...
##   ..$ 1999   : num [1:209] 5127421 1265195 18007937 49906 59281 ...
##   ..$ 2000   : num [1:209] 5341456 1282223 18560597 51151 59719 ...
##   ..$ 2001   : num [1:209] 5564492 1315690 19198872 52341 61062 ...
##   ..$ 2002   : num [1:209] 5795940 1352278 19854835 53583 63212 ...
##   ..$ 2003   : num [1:209] 6036100 1391143 20529356 54864 65802 ...
##   ..$ 2004   : num [1:209] 6285281 1430918 21222198 56166 68301 ...
##   ..$ 2005   : num [1:209] 6543804 1470488 21932978 57474 70329 ...
##   ..$ 2006   : num [1:209] 6812538 1512255 22625052 58679 71726 ...
##   ..$ 2007   : num [1:209] 7091245 1553491 23335543 59894 72684 ...
##   ..$ 2008   : num [1:209] 7380272 1594351 24061749 61118 73335 ...
##   ..$ 2009   : num [1:209] 7679982 1635262 24799591 62357 73897 ...
##   ..$ 2010   : num [1:209] 7990746 1676545 25545622 63616 74525 ...
##   ..$ 2011   : num [1:209] 8316976 1716842 26216968 64817 75207 ...

Reading a workbook

In the previous exercise you generated a list of three Excel sheets that you imported. However, loading in every sheet manually and then merging them in a list can be quite tedious. Luckily, you can automate this with lapply(). If you have no experience with lapply(), feel free to take Chapter 4 of the Intermediate R course.

Have a look at the example code below:

my_workbook <- lapply(excel_sheets("data.xlsx"), read_excel, path = "data.xlsx")

The read_excel() function is called multiple times on the "data.xlsx" file and each sheet is loaded in one after the other. The result is a list of data frames, each data frame representing one of the sheets in data.xlsx.

You’re still working with the urbanpop.xlsx file.

EXERCISES

Use lapply() in combination with excel_sheets() and read_excel() to read all the Excel sheets in “urbanpop.xlsx”. Name the resulting list pop_list.

# The readxl package is already loaded

# Read all Excel sheets with lapply(): pop_list
pop_list <- lapply(excel_sheets("./Data/urbanpop.xlsx"), read_excel, path = "./Data/urbanpop.xlsx")

Print the structure of pop_list.

# Display the structure of pop_list
str(pop_list)
## List of 3
##  $ : tibble [209 x 8] (S3: tbl_df/tbl/data.frame)
##   ..$ country: chr [1:209] "Afghanistan" "Albania" "Algeria" "American Samoa" ...
##   ..$ 1960   : num [1:209] 769308 494443 3293999 NA NA ...
##   ..$ 1961   : num [1:209] 814923 511803 3515148 13660 8724 ...
##   ..$ 1962   : num [1:209] 858522 529439 3739963 14166 9700 ...
##   ..$ 1963   : num [1:209] 903914 547377 3973289 14759 10748 ...
##   ..$ 1964   : num [1:209] 951226 565572 4220987 15396 11866 ...
##   ..$ 1965   : num [1:209] 1000582 583983 4488176 16045 13053 ...
##   ..$ 1966   : num [1:209] 1058743 602512 4649105 16693 14217 ...
##  $ : tibble [209 x 9] (S3: tbl_df/tbl/data.frame)
##   ..$ country: chr [1:209] "Afghanistan" "Albania" "Algeria" "American Samoa" ...
##   ..$ 1967   : num [1:209] 1119067 621180 4826104 17349 15440 ...
##   ..$ 1968   : num [1:209] 1182159 639964 5017299 17996 16727 ...
##   ..$ 1969   : num [1:209] 1248901 658853 5219332 18619 18088 ...
##   ..$ 1970   : num [1:209] 1319849 677839 5429743 19206 19529 ...
##   ..$ 1971   : num [1:209] 1409001 698932 5619042 19752 20929 ...
##   ..$ 1972   : num [1:209] 1502402 720207 5815734 20263 22406 ...
##   ..$ 1973   : num [1:209] 1598835 741681 6020647 20742 23937 ...
##   ..$ 1974   : num [1:209] 1696445 763385 6235114 21194 25482 ...
##  $ : tibble [209 x 38] (S3: tbl_df/tbl/data.frame)
##   ..$ country: chr [1:209] "Afghanistan" "Albania" "Algeria" "American Samoa" ...
##   ..$ 1975   : num [1:209] 1793266 785350 6460138 21632 27019 ...
##   ..$ 1976   : num [1:209] 1905033 807990 6774099 22047 28366 ...
##   ..$ 1977   : num [1:209] 2021308 830959 7102902 22452 29677 ...
##   ..$ 1978   : num [1:209] 2142248 854262 7447728 22899 31037 ...
##   ..$ 1979   : num [1:209] 2268015 877898 7810073 23457 32572 ...
##   ..$ 1980   : num [1:209] 2398775 901884 8190772 24177 34366 ...
##   ..$ 1981   : num [1:209] 2493265 927224 8637724 25173 36356 ...
##   ..$ 1982   : num [1:209] 2590846 952447 9105820 26342 38618 ...
##   ..$ 1983   : num [1:209] 2691612 978476 9591900 27655 40983 ...
##   ..$ 1984   : num [1:209] 2795656 1006613 10091289 29062 43207 ...
##   ..$ 1985   : num [1:209] 2903078 1037541 10600112 30524 45119 ...
##   ..$ 1986   : num [1:209] 3006983 1072365 11101757 32014 46254 ...
##   ..$ 1987   : num [1:209] 3113957 1109954 11609104 33548 47019 ...
##   ..$ 1988   : num [1:209] 3224082 1146633 12122941 35095 47669 ...
##   ..$ 1989   : num [1:209] 3337444 1177286 12645263 36618 48577 ...
##   ..$ 1990   : num [1:209] 3454129 1198293 13177079 38088 49982 ...
##   ..$ 1991   : num [1:209] 3617842 1215445 13708813 39600 51972 ...
##   ..$ 1992   : num [1:209] 3788685 1222544 14248297 41049 54469 ...
##   ..$ 1993   : num [1:209] 3966956 1222812 14789176 42443 57079 ...
##   ..$ 1994   : num [1:209] 4152960 1221364 15322651 43798 59243 ...
##   ..$ 1995   : num [1:209] 4347018 1222234 15842442 45129 60598 ...
##   ..$ 1996   : num [1:209] 4531285 1228760 16395553 46343 60927 ...
##   ..$ 1997   : num [1:209] 4722603 1238090 16935451 47527 60462 ...
##   ..$ 1998   : num [1:209] 4921227 1250366 17469200 48705 59685 ...
##   ..$ 1999   : num [1:209] 5127421 1265195 18007937 49906 59281 ...
##   ..$ 2000   : num [1:209] 5341456 1282223 18560597 51151 59719 ...
##   ..$ 2001   : num [1:209] 5564492 1315690 19198872 52341 61062 ...
##   ..$ 2002   : num [1:209] 5795940 1352278 19854835 53583 63212 ...
##   ..$ 2003   : num [1:209] 6036100 1391143 20529356 54864 65802 ...
##   ..$ 2004   : num [1:209] 6285281 1430918 21222198 56166 68301 ...
##   ..$ 2005   : num [1:209] 6543804 1470488 21932978 57474 70329 ...
##   ..$ 2006   : num [1:209] 6812538 1512255 22625052 58679 71726 ...
##   ..$ 2007   : num [1:209] 7091245 1553491 23335543 59894 72684 ...
##   ..$ 2008   : num [1:209] 7380272 1594351 24061749 61118 73335 ...
##   ..$ 2009   : num [1:209] 7679982 1635262 24799591 62357 73897 ...
##   ..$ 2010   : num [1:209] 7990746 1676545 25545622 63616 74525 ...
##   ..$ 2011   : num [1:209] 8316976 1716842 26216968 64817 75207 ...

readxl (2)

The col_names argument

Apart from path and sheet, there are several other arguments you can specify in read_excel(). One of these arguments is called col_names.

By default it is TRUE, denoting whether the first row in the Excel sheets contains the column names. If this is not the case, you can set col_names to FALSE. In this case, R will choose column names for you. You can also choose to set col_names to a character vector with names for each column. It works exactly the same as in the readr package.

You’ll be working with the urbanpop_nonames.xlsx file. It contains the same data as urbanpop.xlsx but has no column names in the first row of the excel sheets.

EXERCISES:

Import the first Excel sheet of “urbanpop_nonames.xlsx” and store the result in pop_a. Have R set the column names of the resulting data frame itself.

# Import the first Excel sheet of urbanpop_nonames.xlsx (R gives names): pop_a
pop_a <- read_excel("./Data/urbanpop_nonames.xlsx", col_names = FALSE)
## New names:
## * `` -> ...1
## * `` -> ...2
## * `` -> ...3
## * `` -> ...4
## * `` -> ...5
## * ...

Import the first Excel sheet of urbanpop_nonames.xlsx; this time, use the cols vector that has already been prepared for you to specify the column names. Store the resulting data frame in pop_b.

# Import the first Excel sheet of urbanpop_nonames.xlsx (specify col_names): pop_b
cols <- c("country", paste0("year_", 1960:1966))
pop_b <- read_excel("./Data/urbanpop_nonames.xlsx", col_names = cols)

Print out the summary of pop_a.

# Print the summary of pop_a

summary(pop_a)
##      ...1                ...2                ...3                ...4          
##  Length:209         Min.   :     3378   Min.   :     1028   Min.   :     1090  
##  Class :character   1st Qu.:    88978   1st Qu.:    70644   1st Qu.:    74974  
##  Mode  :character   Median :   580675   Median :   570159   Median :   593968  
##                     Mean   :  4988124   Mean   :  4991613   Mean   :  5141592  
##                     3rd Qu.:  3077228   3rd Qu.:  2807280   3rd Qu.:  2948396  
##                     Max.   :126469700   Max.   :129268133   Max.   :131974143  
##                     NA's   :11                                                 
##       ...5                ...6                ...7          
##  Min.   :     1154   Min.   :     1218   Min.   :     1281  
##  1st Qu.:    81870   1st Qu.:    84953   1st Qu.:    88633  
##  Median :   619331   Median :   645262   Median :   679109  
##  Mean   :  5303711   Mean   :  5468966   Mean   :  5637394  
##  3rd Qu.:  3148941   3rd Qu.:  3296444   3rd Qu.:  3317422  
##  Max.   :134599886   Max.   :137205240   Max.   :139663053  
##                                                             
##       ...8          
##  Min.   :     1349  
##  1st Qu.:    93638  
##  Median :   735139  
##  Mean   :  5790281  
##  3rd Qu.:  3418036  
##  Max.   :141962708  
## 

Print out the summary of pop_b. Can you spot the difference with the other summary?

# Print the summary of pop_b
summary(pop_b)
##    country            year_1960           year_1961           year_1962        
##  Length:209         Min.   :     3378   Min.   :     1028   Min.   :     1090  
##  Class :character   1st Qu.:    88978   1st Qu.:    70644   1st Qu.:    74974  
##  Mode  :character   Median :   580675   Median :   570159   Median :   593968  
##                     Mean   :  4988124   Mean   :  4991613   Mean   :  5141592  
##                     3rd Qu.:  3077228   3rd Qu.:  2807280   3rd Qu.:  2948396  
##                     Max.   :126469700   Max.   :129268133   Max.   :131974143  
##                     NA's   :11                                                 
##    year_1963           year_1964           year_1965        
##  Min.   :     1154   Min.   :     1218   Min.   :     1281  
##  1st Qu.:    81870   1st Qu.:    84953   1st Qu.:    88633  
##  Median :   619331   Median :   645262   Median :   679109  
##  Mean   :  5303711   Mean   :  5468966   Mean   :  5637394  
##  3rd Qu.:  3148941   3rd Qu.:  3296444   3rd Qu.:  3317422  
##  Max.   :134599886   Max.   :137205240   Max.   :139663053  
##                                                             
##    year_1966        
##  Min.   :     1349  
##  1st Qu.:    93638  
##  Median :   735139  
##  Mean   :  5790281  
##  3rd Qu.:  3418036  
##  Max.   :141962708  
## 

The skip argument

Another argument that can be very useful when reading in Excel files that are less tidy, is skip. With skip, you can tell R to ignore a specified number of rows inside the Excel sheets you’re trying to pull data from. Have a look at this example:

read_excel("data.xlsx", skip = 15)

In this case, the first 15 rows in the first sheet of "data.xlsx" are ignored.

If the first row of this sheet contained the column names, this information will also be ignored by readxl. Make sure to set col_names to FALSE or manually specify column names in this case!

The file urbanpop.xlsx is available in your directory; it has column names in the first row of each sheet.

EXERCISES:

Import the second sheet of “urbanpop.xlsx”, but skip the first 21 rows. Make sure to set col_names = FALSE. Store the resulting data frame in a variable urbanpop_sel.

# The readxl package is already loaded

# Import the second sheet of urbanpop.xlsx, skipping the first 21 rows: urbanpop_sel

urbanpop_sel <- read_excel("./Data/urbanpop.xlsx", col_names = FALSE, skip = 21, sheet = 2)
## New names:
## * `` -> ...1
## * `` -> ...2
## * `` -> ...3
## * `` -> ...4
## * `` -> ...5
## * ...

Select the first observation from urbanpop_sel and print it out.

# Print out the first observation from urbanpop_sel

urbanpop_sel[1,]
## # A tibble: 1 x 9
##   ...1     ...2    ...3    ...4    ...5    ...6    ...7    ...8    ...9
##   <chr>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 Benin 382022. 411859. 443013. 475611. 515820. 557938. 602093. 648410.

gdata

Import a local file

In this part of the chapter you'll learn how to import .xls files using the gdata package. Similar to the readxl package, you can import single Excel sheets from Excel workbooks to start your analysis in R.

You’ll be working with the urbanpop.xls dataset, the .xls version of the Excel file you’ve been working with before. It’s available in your current working directory.

EXERCISES.

Load the gdata package with library(). gdata and Perl are already installed on DataCamp’s Servers.

library(gdata)
## Warning: package 'gdata' was built under R version 4.0.3
## gdata: Unable to locate valid perl interpreter
## gdata: 
## gdata: read.xls() will be unable to read Excel XLS and XLSX files
## gdata: unless the 'perl=' argument is used to specify the location of a
## gdata: valid perl intrpreter.
## gdata: 
## gdata: (To avoid display of this message in the future, please ensure
## gdata: perl is installed and available on the executable search path.)
## gdata: Unable to load perl libaries needed by read.xls()
## gdata: to support 'XLX' (Excel 97-2004) files.
## 
## gdata: Unable to load perl libaries needed by read.xls()
## gdata: to support 'XLSX' (Excel 2007+) files.
## 
## gdata: Run the function 'installXLSXsupport()'
## gdata: to automatically download and install the perl
## gdata: libaries needed to support Excel XLS and XLSX formats.
## 
## Attaching package: 'gdata'
## The following objects are masked from 'package:data.table':
## 
##     first, last
## The following objects are masked from 'package:dplyr':
## 
##     combine, first, last
## The following object is masked from 'package:stats':
## 
##     nobs
## The following object is masked from 'package:utils':
## 
##     object.size
## The following object is masked from 'package:base':
## 
##     startsWith

Import the second sheet, named “1967-1974”, of “urbanpop.xls” with read.xls(). Store the resulting data frame as urban_pop.

# Import the second sheet of urbanpop.xls: urban_pop
#urban_pop <- read.xls("./Data/urbanpop.xls", sheet = 2)
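The call is commented out because, as the warnings above show, no Perl interpreter was found, and read.xls() needs Perl to parse .xls files. If Perl is installed, you can point read.xls() at it via the perl argument (the path below is only a placeholder):

# Import the second sheet of urbanpop.xls, telling gdata where Perl lives
# (replace the perl path with the location of the interpreter on your system)
# urban_pop <- read.xls("./Data/urbanpop.xls", sheet = "1967-1974",
#                       perl = "C:/Strawberry/perl/bin/perl.exe")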

Reproducible Excel work with XLConnect

Reading sheets

Connect to a workbook

When working with XLConnect, the first step will be to load a workbook in your R session with loadWorkbook(); this function will build a “bridge” between your Excel file and your R session.

In this and the following exercises, you will continue to work with urbanpop.xlsx, containing urban population data throughout time. The Excel file is available in your current working directory.

EXERCISES:

Load the XLConnect package using library(); it is already installed on DataCamp’s servers. Use loadWorkbook() to build a connection to the “urbanpop.xlsx” file in R. Call the workbook my_book.

# urbanpop.xlsx is available in your working directory

Print out the class of my_book. What does this tell you? (These notes load openxlsx rather than XLConnect; openxlsx provides its own loadWorkbook(), which is why the class printed below comes from the openxlsx package.)

library(openxlsx)
## Warning: package 'openxlsx' was built under R version 4.0.3
# Build connection to urbanpop.xlsx: my_book
my_book  <- loadWorkbook("./Data/urbanpop.xlsx")

# Print out the class of my_book
class(my_book)
## [1] "Workbook"
## attr(,"package")
## [1] "openxlsx"

List and read Excel sheets

Just as with readxl and gdata, you can use XLConnect to import data from an Excel file into R.

To list the sheets in an Excel file, use getSheets(). To actually import data from a sheet, you can use readWorksheet(). Both functions require an XLConnect workbook object as the first argument.

You’ll again be working with urbanpop.xlsx. The my_book object that links to this Excel file has already been created.

EXERCISES.

Print out the sheets of the Excel file that my_book links to.

# Build connection to urbanpop.xlsx
my_book <- loadWorkbook("./Data/urbanpop.xlsx")

Import the second sheet in my_book as a data frame. Print it out.

# List the sheets in my_book (with openxlsx, getSheetNames() takes the file path)
getSheetNames("./Data/urbanpop.xlsx")
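The second half of the exercise, importing a sheet as a data frame, maps to readWorkbook() in openxlsx (the course's intended XLConnect calls are getSheets(my_book) and readWorksheet(my_book, sheet = 2)); a minimal sketch:

# Import the second sheet of urbanpop.xlsx as a data frame and print the first rows
pop_sheet2 <- readWorkbook("./Data/urbanpop.xlsx", sheet = 2)
head(pop_sheet2)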

Adapting sheets

Add worksheet

Where readxl and gdata were only able to import Excel data, XLConnect’s approach of providing an actual interface to an Excel file makes it able to edit your Excel files from inside R. In this exercise, you’ll create a new sheet. In the next exercise, you’ll populate the sheet with data, and save the results in a new Excel file.

You’ll continue to work with urbanpop.xlsx. The my_book object that links to this Excel file is already available.

EXERCISES:

Use createSheet(), to create a new sheet in my_book, named “data_summary”.

# XLConnect is already available

# Build connection to urbanpop.xlsx
my_book <- loadWorkbook("./Data/urbanpop.xlsx")

Use getSheets() to verify that my_book now represents an Excel file with four sheets.
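createSheet() and getSheets() are XLConnect functions; since these notes use openxlsx, the closest equivalents are addWorksheet() and names(). A sketch of what the exercise asks for, assuming the openxlsx Workbook object my_book built above:

# Create a new sheet named "data_summary" in my_book
addWorksheet(my_book, sheetName = "data_summary")

# Verify that my_book now holds four sheets
names(my_book)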

9.- Intermediate Importing Data in R

Importing data from databases (Part 1)

Connect to a database

Establish a connection

The first step to import data from a SQL database is creating a connection to it. As Filip explained, you need different packages depending on the database you want to connect to. All of these packages do this in a uniform way, as specified in the DBI package.

dbConnect() creates a connection between your R session and a SQL database. The first argument has to be a DBIdriver object, that specifies how connections are made and how data is mapped between R and the database. Specifically for MySQL databases, you can build such a driver with RMySQL::MySQL().

If the MySQL database is a remote database hosted on a server, you’ll also have to specify the following arguments in dbConnect(): dbname, host, port, user and password. Most of these details have already been provided.

TO DO:

Load the DBI library, which is already installed on DataCamp’s servers. Edit the dbConnect() call to connect to the MySQL database. Change the port argument (3306) and user argument (“student”).

# Load the DBI package
library(RMySQL)
library(DBI)
# Edit dbConnect() call
con <- dbConnect(RMySQL::MySQL(), 
                 dbname = "tweater", 
                 host = "courses.csrrinzqubik.us-east-1.rds.amazonaws.com", 
                 port = 3306,
                 user = "student",
                 password = "datacamp")

Import table data

  1. Import table data: after successfully connecting to a database through a connection object like con, you’ll probably want to see what’s in there.

  2. List and import tables: the first step is listing all the tables in the database. You can do this with the dbListTables() function; simply pass it the con variable. As expected, you get a character vector of length three, corresponding to the table names introduced earlier.

  3. Next, you can read the data from one of these tables, for example the employees table, with the dbReadTable() function. Again, you specify the connection to use, con, but this time you also specify which table to import. The result is a data frame with exactly the same contents as the original database table. DBI also specifies functions to create new tables, store new data in tables and remove tables, but that isn’t really related to importing data, so it isn’t covered here. One last thing: it’s always polite to explicitly disconnect from your database when you’re done. You do this with dbDisconnect(). If you then try to print con, you’ll see that the connection is no longer available.

  4. Let’s practice: off to the exercises now!
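
As a minimal sketch of the workflow described above (assuming a connection object con and an employees table, as in the narration):

library(DBI)

# List all tables in the database
dbListTables(con)

# Import the employees table as a data frame
employees <- dbReadTable(con, "employees")

# Politely close the connection when you're done
dbDisconnect(con)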

List the database tables

After you’ve successfully connected to a remote MySQL database, the next step is to see what tables the database contains. You can do this with the dbListTables() function. As you might remember from the video, this function requires the connection object as an input, and outputs a character vector with the table names.

TODO:

Add code to create a vector, tables, that contains the names of the tables in the tweater database. You can connect to this database through the con object. Display the structure of tables; what’s the class of this vector?

# Load the DBI package
library(DBI)

# Connect to the MySQL database: con
con <- dbConnect(RMySQL::MySQL(), 
                 dbname = "tweater", 
                 host = "courses.csrrinzqubik.us-east-1.rds.amazonaws.com", 
                 port = 3306,
                 user = "student",
                 password = "datacamp")

# Build a vector of table names: tables

tables <- dbListTables(con)
# Display structure of tables
str(tables)
##  chr [1:3] "comments" "tweats" "users"

Import users

As you might have guessed by now, the database contains data on a more tasty version of Twitter, namely Tweater. Users can post tweats with short recipes for delicious snacks. People can comment on these tweats. There are three tables: users, tweats, and comments that have relations among them. Which ones, you ask? You’ll discover in a moment!

Let’s start by importing the data on the users into your R session. You do this with the dbReadTable() function. Simply pass it the connection object (con), followed by the name of the table you want to import. The resulting object is a standard R data frame.

TODO:

Add code that imports the “users” table from the tweater database and store the resulting data frame as users. Print the users data frame.

# Load the DBI package
library(DBI)

# Connect to the MySQL database: con
con <- dbConnect(RMySQL::MySQL(), 
                 dbname = "tweater", 
                 host = "courses.csrrinzqubik.us-east-1.rds.amazonaws.com", 
                 port = 3306,
                 user = "student",
                 password = "datacamp")

# Import the users table from tweater: users
users <- dbReadTable(con, "users")

# Print users
users
##   id      name     login
## 1  1 elisabeth  elismith
## 2  2      mike     mikey
## 3  3      thea   teatime
## 4  4    thomas tomatotom
## 5  5    oliver olivander
## 6  6      kate  katebenn
## 7  7    anjali    lianja

Import all tables

Besides the users, we’re also interested in the tweats and comments tables. However, separate dbReadTable() calls for each and every one of the tables in your database would mean a lot of code duplication. Remember the lapply() function? You can use it again here! A connection is already coded for you, as well as a vector table_names, containing the names of all the tables in the database.

TODO:

Finish the lapply() function to import the users, tweats and comments tables in a single call. The result, a list of data frames, will be stored in the variable tables. Print tables to check if you got it right.
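
The lapply() call itself was not included above; a minimal sketch of the intended solution, assuming con and table_names are already available:

# Import all tables with a single lapply() call: tables
tables <- lapply(table_names, dbReadTable, conn = con)

# Print tables
tables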

## [[1]]
##      id tweat_id user_id            message
## 1  1022       87       7              nice!
## 2  1000       77       7             great!
## 3  1011       49       5            love it
## 4  1012       87       1   awesome! thanks!
## 5  1010       88       6              yuck!
## 6  1026       77       4      not my thing!
## 7  1004       49       1  this is fabulous!
## 8  1030       75       6           so easy!
## 9  1025       88       2             oh yes
## 10 1007       49       3           serious?
## 11 1020       77       1 couldn't be better
## 12 1014       77       1       saved my day
## 
## [[2]]
##   id user_id
## 1 75       3
## 2 88       4
## 3 77       6
## 4 87       5
## 5 49       1
## 6 24       7
##                                                                  post
## 1                                       break egg. bake egg. eat egg.
## 2                           wash strawberries. add ice. blend. enjoy.
## 3                       2 slices of bread. add cheese. grill. heaven.
## 4               open and crush avocado. add shrimps. perfect starter.
## 5 nachos. add tomato sauce, minced meat and cheese. oven for 10 mins.
## 6                              just eat an apple. simply and healthy.
##         date
## 1 2015-09-05
## 2 2015-09-14
## 3 2015-09-21
## 4 2015-09-22
## 5 2015-09-22
## 6 2015-09-24
## 
## [[3]]
##   id      name     login
## 1  1 elisabeth  elismith
## 2  2      mike     mikey
## 3  3      thea   teatime
## 4  4    thomas tomatotom
## 5  5    oliver olivander
## 6  6      kate  katebenn
## 7  7    anjali    lianja

Importing data from databases (Part 2)

SQL Queries from inside R

Query tweater (1)

In your life as a data scientist, you’ll often be working with huge databases that contain tables with millions of rows. If you want to do some analyses on this data, it’s possible that you only need a fraction of this data. In this case, it’s a good idea to send SQL queries to your database, and only import the data you actually need into R.

dbGetQuery() is what you need. As usual, you first pass the connection object to it. The second argument is an SQL query in the form of a character string. This example selects the age variable from the people dataset where gender equals “male”:

dbGetQuery(con, "SELECT age FROM people WHERE gender = 'male'")

A connection to the tweater database has already been coded for you.

TODO:

Use dbGetQuery() to create a data frame, elisabeth, that selects the tweat_id column from the comments table where elisabeth is the commenter (her user_id is 1). Print out elisabeth so you can see if you queried the database correctly.

# Connect to the database
library(DBI)
con <- dbConnect(RMySQL::MySQL(),
                 dbname = "tweater",
                 host = "courses.csrrinzqubik.us-east-1.rds.amazonaws.com",
                 port = 3306,
                 user = "student",
                 password = "datacamp")

# Import tweat_id column of comments where user_id is 1: elisabeth

elisabeth <- dbGetQuery(con, "SELECT tweat_id FROM comments WHERE user_id = 1")
# Print elisabeth
elisabeth
##   tweat_id
## 1       87
## 2       49
## 3       77
## 4       77

Query tweater (2)

Apart from checking equality, you can also check for less than and greater than relationships, with < and >, just like in R.

con, a connection to the tweater database, is again available.

TODO:

Create a data frame, latest, that selects the post column from the tweats table, keeping only the observations where the date is later than ‘2015-09-21’. Print out latest.

# Connect to the database
library(DBI)
con <- dbConnect(RMySQL::MySQL(),
                 dbname = "tweater",
                 host = "courses.csrrinzqubik.us-east-1.rds.amazonaws.com",
                 port = 3306,
                 user = "student",
                 password = "datacamp")

# Import post column of tweats where date is higher than '2015-09-21': latest

latest <- dbGetQuery(con, "SELECT post FROM tweats WHERE date > '2015-09-21'") 
# Print latest
latest
##                                                                  post
## 1               open and crush avocado. add shrimps. perfect starter.
## 2 nachos. add tomato sauce, minced meat and cheese. oven for 10 mins.
## 3                              just eat an apple. simply and healthy.

Query tweater (3)

Suppose that you have a people table, with a bunch of information. This time, you want to find out the age and country of married males. Provided that there is a married column that’s 1 when the person in question is married, the following query would work.

SELECT age, country FROM people WHERE gender = "male" AND married = 1

Can you use a similar approach for a more specialized query on the tweater database?

TODO:

Create an R data frame, specific, that selects the message column from the comments table where the tweat_id is 77 and the user_id is greater than 4. Print specific.

# Connect to the database
library(DBI)
con <- dbConnect(RMySQL::MySQL(),
                 dbname = "tweater",
                 host = "courses.csrrinzqubik.us-east-1.rds.amazonaws.com",
                 port = 3306,
                 user = "student",
                 password = "datacamp")

# Create data frame specific
specific <- dbGetQuery(con, "SELECT message FROM comments WHERE tweat_id = 77 AND user_id > 4") 
# Print specific
specific
##   message
## 1  great!

Query tweater (4)

There are also dedicated SQL functions that you can use in the WHERE clause of an SQL query. For example, CHAR_LENGTH() returns the number of characters in a string.

TODO:

Create a data frame, short, that selects the id and name columns from the users table where the number of characters in the name is strictly less than 5. Print short.

# Connect to the database
library(DBI)
con <- dbConnect(RMySQL::MySQL(),
                 dbname = "tweater",
                 host = "courses.csrrinzqubik.us-east-1.rds.amazonaws.com",
                 port = 3306,
                 user = "student",
                 password = "datacamp")

# Create data frame short

short <- dbGetQuery(con, "SELECT id, name FROM users WHERE CHAR_LENGTH(name) < 5") 
# Print short
short
##   id name
## 1  2 mike
## 2  3 thea
## 3  6 kate

DBI internals

Send - Fetch - Clear

You’ve used dbGetQuery() multiple times now. This is a virtual function from the DBI package, but is actually implemented by the RMySQL package. Behind the scenes, the following steps are performed:

Sending the specified query with dbSendQuery();
Fetching the result of executing the query on the database with dbFetch();
Clearing the result with dbClearResult().

Let’s not use dbGetQuery() this time, and instead implement the steps above ourselves. This is more tedious to write, but it gives you the ability to fetch the query’s result in chunks rather than all at once. You can do this by specifying the n argument inside dbFetch().

TODO:

Inspect the dbSendQuery() call that has already been coded for you. It selects the comments for the users with an id above 4. Use dbFetch() twice. In the first call, import only two records of the query result by setting the n argument to 2. In the second call, import all remaining records (don’t specify n). In both calls, simply print the resulting data frames. Clear res with dbClearResult().

library(DBI)
con <- dbConnect(RMySQL::MySQL(),
                 dbname = "tweater",
                 host = "courses.csrrinzqubik.us-east-1.rds.amazonaws.com",
                 port = 3306,
                 user = "student",
                 password = "datacamp")

# Send query to the database
res <- dbSendQuery(con, "SELECT * FROM comments WHERE user_id > 4")

# Use dbFetch() twice
dbFetch(res, n = 2)
##     id tweat_id user_id message
## 1 1022       87       7   nice!
## 2 1000       77       7  great!
dbFetch(res)
##     id tweat_id user_id  message
## 1 1011       49       5  love it
## 2 1010       88       6    yuck!
## 3 1030       75       6 so easy!
# Clear res
dbClearResult(res)
## [1] TRUE

Be polite and …

Every time you connect to a database using dbConnect(), you’re creating a new connection to the database you’re referencing. RMySQL automatically limits the number of open connections and closes some of them for you, but still: it’s always polite to manually disconnect from the database afterwards. You do this with the dbDisconnect() function.

The code that connects you to the database is already available, can you finish the script?

TODO:

Using the technique you prefer, build a data frame long_tweats. It selects the post and date columns from the observations in tweats where the character length of the post variable exceeds 40. Print long_tweats. Disconnect from the database by using dbDisconnect().

# Connect to the database
library(DBI)
con <- dbConnect(RMySQL::MySQL(),
                 dbname = "tweater",
                 host = "courses.csrrinzqubik.us-east-1.rds.amazonaws.com",
                 port = 3306,
                 user = "student",
                 password = "datacamp")

# Create the data frame  long_tweats
long_tweats <- dbGetQuery(con, "SELECT post, date FROM tweats WHERE CHAR_LENGTH(post) > 40")

# Print long_tweats
print(long_tweats)
##                                                                  post
## 1                           wash strawberries. add ice. blend. enjoy.
## 2                       2 slices of bread. add cheese. grill. heaven.
## 3               open and crush avocado. add shrimps. perfect starter.
## 4 nachos. add tomato sauce, minced meat and cheese. oven for 10 mins.
##         date
## 1 2015-09-14
## 2 2015-09-21
## 3 2015-09-22
## 4 2015-09-22
# Disconnect from the database
dbDisconnect(con)
## [1] TRUE

HTTP

Import flat files from the web

In the video, you saw that the utils functions to import flat file data, such as read.csv() and read.delim(), are capable of automatically importing from URLs that point to flat files on the web.

You must be wondering whether Hadley Wickham’s alternative package, readr, is equally potent. Well, figure it out in this exercise! The URLs for both a .csv file as well as a .delim file are already coded for you. It’s up to you to actually import the data. If it works, that is…

TODO:

Load the readr package. It’s already installed on DataCamp’s servers. Use url_csv to read in the .csv file it is pointing to. Use the read_csv() function. The .csv contains column names in the first row. Save the resulting data frame as pools. Similarly, use url_delim to read in the online .txt file. Use the read_tsv() function and store the result as potatoes. Print pools and potatoes. Looks correct?

# Load the readr package
library(readr)

# Import the csv file: pools
url_csv <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/swimming_pools.csv"
pools <- read_csv(url_csv)

# Import the txt file: potatoes
url_delim <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/potatoes.txt"
potatoes <- read_tsv(url_delim)

# Print pools and potatoes
pools
## # A tibble: 20 x 4
##    Name                         Address                       Latitude Longitude
##    <chr>                        <chr>                            <dbl>     <dbl>
##  1 Acacia Ridge Leisure Centre  1391 Beaudesert Road, Acacia~    -27.6      153.
##  2 Bellbowrie Pool              Sugarwood Street, Bellbowrie     -27.6      153.
##  3 Carole Park                  Cnr Boundary Road and Waterf~    -27.6      153.
##  4 Centenary Pool (inner City)  400 Gregory Terrace, Spring ~    -27.5      153.
##  5 Chermside Pool               375 Hamilton Road, Chermside     -27.4      153.
##  6 Colmslie Pool (Morningside)  400 Lytton Road, Morningside     -27.5      153.
##  7 Spring Hill Baths (inner Ci~ 14 Torrington Street, Spring~    -27.5      153.
##  8 Dunlop Park Pool (Corinda)   794 Oxley Road, Corinda          -27.5      153.
##  9 Fortitude Valley Pool        432 Wickham Street, Fortitud~    -27.5      153.
## 10 Hibiscus Sports Complex (up~ 90 Klumpp Road, Upper Mount ~    -27.6      153.
## 11 Ithaca Pool ( Paddington)    131 Caxton Street, Paddington    -27.5      153.
## 12 Jindalee Pool                11 Yallambee Road, Jindalee      -27.5      153.
## 13 Manly Pool                   1 Fairlead Crescent, Manly       -27.5      153.
## 14 Mt Gravatt East Aquatic Cen~ Cnr wecker Road and Newnham ~    -27.5      153.
## 15 Musgrave Park Pool (South B~ 100 Edmonstone Street, South~    -27.5      153.
## 16 Newmarket Pool               71 Alderson Stret, Newmarket     -27.4      153.
## 17 Runcorn Pool                 37 Bonemill Road, Runcorn        -27.6      153.
## 18 Sandgate Pool                231 Flinders Parade, Sandgate    -27.3      153.
## 19 Langlands Parks Pool (Stone~ 5 Panitya Street, Stones Cor~    -27.5      153.
## 20 Yeronga Park Pool            81 School Road, Yeronga          -27.5      153.
potatoes
## # A tibble: 160 x 8
##     area  temp  size storage method texture flavor moistness
##    <dbl> <dbl> <dbl>   <dbl>  <dbl>   <dbl>  <dbl>     <dbl>
##  1     1     1     1       1      1     2.9    3.2       3  
##  2     1     1     1       1      2     2.3    2.5       2.6
##  3     1     1     1       1      3     2.5    2.8       2.8
##  4     1     1     1       1      4     2.1    2.9       2.4
##  5     1     1     1       1      5     1.9    2.8       2.2
##  6     1     1     1       2      1     1.8    3         1.7
##  7     1     1     1       2      2     2.6    3.1       2.4
##  8     1     1     1       2      3     3      3         2.9
##  9     1     1     1       2      4     2.2    3.2       2.5
## 10     1     1     1       2      5     2      2.8       1.9
## # ... with 150 more rows

Secure importing

In the previous exercises, you have been working with URLs that all start with http://. There is, however, a safer alternative to HTTP, namely HTTPS, which stands for HyperText Transfer Protocol Secure. Just remember this: HTTPS is relatively safe, HTTP is not.

Luckily for us, you can use the standard importing functions with https:// connections since R version 3.2.2.

TODO:

Take a look at the URL in url_csv. It uses a secure connection, https://. Use read.csv() to import the file at url_csv. The .csv file it is referring to contains column names in the first row. Call it pools1. Load the readr package. It’s already installed on DataCamp’s servers. Use read_csv() to read in the same .csv file in url_csv. Call it pools2. Print out the structure of pools1 and pools2. Looks like the importing went equally well as with a normal http connection!

# https URL to the swimming_pools csv file.
url_csv <- "https://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/swimming_pools.csv"

# Import the file using read.csv(): pools1
pools1 <- read.csv(url_csv)

# Load the readr package
library(readr)

# Import the file using read_csv(): pools2
pools2 <- read_csv(url_csv)

# Print the structure of pools1 and pools2
str(pools1)
## 'data.frame':    20 obs. of  4 variables:
##  $ Name     : chr  "Acacia Ridge Leisure Centre" "Bellbowrie Pool" "Carole Park" "Centenary Pool (inner City)" ...
##  $ Address  : chr  "1391 Beaudesert Road, Acacia Ridge" "Sugarwood Street, Bellbowrie" "Cnr Boundary Road and Waterford Road Wacol" "400 Gregory Terrace, Spring Hill" ...
##  $ Latitude : num  -27.6 -27.6 -27.6 -27.5 -27.4 ...
##  $ Longitude: num  153 153 153 153 153 ...
str(pools2)
## tibble [20 x 4] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Name     : chr [1:20] "Acacia Ridge Leisure Centre" "Bellbowrie Pool" "Carole Park" "Centenary Pool (inner City)" ...
##  $ Address  : chr [1:20] "1391 Beaudesert Road, Acacia Ridge" "Sugarwood Street, Bellbowrie" "Cnr Boundary Road and Waterford Road Wacol" "400 Gregory Terrace, Spring Hill" ...
##  $ Latitude : num [1:20] -27.6 -27.6 -27.6 -27.5 -27.4 ...
##  $ Longitude: num [1:20] 153 153 153 153 153 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Name = col_character(),
##   ..   Address = col_character(),
##   ..   Latitude = col_double(),
##   ..   Longitude = col_double()
##   .. )

Downloading files

Import Excel files from the web

When you learned about gdata, it was already mentioned that gdata can handle .xls files that are on the internet. readxl can’t, at least not yet. The URL with which you’ll be working is already available in the sample code. You will import it once using gdata and once with the readxl package via a workaround.

TO DO :

Load the readxl and gdata packages. They are already installed on DataCamp’s servers. Import the .xls file located at the URL url_xls using read.xls() from gdata. Store the resulting data frame as excel_gdata. You cannot use read_excel() directly with a URL. Complete the following instructions to work around this problem: use download.file() to download the .xls file behind the URL and store it locally as “local_latitude.xls”, then call read_excel() to import the local file, “local_latitude.xls”. Name the resulting data frame excel_readxl.
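
No solution code was included for this exercise; here is a minimal sketch, assuming url_xls has already been defined in the sample code (the actual URL is not reproduced here):

# Load the readxl and gdata packages
library(readxl)
library(gdata)

# Import the .xls file with gdata: excel_gdata
excel_gdata <- read.xls(url_xls)

# Download the .xls file behind the URL and store it locally
download.file(url_xls, destfile = "local_latitude.xls")

# Import the local .xls file with readxl: excel_readxl
excel_readxl <- read_excel("local_latitude.xls")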

Downloading any file, secure or not

In the previous exercise you’ve seen how you can read Excel files on the web using the readxl package by first downloading the file with the download.file() function.

There’s more: with download.file() you can download any kind of file from the web, using HTTP and HTTPS: images, executable files, but also .RData files. An RData file is a very efficient format for storing R data.

You can load data from an RData file using the load() function, but this function does not accept a URL string as an argument. In this exercise, you’ll first download the RData file securely, and then import the local data file.

TO DO :

Take a look at the URL in url_rdata. It uses a secure connection, https://. This URL points to an RData file containing a data frame with some metrics on different kinds of wine. Download the file at url_rdata using download.file(). Call the file “wine_local.RData” in your working directory. Load the file you created, wine_local.RData, using the load() function. It takes one argument, the path to the file, which is just the filename in our case. After running this command, the variable wine will automatically be available in your workspace. Print out the summary() of the wine dataset.

# https URL to the wine RData file.
url_rdata <- "https://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/wine.RData"

# Download the wine file to your working directory
download.file(url_rdata, destfile = "wine_local.RData")

# Load the wine data into your workspace using load()
load("wine_local.RData")

# Print out the summary of the wine data
summary(wine)
##     Alcohol        Malic acid        Ash        Alcalinity of ash
##  Min.   :11.03   Min.   :0.74   Min.   :1.360   Min.   :10.60    
##  1st Qu.:12.36   1st Qu.:1.60   1st Qu.:2.210   1st Qu.:17.20    
##  Median :13.05   Median :1.87   Median :2.360   Median :19.50    
##  Mean   :12.99   Mean   :2.34   Mean   :2.366   Mean   :19.52    
##  3rd Qu.:13.67   3rd Qu.:3.10   3rd Qu.:2.560   3rd Qu.:21.50    
##  Max.   :14.83   Max.   :5.80   Max.   :3.230   Max.   :30.00    
##    Magnesium      Total phenols     Flavanoids    Nonflavanoid phenols
##  Min.   : 70.00   Min.   :0.980   Min.   :0.340   Min.   :0.1300      
##  1st Qu.: 88.00   1st Qu.:1.740   1st Qu.:1.200   1st Qu.:0.2700      
##  Median : 98.00   Median :2.350   Median :2.130   Median :0.3400      
##  Mean   : 99.59   Mean   :2.292   Mean   :2.023   Mean   :0.3623      
##  3rd Qu.:107.00   3rd Qu.:2.800   3rd Qu.:2.860   3rd Qu.:0.4400      
##  Max.   :162.00   Max.   :3.880   Max.   :5.080   Max.   :0.6600      
##  Proanthocyanins Color intensity       Hue           Proline      
##  Min.   :0.410   Min.   : 1.280   Min.   :1.270   Min.   : 278.0  
##  1st Qu.:1.250   1st Qu.: 3.210   1st Qu.:1.930   1st Qu.: 500.0  
##  Median :1.550   Median : 4.680   Median :2.780   Median : 672.0  
##  Mean   :1.587   Mean   : 5.055   Mean   :2.604   Mean   : 745.1  
##  3rd Qu.:1.950   3rd Qu.: 6.200   3rd Qu.:3.170   3rd Qu.: 985.0  
##  Max.   :3.580   Max.   :13.000   Max.   :4.000   Max.   :1680.0

HTTP? httr! (1)

Downloading a file from the Internet means sending a GET request and receiving the file you asked for. Internally, all the previously discussed functions use a GET request to download files.

httr provides a convenient function, GET(), to execute this GET request. The result is a response object that provides easy access to the status code, content-type and, of course, the actual content.

You can extract the content from the request using the content() function. At the time of writing, there are three ways to retrieve this content: as a raw object, as a character vector, or as an R object, such as a list. If you don’t tell content() how to retrieve the content through the as argument, it’ll try its best to figure out which type is most appropriate based on the content-type.

TO DO:

Load the httr package. It’s already installed on DataCamp’s servers. Use GET() to get the URL stored in url. Store the result of this GET() call as resp. Print the resp object. What information does it contain? Get the content of resp using content() and set the as argument to “raw”. Assign the resulting vector to raw_content. Print the first values in raw_content with head().

# Load the httr package
library(httr)

# Get the url, save response to resp
url <- "http://www.example.com/"
resp <- GET(url)

# Print resp
resp
## Response [http://www.example.com/]
##   Date: 2021-01-31 10:47
##   Status: 200
##   Content-Type: text/html; charset=UTF-8
##   Size: 1.26 kB
## <!doctype html>
## <html>
## <head>
##     <title>Example Domain</title>
## 
##     <meta charset="utf-8" />
##     <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
##     <meta name="viewport" content="width=device-width, initial-scale=1" />
##     <style type="text/css">
##     body {
## ...
# Get the raw content of resp: raw_content
raw_content <- content(resp, as = "raw")

# Print the head of raw_content
head(raw_content)
## [1] 3c 21 64 6f 63 74

HTTP? httr! (2)

Web content does not limit itself to HTML pages and files stored on remote servers such as DataCamp’s Amazon S3 instances. There are many other data formats out there. A very common one is JSON. This format is very often used by so-called Web APIs, interfaces to web servers with which you as a client can communicate to get or store information in more complicated ways.

You’ll learn about Web APIs and JSON in the video and exercises that follow, but some experimentation never hurts, does it?

TO DO:

Use GET() to get the url that has already been specified in the sample code. Store the response as resp. Print resp. What is the content-type? Use content() to get the content of resp. Set the as argument to “text”. Simply print out the result. What do you see? Use content() to get the content of resp, but this time do not specify a second argument. R figures out automatically that you’re dealing with a JSON, and converts the JSON to a named R list.

# httr is already loaded

# Get the url
url <- "http://www.omdbapi.com/?apikey=72bc447a&t=Annie+Hall&y=&plot=short&r=json"
resp <- GET(url)

# Print resp
resp
## Response [http://www.omdbapi.com/?apikey=72bc447a&t=Annie+Hall&y=&plot=short&r=json]
##   Date: 2021-01-31 10:47
##   Status: 200
##   Content-Type: application/json; charset=utf-8
##   Size: 940 B
# Print content of resp as text
content(resp, as = "text")
## [1] "{\"Title\":\"Annie Hall\",\"Year\":\"1977\",\"Rated\":\"PG\",\"Released\":\"20 Apr 1977\",\"Runtime\":\"93 min\",\"Genre\":\"Comedy, Romance\",\"Director\":\"Woody Allen\",\"Writer\":\"Woody Allen, Marshall Brickman\",\"Actors\":\"Woody Allen, Diane Keaton, Tony Roberts, Carol Kane\",\"Plot\":\"Neurotic New York comedian Alvy Singer falls in love with the ditzy Annie Hall.\",\"Language\":\"English, German\",\"Country\":\"USA\",\"Awards\":\"Won 4 Oscars. Another 26 wins & 8 nominations.\",\"Poster\":\"https://m.media-amazon.com/images/M/MV5BZDg1OGQ4YzgtM2Y2NS00NjA3LWFjYTctMDRlMDI3NWE1OTUyXkEyXkFqcGdeQXVyMjUzOTY1NTc@._V1_SX300.jpg\",\"Ratings\":[{\"Source\":\"Internet Movie Database\",\"Value\":\"8.0/10\"},{\"Source\":\"Rotten Tomatoes\",\"Value\":\"98%\"},{\"Source\":\"Metacritic\",\"Value\":\"92/100\"}],\"Metascore\":\"92\",\"imdbRating\":\"8.0\",\"imdbVotes\":\"251,396\",\"imdbID\":\"tt0075686\",\"Type\":\"movie\",\"DVD\":\"N/A\",\"BoxOffice\":\"$38,251,425\",\"Production\":\"Rollins-Joffe Productions\",\"Website\":\"N/A\",\"Response\":\"True\"}"
# Print content of resp
content(resp)
## $Title
## [1] "Annie Hall"
## 
## $Year
## [1] "1977"
## 
## $Rated
## [1] "PG"
## 
## $Released
## [1] "20 Apr 1977"
## 
## $Runtime
## [1] "93 min"
## 
## $Genre
## [1] "Comedy, Romance"
## 
## $Director
## [1] "Woody Allen"
## 
## $Writer
## [1] "Woody Allen, Marshall Brickman"
## 
## $Actors
## [1] "Woody Allen, Diane Keaton, Tony Roberts, Carol Kane"
## 
## $Plot
## [1] "Neurotic New York comedian Alvy Singer falls in love with the ditzy Annie Hall."
## 
## $Language
## [1] "English, German"
## 
## $Country
## [1] "USA"
## 
## $Awards
## [1] "Won 4 Oscars. Another 26 wins & 8 nominations."
## 
## $Poster
## [1] "https://m.media-amazon.com/images/M/MV5BZDg1OGQ4YzgtM2Y2NS00NjA3LWFjYTctMDRlMDI3NWE1OTUyXkEyXkFqcGdeQXVyMjUzOTY1NTc@._V1_SX300.jpg"
## 
## $Ratings
## $Ratings[[1]]
## $Ratings[[1]]$Source
## [1] "Internet Movie Database"
## 
## $Ratings[[1]]$Value
## [1] "8.0/10"
## 
## 
## $Ratings[[2]]
## $Ratings[[2]]$Source
## [1] "Rotten Tomatoes"
## 
## $Ratings[[2]]$Value
## [1] "98%"
## 
## 
## $Ratings[[3]]
## $Ratings[[3]]$Source
## [1] "Metacritic"
## 
## $Ratings[[3]]$Value
## [1] "92/100"
## 
## 
## 
## $Metascore
## [1] "92"
## 
## $imdbRating
## [1] "8.0"
## 
## $imdbVotes
## [1] "251,396"
## 
## $imdbID
## [1] "tt0075686"
## 
## $Type
## [1] "movie"
## 
## $DVD
## [1] "N/A"
## 
## $BoxOffice
## [1] "$38,251,425"
## 
## $Production
## [1] "Rollins-Joffe Productions"
## 
## $Website
## [1] "N/A"
## 
## $Response
## [1] "True"

Importing data from the web (Part 2)

APIs & JSON

From JSON to R

In the simplest setting, fromJSON() can convert character strings that represent JSON data into a nicely structured R list. Give it a try!

TO DO :

Load the jsonlite package. It’s already installed on DataCamp’s servers. wine_json represents a JSON. Use fromJSON() to convert it to a list, named wine. Display the structure of wine

# Load the jsonlite package
library(jsonlite)

# wine_json is a JSON
wine_json <- '{"name":"Chateau Migraine", "year":1997, "alcohol_pct":12.4, "color":"red", "awarded":false}'

# Convert wine_json into a list: wine
wine <- fromJSON(wine_json)

# Print structure of wine
str(wine)
## List of 5
##  $ name       : chr "Chateau Migraine"
##  $ year       : int 1997
##  $ alcohol_pct: num 12.4
##  $ color      : chr "red"
##  $ awarded    : logi FALSE

Quandl API

As Filip showed in the video, fromJSON() also works if you pass a URL as a character string or the path to a local file that contains JSON data. Let’s try this out on the Quandl API, where you can fetch all sorts of financial and economic data.

TO DO :

quandl_url represents a URL. Use fromJSON() directly on this URL and store the result in quandl_data. Display the structure of quandl_data.

# jsonlite is preloaded

# Definition of quandl_url
quandl_url <- "https://www.quandl.com/api/v3/datasets/WIKI/FB/data.json?auth_token=i83asDsiWUUyfoypkgMz"

# Import Quandl data: quandl_data
quandl_data <- fromJSON(quandl_url)

# Print structure of quandl_data
str(quandl_data)
## List of 1
##  $ dataset_data:List of 10
##   ..$ limit       : NULL
##   ..$ transform   : NULL
##   ..$ column_index: NULL
##   ..$ column_names: chr [1:13] "Date" "Open" "High" "Low" ...
##   ..$ start_date  : chr "2012-05-18"
##   ..$ end_date    : chr "2018-03-27"
##   ..$ frequency   : chr "daily"
##   ..$ data        : chr [1:1472, 1:13] "2018-03-27" "2018-03-26" "2018-03-23" "2018-03-22" ...
##   ..$ collapse    : NULL
##   ..$ order       : NULL

OMDb API

In the video, you saw how easy it is to interact with an API once you know how to formulate requests. You also saw how to fetch all information on Rain Man from OMDb. Simply perform a GET() call, and next ask for the contents with the content() function. This content() function, which is part of the httr package, uses jsonlite behind the scenes to import the JSON data into R.

However, by now you also know that jsonlite can handle URLs itself. Simply passing the request URL to fromJSON() will get your data into R. In this exercise, you will be using this technique to compare the release year of two movies in the Open Movie Database.

TO DO :

Two URLs are included in the sample code, as well as a fromJSON() call to build sw4. Add a similar call to build sw3. Print out the element named Title of both sw4 and sw3. You can use the $ operator. What movies are we dealing with here? Write an expression that evaluates to TRUE if sw4 was released later than sw3. This information is stored in the Year element of the named lists.

# The package jsonlite is already loaded

# Definition of the URLs
url_sw4 <- "http://www.omdbapi.com/?apikey=72bc447a&i=tt0076759&r=json"
url_sw3 <- "http://www.omdbapi.com/?apikey=72bc447a&i=tt0121766&r=json"

# Import two URLs with fromJSON(): sw4 and sw3
sw4 <- fromJSON(url_sw4)
sw3 <- fromJSON(url_sw3)

# Print out the Title element of both lists
sw4$Title
## [1] "Star Wars: Episode IV - A New Hope"
sw3$Title
## [1] "Star Wars: Episode III - Revenge of the Sith"
# Is the release year of sw4 later than sw3?
sw3$Year < sw4$Year
## [1] FALSE

JSON & jsonlite

JSON practice (1)

JSON is built on two structures: objects and arrays. To help you experiment with these, two JSON strings are included in the sample code. It’s up to you to change them appropriately and then call jsonlite’s fromJSON() function on them each time.

TO DO:

Change the assignment of json1 such that the R vector after conversion contains the numbers 1 up to 6, in ascending order. Next, call fromJSON() on json1. Adapt the code for json2 such that it’s converted to a named list with two elements: a, containing the numbers 1, 2 and 3 and b, containing the numbers 4, 5 and 6. Next, call fromJSON() on json2.

# jsonlite is already loaded

# Challenge 1
json1 <- '[1, 2, 3, 4, 5, 6]'
fromJSON(json1)
## [1] 1 2 3 4 5 6
# Challenge 2
json2 <- '{"a": [1, 2, 3], "b": [4, 5, 6]}'
fromJSON(json2)
## $a
## [1] 1 2 3
## 
## $b
## [1] 4 5 6

JSON practice (2)

We prepared two more JSON strings in the sample code. Can you change them and call jsonlite’s fromJSON() function on them, similar to the previous exercise?

TO DO:

Remove characters from json1 to build a 2 by 2 matrix containing only 1, 2, 3 and 4. Call fromJSON() on json1. Add characters to json2 such that the data frame into which the JSON is converted contains an additional observation in the last row. For this observation, a equals 5 and b equals 6. Call fromJSON() one last time, on json2.

# jsonlite is already loaded

# Challenge 1
json1 <- '[[1, 2], [3, 4]]'
fromJSON(json1)
##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4
# Challenge 2
json2 <- '[{"a": 1, "b": 2}, {"a": 3, "b": 4}, {"a": 5, "b": 6}]'
fromJSON(json2)
##   a b
## 1 1 2
## 2 3 4
## 3 5 6

toJSON()

Apart from converting JSON to R with fromJSON(), you can also use toJSON() to convert R data to a JSON format. In its most basic use, you simply pass this function an R object to convert to a JSON. The result is an R object of the class json, which is basically a character string representing that JSON.

For this exercise, you will be working with a .csv file containing information on the amount of desalinated water that is produced around the world. As you’ll see, it contains a lot of missing values. This data can be found on the URL that is specified in the sample code.

TO DO:

Use a function of the utils package to import the .csv file directly from the URL specified in url_csv. Save the resulting data frame as water. Make sure that strings are not imported as factors. Convert the data frame water to a JSON. Call the resulting object water_json. Print out water_json.

# jsonlite is already loaded

# URL pointing to the .csv file
url_csv <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/water.csv"

# Import the .csv file located at url_csv
water <- read.csv(url_csv, stringsAsFactors = FALSE)

# Convert the data file according to the requirements
water_json <- toJSON(water)

# Print out water_json
water_json
## [{"water":"Algeria","X1992":0.064,"X2002":0.017},{"water":"American Samoa"},{"water":"Angola","X1992":0.0001,"X2002":0.0001},{"water":"Antigua and Barbuda","X1992":0.0033},{"water":"Argentina","X1992":0.0007,"X1997":0.0007,"X2002":0.0007},{"water":"Australia","X1992":0.0298,"X2002":0.0298},{"water":"Austria","X1992":0.0022,"X2002":0.0022},{"water":"Bahamas","X1992":0.0013,"X2002":0.0074},{"water":"Bahrain","X1992":0.0441,"X2002":0.0441,"X2007":0.1024},{"water":"Barbados","X2007":0.0146},{"water":"British Virgin Islands","X2007":0.0042},{"water":"Canada","X1992":0.0027,"X2002":0.0027},{"water":"Cape Verde","X1992":0.002,"X1997":0.0017},{"water":"Cayman Islands","X1992":0.0033},{"water":"Central African Rep."},{"water":"Chile","X1992":0.0048,"X2002":0.0048},{"water":"Colombia","X1992":0.0027,"X2002":0.0027},{"water":"Cuba","X1992":0.0069,"X1997":0.0069,"X2002":0.0069},{"water":"Cyprus","X1992":0.003,"X1997":0.003,"X2002":0.0335},{"water":"Czech Rep.","X1992":0.0002,"X2002":0.0002},{"water":"Denmark","X1992":0.015,"X2002":0.015},{"water":"Djibouti","X1992":0.0001,"X2002":0.0001},{"water":"Ecuador","X1992":0.0022,"X1997":0.0022,"X2002":0.0022},{"water":"Egypt","X1992":0.025,"X1997":0.025,"X2002":0.1},{"water":"El Salvador","X1992":0.0001,"X2002":0.0001},{"water":"Finland","X1992":0.0001,"X2002":0.0001},{"water":"France","X1992":0.0117,"X2002":0.0117},{"water":"Gibraltar","X1992":0.0077},{"water":"Greece","X1992":0.01,"X2002":0.01},{"water":"Honduras","X1992":0.0002,"X2002":0.0002},{"water":"Hungary","X1992":0.0002,"X2002":0.0002},{"water":"India","X1997":0.0005,"X2002":0.0005},{"water":"Indonesia","X1992":0.0187,"X2002":0.0187},{"water":"Iran","X1992":0.003,"X1997":0.003,"X2002":0.003,"X2007":0.2},{"water":"Iraq","X1997":0.0074,"X2002":0.0074},{"water":"Ireland","X1992":0.0002,"X2002":0.0002},{"water":"Israel","X1992":0.0256,"X2002":0.0256,"X2007":0.14},{"water":"Italy","X1992":0.0973,"X2002":0.0973},{"water":"Jamaica","X1992":0.0005,"X1997":0.0005,"X2002":0.0005},{"water":"Japan","X1997":0.04,"X2002":0.04},{"water":"Jordan","X1997":0.002,"X2007":0.0098},{"water":"Kazakhstan","X1997":1.328,"X2002":1.328},{"water":"Kuwait","X1992":0.507,"X1997":0.231,"X2002":0.4202},{"water":"Lebanon","X2007":0.0473},{"water":"Libya","X2002":0.018},{"water":"Malaysia","X1992":0.0043,"X2002":0.0043},{"water":"Maldives","X1992":0.0004},{"water":"Malta","X1992":0.024,"X1997":0.031,"X2002":0.031},{"water":"Marshall Islands","X1992":0.0007},{"water":"Mauritania","X1992":0.002,"X2002":0.002},{"water":"Mexico","X1992":0.0307,"X2002":0.0307},{"water":"Morocco","X1992":0.0034,"X1997":0.0034,"X2002":0.007},{"water":"Namibia","X1992":0.0003,"X2002":0.0003},{"water":"Netherlands Antilles","X1992":0.063},{"water":"Nicaragua","X1992":0.0002,"X2002":0.0002},{"water":"Nigeria","X1992":0.003,"X2002":0.003},{"water":"Norway","X1992":0.0001,"X2002":0.0001},{"water":"Oman","X1997":0.034,"X2002":0.034,"X2007":0.109},{"water":"Peru","X1992":0.0054,"X2002":0.0054},{"water":"Poland","X1992":0.007,"X2002":0.007},{"water":"Portugal","X1992":0.0016,"X2002":0.0016},{"water":"Qatar","X1992":0.065,"X1997":0.099,"X2002":0.099,"X2007":0.18},{"water":"Saudi Arabia","X1992":0.683,"X1997":0.727,"X2002":0.863,"X2007":1.033},{"water":"Senegal","X1992":0,"X2002":0},{"water":"Somalia","X1992":0.0001,"X2002":0.0001},{"water":"South Africa","X1992":0.018,"X2002":0.018},{"water":"Spain","X1992":0.1002,"X2002":0.1002},{"water":"Sudan","X1992":0.0004,"X1997":0.0004,"X2002":0.0004},{"water":"Sweden","X1992":0.0002,"X2002":0.0002},{"water":"Trinidad and 
Tobago","X2007":0.036},{"water":"Tunisia","X1992":0.008,"X2002":0.013},{"water":"Turkey","X1992":0.0005,"X2002":0.0005,"X2007":0.0005},{"water":"United Arab Emirates","X1992":0.163,"X1997":0.385,"X2007":0.95},{"water":"United Kingdom","X1992":0.0333,"X2002":0.0333},{"water":"United States","X1992":0.58,"X2002":0.58},{"water":"Venezuela","X1992":0.0052,"X2002":0.0052},{"water":"Yemen, Rep.","X1992":0.01,"X2002":0.01}]

Minify and prettify

JSONs can come in different formats. The same JSON can be written in a minified format, or in a pretty format with indentation, whitespace and new lines.

Unless you’re a computer, you surely prefer the pretty version. However, the standard form that toJSON() returns is the minified version, as it is more concise. You can change this behavior by setting the pretty argument inside toJSON() to TRUE. If you already have a JSON string, you can use prettify() or minify() to make the JSON pretty or as concise as possible.
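
As a quick illustration with a small hand-written JSON string (reusing the wine example from earlier), prettify() and minify() work like this:

# jsonlite is already loaded in the exercises; loaded here for completeness
library(jsonlite)

# A small hand-written JSON string
mini <- '{"name":"Chateau Migraine","year":1997,"alcohol_pct":12.4}'

# Add indentation, whitespace and new lines
prettify(mini)

# And back to the concise, single-line form
minify(prettify(mini))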

TO DO:

Convert the mtcars dataset, which is available in R by default, to a pretty JSON. Call the resulting JSON pretty_json. Print out pretty_json. Can you understand the output easily? Convert pretty_json to a minimal version using minify(). Store this version under a new variable, mini_json. Print out mini_json. Which version do you prefer, the pretty one or the minified one?

# jsonlite is already loaded

# Convert mtcars to a pretty JSON: pretty_json
pretty_json <- toJSON(mtcars,pretty = TRUE)

# Print pretty_json
head(pretty_json)
## [1] "[\n  {\n    \"mpg\": 21,\n    \"cyl\": 6,\n    \"disp\": 160,\n    \"hp\": 110,\n    \"drat\": 3.9,\n    \"wt\": 2.62,\n    \"qsec\": 16.46,\n    \"vs\": 0,\n    \"am\": 1,\n    \"gear\": 4,\n    \"carb\": 4,\n    \"_row\": \"Mazda RX4\"\n  },\n  {\n    \"mpg\": 21,\n    \"cyl\": 6,\n    \"disp\": 160,\n    \"hp\": 110,\n    \"drat\": 3.9,\n    \"wt\": 2.875,\n    \"qsec\": 17.02,\n    \"vs\": 0,\n    \"am\": 1,\n    \"gear\": 4,\n    \"carb\": 4,\n    \"_row\": \"Mazda RX4 Wag\"\n  },\n  {\n    \"mpg\": 22.8,\n    \"cyl\": 4,\n    \"disp\": 108,\n    \"hp\": 93,\n    \"drat\": 3.85,\n    \"wt\": 2.32,\n    \"qsec\": 18.61,\n    \"vs\": 1,\n    \"am\": 1,\n    \"gear\": 4,\n    \"carb\": 1,\n    \"_row\": \"Datsun 710\"\n  },\n  {\n    \"mpg\": 21.4,\n    \"cyl\": 6,\n    \"disp\": 258,\n    \"hp\": 110,\n    \"drat\": 3.08,\n    \"wt\": 3.215,\n    \"qsec\": 19.44,\n    \"vs\": 1,\n    \"am\": 0,\n    \"gear\": 3,\n    \"carb\": 1,\n    \"_row\": \"Hornet 4 Drive\"\n  },\n  {\n    \"mpg\": 18.7,\n    \"cyl\": 8,\n    \"disp\": 360,\n    \"hp\": 175,\n    \"drat\": 3.15,\n    \"wt\": 3.44,\n    \"qsec\": 17.02,\n    \"vs\": 0,\n    \"am\": 0,\n    \"gear\": 3,\n    \"carb\": 2,\n    \"_row\": \"Hornet Sportabout\"\n  },\n  {\n    \"mpg\": 18.1,\n    \"cyl\": 6,\n    \"disp\": 225,\n    \"hp\": 105,\n    \"drat\": 2.76,\n    \"wt\": 3.46,\n    \"qsec\": 20.22,\n    \"vs\": 1,\n    \"am\": 0,\n    \"gear\": 3,\n    \"carb\": 1,\n    \"_row\": \"Valiant\"\n  },\n  {\n    \"mpg\": 14.3,\n    \"cyl\": 8,\n    \"disp\": 360,\n    \"hp\": 245,\n    \"drat\": 3.21,\n    \"wt\": 3.57,\n    \"qsec\": 15.84,\n    \"vs\": 0,\n    \"am\": 0,\n    \"gear\": 3,\n    \"carb\": 4,\n    \"_row\": \"Duster 360\"\n  },\n  {\n    \"mpg\": 24.4,\n    \"cyl\": 4,\n    \"disp\": 146.7,\n    \"hp\": 62,\n    \"drat\": 3.69,\n    \"wt\": 3.19,\n    \"qsec\": 20,\n    \"vs\": 1,\n    \"am\": 0,\n    \"gear\": 4,\n    \"carb\": 2,\n    \"_row\": \"Merc 240D\"\n  },\n  {\n    \"mpg\": 22.8,\n    \"cyl\": 4,\n    \"disp\": 140.8,\n    \"hp\": 95,\n    \"drat\": 3.92,\n    \"wt\": 3.15,\n    \"qsec\": 22.9,\n    \"vs\": 1,\n    \"am\": 0,\n    \"gear\": 4,\n    \"carb\": 2,\n    \"_row\": \"Merc 230\"\n  },\n  {\n    \"mpg\": 19.2,\n    \"cyl\": 6,\n    \"disp\": 167.6,\n    \"hp\": 123,\n    \"drat\": 3.92,\n    \"wt\": 3.44,\n    \"qsec\": 18.3,\n    \"vs\": 1,\n    \"am\": 0,\n    \"gear\": 4,\n    \"carb\": 4,\n    \"_row\": \"Merc 280\"\n  },\n  {\n    \"mpg\": 17.8,\n    \"cyl\": 6,\n    \"disp\": 167.6,\n    \"hp\": 123,\n    \"drat\": 3.92,\n    \"wt\": 3.44,\n    \"qsec\": 18.9,\n    \"vs\": 1,\n    \"am\": 0,\n    \"gear\": 4,\n    \"carb\": 4,\n    \"_row\": \"Merc 280C\"\n  },\n  {\n    \"mpg\": 16.4,\n    \"cyl\": 8,\n    \"disp\": 275.8,\n    \"hp\": 180,\n    \"drat\": 3.07,\n    \"wt\": 4.07,\n    \"qsec\": 17.4,\n    \"vs\": 0,\n    \"am\": 0,\n    \"gear\": 3,\n    \"carb\": 3,\n    \"_row\": \"Merc 450SE\"\n  },\n  {\n    \"mpg\": 17.3,\n    \"cyl\": 8,\n    \"disp\": 275.8,\n    \"hp\": 180,\n    \"drat\": 3.07,\n    \"wt\": 3.73,\n    \"qsec\": 17.6,\n    \"vs\": 0,\n    \"am\": 0,\n    \"gear\": 3,\n    \"carb\": 3,\n    \"_row\": \"Merc 450SL\"\n  },\n  {\n    \"mpg\": 15.2,\n    \"cyl\": 8,\n    \"disp\": 275.8,\n    \"hp\": 180,\n    \"drat\": 3.07,\n    \"wt\": 3.78,\n    \"qsec\": 18,\n    \"vs\": 0,\n    \"am\": 0,\n    \"gear\": 3,\n    \"carb\": 3,\n    \"_row\": \"Merc 450SLC\"\n  },\n  {\n    \"mpg\": 10.4,\n    \"cyl\": 8,\n    \"disp\": 472,\n    \"hp\": 205,\n    \"drat\": 
2.93,\n    \"wt\": 5.25,\n    \"qsec\": 17.98,\n    \"vs\": 0,\n    \"am\": 0,\n    \"gear\": 3,\n    \"carb\": 4,\n    \"_row\": \"Cadillac Fleetwood\"\n  },\n  {\n    \"mpg\": 10.4,\n    \"cyl\": 8,\n    \"disp\": 460,\n    \"hp\": 215,\n    \"drat\": 3,\n    \"wt\": 5.424,\n    \"qsec\": 17.82,\n    \"vs\": 0,\n    \"am\": 0,\n    \"gear\": 3,\n    \"carb\": 4,\n    \"_row\": \"Lincoln Continental\"\n  },\n  {\n    \"mpg\": 14.7,\n    \"cyl\": 8,\n    \"disp\": 440,\n    \"hp\": 230,\n    \"drat\": 3.23,\n    \"wt\": 5.345,\n    \"qsec\": 17.42,\n    \"vs\": 0,\n    \"am\": 0,\n    \"gear\": 3,\n    \"carb\": 4,\n    \"_row\": \"Chrysler Imperial\"\n  },\n  {\n    \"mpg\": 32.4,\n    \"cyl\": 4,\n    \"disp\": 78.7,\n    \"hp\": 66,\n    \"drat\": 4.08,\n    \"wt\": 2.2,\n    \"qsec\": 19.47,\n    \"vs\": 1,\n    \"am\": 1,\n    \"gear\": 4,\n    \"carb\": 1,\n    \"_row\": \"Fiat 128\"\n  },\n  {\n    \"mpg\": 30.4,\n    \"cyl\": 4,\n    \"disp\": 75.7,\n    \"hp\": 52,\n    \"drat\": 4.93,\n    \"wt\": 1.615,\n    \"qsec\": 18.52,\n    \"vs\": 1,\n    \"am\": 1,\n    \"gear\": 4,\n    \"carb\": 2,\n    \"_row\": \"Honda Civic\"\n  },\n  {\n    \"mpg\": 33.9,\n    \"cyl\": 4,\n    \"disp\": 71.1,\n    \"hp\": 65,\n    \"drat\": 4.22,\n    \"wt\": 1.835,\n    \"qsec\": 19.9,\n    \"vs\": 1,\n    \"am\": 1,\n    \"gear\": 4,\n    \"carb\": 1,\n    \"_row\": \"Toyota Corolla\"\n  },\n  {\n    \"mpg\": 21.5,\n    \"cyl\": 4,\n    \"disp\": 120.1,\n    \"hp\": 97,\n    \"drat\": 3.7,\n    \"wt\": 2.465,\n    \"qsec\": 20.01,\n    \"vs\": 1,\n    \"am\": 0,\n    \"gear\": 3,\n    \"carb\": 1,\n    \"_row\": \"Toyota Corona\"\n  },\n  {\n    \"mpg\": 15.5,\n    \"cyl\": 8,\n    \"disp\": 318,\n    \"hp\": 150,\n    \"drat\": 2.76,\n    \"wt\": 3.52,\n    \"qsec\": 16.87,\n    \"vs\": 0,\n    \"am\": 0,\n    \"gear\": 3,\n    \"carb\": 2,\n    \"_row\": \"Dodge Challenger\"\n  },\n  {\n    \"mpg\": 15.2,\n    \"cyl\": 8,\n    \"disp\": 304,\n    \"hp\": 150,\n    \"drat\": 3.15,\n    \"wt\": 3.435,\n    \"qsec\": 17.3,\n    \"vs\": 0,\n    \"am\": 0,\n    \"gear\": 3,\n    \"carb\": 2,\n    \"_row\": \"AMC Javelin\"\n  },\n  {\n    \"mpg\": 13.3,\n    \"cyl\": 8,\n    \"disp\": 350,\n    \"hp\": 245,\n    \"drat\": 3.73,\n    \"wt\": 3.84,\n    \"qsec\": 15.41,\n    \"vs\": 0,\n    \"am\": 0,\n    \"gear\": 3,\n    \"carb\": 4,\n    \"_row\": \"Camaro Z28\"\n  },\n  {\n    \"mpg\": 19.2,\n    \"cyl\": 8,\n    \"disp\": 400,\n    \"hp\": 175,\n    \"drat\": 3.08,\n    \"wt\": 3.845,\n    \"qsec\": 17.05,\n    \"vs\": 0,\n    \"am\": 0,\n    \"gear\": 3,\n    \"carb\": 2,\n    \"_row\": \"Pontiac Firebird\"\n  },\n  {\n    \"mpg\": 27.3,\n    \"cyl\": 4,\n    \"disp\": 79,\n    \"hp\": 66,\n    \"drat\": 4.08,\n    \"wt\": 1.935,\n    \"qsec\": 18.9,\n    \"vs\": 1,\n    \"am\": 1,\n    \"gear\": 4,\n    \"carb\": 1,\n    \"_row\": \"Fiat X1-9\"\n  },\n  {\n    \"mpg\": 26,\n    \"cyl\": 4,\n    \"disp\": 120.3,\n    \"hp\": 91,\n    \"drat\": 4.43,\n    \"wt\": 2.14,\n    \"qsec\": 16.7,\n    \"vs\": 0,\n    \"am\": 1,\n    \"gear\": 5,\n    \"carb\": 2,\n    \"_row\": \"Porsche 914-2\"\n  },\n  {\n    \"mpg\": 30.4,\n    \"cyl\": 4,\n    \"disp\": 95.1,\n    \"hp\": 113,\n    \"drat\": 3.77,\n    \"wt\": 1.513,\n    \"qsec\": 16.9,\n    \"vs\": 1,\n    \"am\": 1,\n    \"gear\": 5,\n    \"carb\": 2,\n    \"_row\": \"Lotus Europa\"\n  },\n  {\n    \"mpg\": 15.8,\n    \"cyl\": 8,\n    \"disp\": 351,\n    \"hp\": 264,\n    \"drat\": 4.22,\n    \"wt\": 3.17,\n    \"qsec\": 14.5,\n    \"vs\": 0,\n   
 \"am\": 1,\n    \"gear\": 5,\n    \"carb\": 4,\n    \"_row\": \"Ford Pantera L\"\n  },\n  {\n    \"mpg\": 19.7,\n    \"cyl\": 6,\n    \"disp\": 145,\n    \"hp\": 175,\n    \"drat\": 3.62,\n    \"wt\": 2.77,\n    \"qsec\": 15.5,\n    \"vs\": 0,\n    \"am\": 1,\n    \"gear\": 5,\n    \"carb\": 6,\n    \"_row\": \"Ferrari Dino\"\n  },\n  {\n    \"mpg\": 15,\n    \"cyl\": 8,\n    \"disp\": 301,\n    \"hp\": 335,\n    \"drat\": 3.54,\n    \"wt\": 3.57,\n    \"qsec\": 14.6,\n    \"vs\": 0,\n    \"am\": 1,\n    \"gear\": 5,\n    \"carb\": 8,\n    \"_row\": \"Maserati Bora\"\n  },\n  {\n    \"mpg\": 21.4,\n    \"cyl\": 4,\n    \"disp\": 121,\n    \"hp\": 109,\n    \"drat\": 4.11,\n    \"wt\": 2.78,\n    \"qsec\": 18.6,\n    \"vs\": 1,\n    \"am\": 1,\n    \"gear\": 4,\n    \"carb\": 2,\n    \"_row\": \"Volvo 142E\"\n  }\n]"
# Minify pretty_json: mini_json
mini_json <- minify(pretty_json) 

# Print mini_json
head(mini_json)
## [1] "[{\"mpg\":21,\"cyl\":6,\"disp\":160,\"hp\":110,\"drat\":3.9,\"wt\":2.62,\"qsec\":16.46,\"vs\":0,\"am\":1,\"gear\":4,\"carb\":4,\"_row\":\"Mazda RX4\"},{\"mpg\":21,\"cyl\":6,\"disp\":160,\"hp\":110,\"drat\":3.9,\"wt\":2.875,\"qsec\":17.02,\"vs\":0,\"am\":1,\"gear\":4,\"carb\":4,\"_row\":\"Mazda RX4 Wag\"},{\"mpg\":22.8,\"cyl\":4,\"disp\":108,\"hp\":93,\"drat\":3.85,\"wt\":2.32,\"qsec\":18.61,\"vs\":1,\"am\":1,\"gear\":4,\"carb\":1,\"_row\":\"Datsun 710\"},{\"mpg\":21.4,\"cyl\":6,\"disp\":258,\"hp\":110,\"drat\":3.08,\"wt\":3.215,\"qsec\":19.44,\"vs\":1,\"am\":0,\"gear\":3,\"carb\":1,\"_row\":\"Hornet 4 Drive\"},{\"mpg\":18.7,\"cyl\":8,\"disp\":360,\"hp\":175,\"drat\":3.15,\"wt\":3.44,\"qsec\":17.02,\"vs\":0,\"am\":0,\"gear\":3,\"carb\":2,\"_row\":\"Hornet Sportabout\"},{\"mpg\":18.1,\"cyl\":6,\"disp\":225,\"hp\":105,\"drat\":2.76,\"wt\":3.46,\"qsec\":20.22,\"vs\":1,\"am\":0,\"gear\":3,\"carb\":1,\"_row\":\"Valiant\"},{\"mpg\":14.3,\"cyl\":8,\"disp\":360,\"hp\":245,\"drat\":3.21,\"wt\":3.57,\"qsec\":15.84,\"vs\":0,\"am\":0,\"gear\":3,\"carb\":4,\"_row\":\"Duster 360\"},{\"mpg\":24.4,\"cyl\":4,\"disp\":146.7,\"hp\":62,\"drat\":3.69,\"wt\":3.19,\"qsec\":20,\"vs\":1,\"am\":0,\"gear\":4,\"carb\":2,\"_row\":\"Merc 240D\"},{\"mpg\":22.8,\"cyl\":4,\"disp\":140.8,\"hp\":95,\"drat\":3.92,\"wt\":3.15,\"qsec\":22.9,\"vs\":1,\"am\":0,\"gear\":4,\"carb\":2,\"_row\":\"Merc 230\"},{\"mpg\":19.2,\"cyl\":6,\"disp\":167.6,\"hp\":123,\"drat\":3.92,\"wt\":3.44,\"qsec\":18.3,\"vs\":1,\"am\":0,\"gear\":4,\"carb\":4,\"_row\":\"Merc 280\"},{\"mpg\":17.8,\"cyl\":6,\"disp\":167.6,\"hp\":123,\"drat\":3.92,\"wt\":3.44,\"qsec\":18.9,\"vs\":1,\"am\":0,\"gear\":4,\"carb\":4,\"_row\":\"Merc 280C\"},{\"mpg\":16.4,\"cyl\":8,\"disp\":275.8,\"hp\":180,\"drat\":3.07,\"wt\":4.07,\"qsec\":17.4,\"vs\":0,\"am\":0,\"gear\":3,\"carb\":3,\"_row\":\"Merc 450SE\"},{\"mpg\":17.3,\"cyl\":8,\"disp\":275.8,\"hp\":180,\"drat\":3.07,\"wt\":3.73,\"qsec\":17.6,\"vs\":0,\"am\":0,\"gear\":3,\"carb\":3,\"_row\":\"Merc 450SL\"},{\"mpg\":15.2,\"cyl\":8,\"disp\":275.8,\"hp\":180,\"drat\":3.07,\"wt\":3.78,\"qsec\":18,\"vs\":0,\"am\":0,\"gear\":3,\"carb\":3,\"_row\":\"Merc 450SLC\"},{\"mpg\":10.4,\"cyl\":8,\"disp\":472,\"hp\":205,\"drat\":2.93,\"wt\":5.25,\"qsec\":17.98,\"vs\":0,\"am\":0,\"gear\":3,\"carb\":4,\"_row\":\"Cadillac Fleetwood\"},{\"mpg\":10.4,\"cyl\":8,\"disp\":460,\"hp\":215,\"drat\":3,\"wt\":5.424,\"qsec\":17.82,\"vs\":0,\"am\":0,\"gear\":3,\"carb\":4,\"_row\":\"Lincoln Continental\"},{\"mpg\":14.7,\"cyl\":8,\"disp\":440,\"hp\":230,\"drat\":3.23,\"wt\":5.345,\"qsec\":17.42,\"vs\":0,\"am\":0,\"gear\":3,\"carb\":4,\"_row\":\"Chrysler Imperial\"},{\"mpg\":32.4,\"cyl\":4,\"disp\":78.7,\"hp\":66,\"drat\":4.08,\"wt\":2.2,\"qsec\":19.47,\"vs\":1,\"am\":1,\"gear\":4,\"carb\":1,\"_row\":\"Fiat 128\"},{\"mpg\":30.4,\"cyl\":4,\"disp\":75.7,\"hp\":52,\"drat\":4.93,\"wt\":1.615,\"qsec\":18.52,\"vs\":1,\"am\":1,\"gear\":4,\"carb\":2,\"_row\":\"Honda Civic\"},{\"mpg\":33.9,\"cyl\":4,\"disp\":71.1,\"hp\":65,\"drat\":4.22,\"wt\":1.835,\"qsec\":19.9,\"vs\":1,\"am\":1,\"gear\":4,\"carb\":1,\"_row\":\"Toyota Corolla\"},{\"mpg\":21.5,\"cyl\":4,\"disp\":120.1,\"hp\":97,\"drat\":3.7,\"wt\":2.465,\"qsec\":20.01,\"vs\":1,\"am\":0,\"gear\":3,\"carb\":1,\"_row\":\"Toyota Corona\"},{\"mpg\":15.5,\"cyl\":8,\"disp\":318,\"hp\":150,\"drat\":2.76,\"wt\":3.52,\"qsec\":16.87,\"vs\":0,\"am\":0,\"gear\":3,\"carb\":2,\"_row\":\"Dodge 
Challenger\"},{\"mpg\":15.2,\"cyl\":8,\"disp\":304,\"hp\":150,\"drat\":3.15,\"wt\":3.435,\"qsec\":17.3,\"vs\":0,\"am\":0,\"gear\":3,\"carb\":2,\"_row\":\"AMC Javelin\"},{\"mpg\":13.3,\"cyl\":8,\"disp\":350,\"hp\":245,\"drat\":3.73,\"wt\":3.84,\"qsec\":15.41,\"vs\":0,\"am\":0,\"gear\":3,\"carb\":4,\"_row\":\"Camaro Z28\"},{\"mpg\":19.2,\"cyl\":8,\"disp\":400,\"hp\":175,\"drat\":3.08,\"wt\":3.845,\"qsec\":17.05,\"vs\":0,\"am\":0,\"gear\":3,\"carb\":2,\"_row\":\"Pontiac Firebird\"},{\"mpg\":27.3,\"cyl\":4,\"disp\":79,\"hp\":66,\"drat\":4.08,\"wt\":1.935,\"qsec\":18.9,\"vs\":1,\"am\":1,\"gear\":4,\"carb\":1,\"_row\":\"Fiat X1-9\"},{\"mpg\":26,\"cyl\":4,\"disp\":120.3,\"hp\":91,\"drat\":4.43,\"wt\":2.14,\"qsec\":16.7,\"vs\":0,\"am\":1,\"gear\":5,\"carb\":2,\"_row\":\"Porsche 914-2\"},{\"mpg\":30.4,\"cyl\":4,\"disp\":95.1,\"hp\":113,\"drat\":3.77,\"wt\":1.513,\"qsec\":16.9,\"vs\":1,\"am\":1,\"gear\":5,\"carb\":2,\"_row\":\"Lotus Europa\"},{\"mpg\":15.8,\"cyl\":8,\"disp\":351,\"hp\":264,\"drat\":4.22,\"wt\":3.17,\"qsec\":14.5,\"vs\":0,\"am\":1,\"gear\":5,\"carb\":4,\"_row\":\"Ford Pantera L\"},{\"mpg\":19.7,\"cyl\":6,\"disp\":145,\"hp\":175,\"drat\":3.62,\"wt\":2.77,\"qsec\":15.5,\"vs\":0,\"am\":1,\"gear\":5,\"carb\":6,\"_row\":\"Ferrari Dino\"},{\"mpg\":15,\"cyl\":8,\"disp\":301,\"hp\":335,\"drat\":3.54,\"wt\":3.57,\"qsec\":14.6,\"vs\":0,\"am\":1,\"gear\":5,\"carb\":8,\"_row\":\"Maserati Bora\"},{\"mpg\":21.4,\"cyl\":4,\"disp\":121,\"hp\":109,\"drat\":4.11,\"wt\":2.78,\"qsec\":18.6,\"vs\":1,\"am\":1,\"gear\":4,\"carb\":2,\"_row\":\"Volvo 142E\"}]"

Importing data from statistical software packages

haven

Import SAS data with haven

haven is an extremely easy-to-use package to import data from three software packages: SAS, STATA and SPSS. Depending on the software, you use different functions:

SAS: read_sas()
STATA: read_dta() (or read_stata(), which is identical)
SPSS: read_sav() or read_por(), depending on the file type.

All these functions take one key argument: the path to your local file. In fact, you can even pass a URL; haven will then automatically download the file for you before importing it.

You’ll be working with data on the age, gender, income, and purchase level (0 = low, 1 = high) of a group of individuals (Source: SAS). The information is stored in a SAS file, sales.sas7bdat, which is available in your current working directory.

TO DO.

Load the haven package; it’s already installed on DataCamp’s servers. Import the data file “sales.sas7bdat”. Call the imported data frame sales. Display the structure of sales with str(). Some columns represent categorical variables, so they should be factors.

# Load the haven package
library(haven)

# Import sales.sas7bdat: sales
sales <- read_sas("sales.sas7bdat")

# Display the structure of sales
str(sales)
## tibble [431 x 4] (S3: tbl_df/tbl/data.frame)
##  $ purchase: num [1:431] 0 0 1 1 0 0 0 0 0 0 ...
##  $ age     : num [1:431] 41 47 41 39 32 32 33 45 43 40 ...
##  $ gender  : chr [1:431] "Female" "Female" "Female" "Female" ...
##  $ income  : chr [1:431] "Low" "Low" "Low" "Low" ...
##  - attr(*, "label")= chr "SALES"

Import STATA data with haven

Next up are STATA data files; you can use read_dta() for these.

When inspecting the result of the read_dta() call, you will notice that one column will be imported as a labelled vector, an R equivalent for the common data structure in other statistical environments. In order to effectively continue working on the data in R, it’s best to change this data into a standard R class. To convert a variable of the class labelled to a factor, you’ll need haven’s as_factor() function.

In this exercise, you will work with data on yearly import and export numbers of sugar, both in USD and in weight. The data can be found at: http://assets.datacamp.com/production/course_1478/datasets/trade.dta

TO DO:

Import the data file directly from the URL using read_dta(), and store it as sugar. Print out the structure of sugar. The Date column has class labelled. Convert the values in the Date column of sugar to dates, using as.Date(as_factor(___)). Print out the structure of sugar once more. Looks better now?

# haven is already loaded

# Import the data from the URL: sugar
sugar <- read_dta("http://assets.datacamp.com/production/course_1478/datasets/trade.dta")

# Structure of sugar
str(sugar)
## tibble [10 x 5] (S3: tbl_df/tbl/data.frame)
##  $ Date    : dbl+lbl [1:10] 10,  9,  8,  7,  6,  5,  4,  3,  2,  1
##    ..@ label       : chr "Date"
##    ..@ format.stata: chr "%9.0g"
##    ..@ labels      : Named num [1:10] 1 2 3 4 5 6 7 8 9 10
##    .. ..- attr(*, "names")= chr [1:10] "2004-12-31" "2005-12-31" "2006-12-31" "2007-12-31" ...
##  $ Import  : num [1:10] 37664782 16316512 11082246 35677943 9879878 ...
##   ..- attr(*, "label")= chr "Import"
##   ..- attr(*, "format.stata")= chr "%9.0g"
##  $ Weight_I: num [1:10] 54029106 21584365 14526089 55034932 14806865 ...
##   ..- attr(*, "label")= chr "Weight_I"
##   ..- attr(*, "format.stata")= chr "%9.0g"
##  $ Export  : num [1:10] 5.45e+07 1.03e+08 3.79e+07 4.85e+07 7.15e+07 ...
##   ..- attr(*, "label")= chr "Export"
##   ..- attr(*, "format.stata")= chr "%9.0g"
##  $ Weight_E: num [1:10] 9.34e+07 1.58e+08 8.80e+07 1.12e+08 1.32e+08 ...
##   ..- attr(*, "label")= chr "Weight_E"
##   ..- attr(*, "format.stata")= chr "%9.0g"
##  - attr(*, "label")= chr "Written by R."
# Convert values in Date column to dates
sugar$Date <- as.Date(as_factor(sugar$Date))

# Structure of sugar again
str(sugar)
## tibble [10 x 5] (S3: tbl_df/tbl/data.frame)
##  $ Date    : Date[1:10], format: "2013-12-31" "2012-12-31" ...
##  $ Import  : num [1:10] 37664782 16316512 11082246 35677943 9879878 ...
##   ..- attr(*, "label")= chr "Import"
##   ..- attr(*, "format.stata")= chr "%9.0g"
##  $ Weight_I: num [1:10] 54029106 21584365 14526089 55034932 14806865 ...
##   ..- attr(*, "label")= chr "Weight_I"
##   ..- attr(*, "format.stata")= chr "%9.0g"
##  $ Export  : num [1:10] 5.45e+07 1.03e+08 3.79e+07 4.85e+07 7.15e+07 ...
##   ..- attr(*, "label")= chr "Export"
##   ..- attr(*, "format.stata")= chr "%9.0g"
##  $ Weight_E: num [1:10] 9.34e+07 1.58e+08 8.80e+07 1.12e+08 1.32e+08 ...
##   ..- attr(*, "label")= chr "Weight_E"
##   ..- attr(*, "format.stata")= chr "%9.0g"
##  - attr(*, "label")= chr "Written by R."

Import SPSS data with haven

The haven package can also import data files from SPSS. Again, importing the data is pretty straightforward. Depending on the SPSS data file you’re working with, you’ll need either read_sav() - for .sav files - or read_por() - for .por files.

In this exercise, you will work with data on four of the Big Five personality traits for 434 persons (Source: University of Bath). The Big Five is a psychological concept including, originally, five dimensions of personality to classify human personality. The SPSS dataset is called person.sav and is available in your working directory.

TO DO:

Use read_sav() to import the SPSS data in “person.sav”. Name the imported data frame traits. traits contains several missing values, or NAs. Run summary() on it to find out how many NAs are contained in each variable. Print out a subset of those individuals that scored high on Extroversion and on Agreeableness, i.e. scoring higher than 40 on each of these two categories. You can use subset() for this.
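
No solution code was recorded for this exercise, so here is a minimal sketch, assuming person.sav is in the working directory and that the columns are named Extroversion and Agreeableness as in the instructions (haven is already loaded):

# Import the SPSS data: traits
traits <- read_sav("person.sav")

# Summarize traits to see how many NAs each variable contains
summary(traits)

# Individuals scoring higher than 40 on both Extroversion and Agreeableness
subset(traits, Extroversion > 40 & Agreeableness > 40)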

Factorize, round two

In the last exercise you learned how to import a data file using the command read_sav(). With SPSS data files, it can also happen that some of the variables you import have the labelled class. This is done to keep all the labelling information that was originally present in the .sav and .por files. It’s advised to coerce (or change) these variables to factors or other standard R classes.

The data for this exercise involves information on employees and their demographic and economic attributes (Source: QRiE). The data can be found on the following URL:

http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/employee.sav

TO DO:

Import the SPSS data straight from the URL and store the resulting data frame as work. Display the summary of the GENDER column of work. This information doesn’t give you a lot of useful information, right? Convert the GENDER column in work to a factor, the class to denote categorical variables in R. Use as_factor(). Once again display the summary of the GENDER column. This time, the printout makes much more sense.

# haven is already loaded

# Import SPSS data from the URL: work
work <- read_sav("http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/employee.sav")

# Display summary of work$GENDER
summary(work$GENDER)
##    Length     Class      Mode 
##       474 character character
# Convert work$GENDER to a factor
work$GENDER <- as_factor(work$GENDER)

# Display summary of work$GENDER again
summary(work$GENDER)
## Female   Male 
##    216    258

foreign

Import STATA data with foreign (1)

The foreign package offers a simple function to import and read STATA data: read.dta().

In this exercise, you will import data on the US presidential elections in the year 2000. The data in florida.dta contains the total numbers of votes for each of the four candidates as well as the total number of votes per election area in the state of Florida (Source: Florida Department of State). The file is available in your working directory, you can download it here if you want to experiment some more.

TO DO:

Load the foreign package; it’s already installed on DataCamp’s servers. Import the data on the elections in Florida, “florida.dta”, and name the resulting data frame florida. Use read.dta() without specifying extra arguments. Check out the last 6 observations of florida with tail()

# Load the foreign package
#library(foreign)

# Import florida.dta and name the resulting data frame florida
#florida <- read.dta("florida.dta")

# Check tail() of florida
#tail(florida)

Import STATA data with foreign (2)

Data can be very diverse, going from character vectors to categorical variables, dates and more. It’s in these cases that the additional arguments of read.dta() will come in handy.

The arguments you will use most often are convert.dates, convert.factors, missing.type and convert.underscore. Their meaning is pretty straightforward, as Filip explained in the video. It’s all about correctly converting STATA data to standard R data structures. Type ?read.dta to find out about the default values.

The dataset for this exercise contains socio-economic measures and access to education for different individuals (Source: World Bank). This data is available as edequality.dta, which is located in the worldbank folder in your working directory.

TO DO:

Specify the path to the file using file.path(). Call it path. Remember the “edequality.dta” file is located in the “worldbank” folder.

Use the path variable to import the data file in three different ways; each time show its structure with str(), as sketched below:

edu_equal_1: by passing only the file path to read.dta().
edu_equal_2: by passing the file path, and setting convert.factors to FALSE.
edu_equal_3: by passing the file path, and setting convert.underscore to TRUE.
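
A sketch of the three imports, assuming the worldbank folder sits in the working directory:

# Specify the path to edequality.dta
path <- file.path("worldbank", "edequality.dta")

# Default import
edu_equal_1 <- read.dta(path)
str(edu_equal_1)

# Keep labelled values as-is instead of converting them to factors
edu_equal_2 <- read.dta(path, convert.factors = FALSE)
str(edu_equal_2)

# Convert underscores in variable names to dots
edu_equal_3 <- read.dta(path, convert.underscore = TRUE)
str(edu_equal_3)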

Import SPSS data with foreign (1)

All great things come in pairs. Where foreign provided read.dta() to read Stata data, there’s also read.spss() to read SPSS data files. To get a data frame, make sure to set to.data.frame = TRUE inside read.spss().

In this exercise, you’ll be working with socio-economic variables from different countries (Source: Quantitative Data Analysis in Education). The SPSS data is in a file called international.sav, which is in your working directory. You can also download it here if you want to play around with it some more.

TO DO:

Import the data file “international.sav” and have R convert it to a data frame. Store this data frame as demo. Create a boxplot of the gdp variable of demo.

# foreign is already loaded
library(foreign)
# Import international.sav as a data frame: demo
demo <- read.spss("international.sav", to.data.frame = TRUE)

# Create boxplot of gdp variable of demo
boxplot(demo$gdp)

Import SPSS data with foreign (2)

In the previous exercise, you used the to.data.frame argument inside read.spss(). There are many other ways in which to customize the way your SPSS data is imported.

In this exercise you will experiment with another argument, use.value.labels. It specifies whether variables with value labels should be converted into R factors with levels named accordingly. The argument is TRUE by default, which means that so-called labelled variables inside SPSS are converted to factors in R.

You’ll again be working with the international.sav data, which is available in your current working directory.

TO DO.

Import the data file “international.sav” as a data frame, demo_1. Print the first few rows of demo_1 using the head() function. Import the data file “international.sav” as a data frame, demo_2, but this time in a way such that variables with value labels are not converted to R factors. Again, print the first few rows of demo_2. Can you tell the difference between the two data frames?

# foreign is already loaded

# Import international.sav as demo_1
demo_1 <- read.spss("international.sav", to.data.frame = TRUE)

# Print out the head of demo_1
head(demo_1)
##   id              country  contint m_illit f_illit lifeexpt  gdp
## 1  1 Argentina            Americas     3.0     3.0       16 3375
## 2  2 Benin                  Africa    45.2    74.5        7  521
## 3  3 Burundi                Africa    33.2    48.1        5   86
## 4  4 Chile                Americas     4.2     4.4       14 4523
## 5  5 Dominican Republic   Americas    12.0    12.7       12 2408
## 6  6 El Salvador          Americas    17.6    22.9       11 2302
# Import international.sav as demo_2
demo_2 <- read.spss("international.sav", to.data.frame = TRUE, use.value.labels = FALSE)

# Print out the head of demo_2
head(demo_2)
##   id              country contint m_illit f_illit lifeexpt  gdp
## 1  1 Argentina                  2     3.0     3.0       16 3375
## 2  2 Benin                      1    45.2    74.5        7  521
## 3  3 Burundi                    1    33.2    48.1        5   86
## 4  4 Chile                      2     4.2     4.4       14 4523
## 5  5 Dominican Republic         2    12.0    12.7       12 2408
## 6  6 El Salvador                2    17.6    22.9       11 2302

10.- Cleaning Data in R

Common Data Problems

Data type constraints

Converting data types

Throughout this chapter, you’ll be working with San Francisco bike share ride data called bike_share_rides. It contains information on start and end stations of each trip, the trip duration, and some user information.

Before beginning to analyze any dataset, it’s important to take a look at the different types of columns you’ll be working with, which you can do using glimpse().

In this exercise, you’ll take a look at the data types contained in bike_share_rides and see how an incorrect data type can flaw your analysis.

dplyr and assertive are loaded and bike_share_rides is available.

EXERCISES

Examine the data types of the columns of bike_share_rides.

bike_share_rides <- readRDS("./Data/bike_share_rides_ch1_1.rds")

Get a summary of the user_birth_year column of bike_share_rides.

# Glimpse at bike_share_rides
glimpse(bike_share_rides)
## Rows: 35,229
## Columns: 10
## $ ride_id         <int> 52797, 54540, 87695, 45619, 70832, 96135, 29928, 83...
## $ date            <chr> "2017-04-15", "2017-04-19", "2017-04-14", "2017-04-...
## $ duration        <chr> "1316.15 minutes", "8.13 minutes", "24.85 minutes",...
## $ station_A_id    <dbl> 67, 21, 16, 58, 16, 6, 5, 16, 5, 81, 30, 16, 16, 67...
## $ station_A_name  <chr> "San Francisco Caltrain Station 2  (Townsend St at ...
## $ station_B_id    <dbl> 89, 64, 355, 368, 81, 66, 350, 91, 62, 81, 109, 10,...
## $ station_B_name  <chr> "Division St at Potrero Ave", "5th St at Brannan St...
## $ bike_id         <dbl> 1974, 860, 2263, 1417, 507, 75, 388, 239, 1449, 328...
## $ user_gender     <chr> "Male", "Male", "Male", "Male", "Male", "Male", "Ma...
## $ user_birth_year <dbl> 1972, 1986, 1993, 1981, 1981, 1988, 1993, 1996, 199...
# Summary of user_birth_year
summary(bike_share_rides$user_birth_year)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1900    1979    1986    1984    1991    2001

Add a new column to bike_share_rides called user_birth_year_fct, which contains user_birth_year, converted to a factor.

# Glimpse at bike_share_rides
glimpse(bike_share_rides)
## Rows: 35,229
## Columns: 10
## $ ride_id         <int> 52797, 54540, 87695, 45619, 70832, 96135, 29928, 83...
## $ date            <chr> "2017-04-15", "2017-04-19", "2017-04-14", "2017-04-...
## $ duration        <chr> "1316.15 minutes", "8.13 minutes", "24.85 minutes",...
## $ station_A_id    <dbl> 67, 21, 16, 58, 16, 6, 5, 16, 5, 81, 30, 16, 16, 67...
## $ station_A_name  <chr> "San Francisco Caltrain Station 2  (Townsend St at ...
## $ station_B_id    <dbl> 89, 64, 355, 368, 81, 66, 350, 91, 62, 81, 109, 10,...
## $ station_B_name  <chr> "Division St at Potrero Ave", "5th St at Brannan St...
## $ bike_id         <dbl> 1974, 860, 2263, 1417, 507, 75, 388, 239, 1449, 328...
## $ user_gender     <chr> "Male", "Male", "Male", "Male", "Male", "Male", "Ma...
## $ user_birth_year <dbl> 1972, 1986, 1993, 1981, 1981, 1988, 1993, 1996, 199...
# Summary of user_birth_year
summary(bike_share_rides$user_birth_year)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1900    1979    1986    1984    1991    2001
# Convert user_birth_year to factor: user_birth_year_fct
bike_share_rides <- bike_share_rides %>%
  mutate(user_birth_year_fct = as.factor(user_birth_year))

Assert that the user_birth_year_fct is a factor to confirm the conversion.

# Assert user_birth_year_fct is a factor
assert_is_factor(bike_share_rides$user_birth_year_fct)

# Summary of user_birth_year_fct
summary(bike_share_rides$user_birth_year_fct)
## 1900 1902 1923 1931 1938 1939 1941 1942 1943 1945 1946 1947 1948 1949 1950 1951 
##    1    7    2   23    2    1    3   10    4   16    5   24    9   30   37   25 
## 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 
##   70   49   65   66  112   62  156   99  196  161  256  237  245  349  225  363 
## 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 
##  365  331  370  548  529  527  563  601  481  541  775  876  825 1016 1056 1262 
## 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 
## 1157 1318 1606 1672 2135 1872 2062 1582 1703 1498 1476 1185  813  358  365  348 
## 2000 2001 
##  473   30

Trimming strings

In the previous exercise, you were able to identify the correct data type and convert user_birth_year to the correct type, allowing you to extract counts that gave you a bit more insight into the dataset.

Another common dirty data problem is having extra bits like percent signs or periods in numbers, causing them to be read in as characters. In order to be able to crunch these numbers, the extra bits need to be removed and the numbers need to be converted from character to numeric. In this exercise, you’ll need to convert the duration column from character to numeric, but before this can happen, the word “minutes” needs to be removed from each value.

dplyr, assertive, and stringr are loaded and bike_share_rides is available.

EXERCISES.

Use str_remove() to remove “minutes” from the duration column of bike_share_rides. Add this as a new column called duration_trimmed.

bike_share_rides <- bike_share_rides %>%
  # Remove 'minutes' from duration: duration_trimmed
  mutate(duration_trimmed = str_remove(duration, "minutes"),
         # Convert duration_trimmed to numeric: duration_mins
         duration_mins = as.numeric(duration_trimmed))

Convert the duration_trimmed column to a numeric type and add this as a new column called duration_mins.

Glimpse at bike_share_rides and assert that the duration_mins column is numeric.

# Glimpse at bike_share_rides
glimpse(bike_share_rides)
## Rows: 35,229
## Columns: 13
## $ ride_id             <int> 52797, 54540, 87695, 45619, 70832, 96135, 29928...
## $ date                <chr> "2017-04-15", "2017-04-19", "2017-04-14", "2017...
## $ duration            <chr> "1316.15 minutes", "8.13 minutes", "24.85 minut...
## $ station_A_id        <dbl> 67, 21, 16, 58, 16, 6, 5, 16, 5, 81, 30, 16, 16...
## $ station_A_name      <chr> "San Francisco Caltrain Station 2  (Townsend St...
## $ station_B_id        <dbl> 89, 64, 355, 368, 81, 66, 350, 91, 62, 81, 109,...
## $ station_B_name      <chr> "Division St at Potrero Ave", "5th St at Branna...
## $ bike_id             <dbl> 1974, 860, 2263, 1417, 507, 75, 388, 239, 1449,...
## $ user_gender         <chr> "Male", "Male", "Male", "Male", "Male", "Male",...
## $ user_birth_year     <dbl> 1972, 1986, 1993, 1981, 1981, 1988, 1993, 1996,...
## $ user_birth_year_fct <fct> 1972, 1986, 1993, 1981, 1981, 1988, 1993, 1996,...
## $ duration_trimmed    <chr> "1316.15 ", "8.13 ", "24.85 ", "6.35 ", "9.8 ",...
## $ duration_mins       <dbl> 1316.15, 8.13, 24.85, 6.35, 9.80, 17.47, 16.52,...
# Assert duration_mins is numeric
assert_is_numeric(bike_share_rides$duration_mins)

Calculate the mean of duration_mins.

# Calculate mean duration
mean(bike_share_rides$duration_mins)
## [1] 13.06214

Range constraints

Ride duration constraints

Values that are out of range can throw off an analysis, so it’s important to catch them early on. In this exercise, you’ll be examining the duration_min column more closely. Bikes are not allowed to be kept out for more than 24 hours, or 1440 minutes at a time, but issues with some of the bikes caused inaccurate recording of the time they were returned.

In this exercise, you’ll replace erroneous data with the range limit (1440 minutes), however, you could just as easily replace these values with NAs.

dplyr, assertive, and ggplot2 are loaded and bike_share_rides is available.

EXERCISES:

Create a histogram of duration_min, using breaks at the minimum of the column, 0, 1440, and the maximum of the column.

# The duration column created earlier is named duration_mins; rename it to
# duration_min to match the exercise
colnames(bike_share_rides)[13] <- "duration_min"

# Create breaks
breaks <- c(min(bike_share_rides$duration_min), 0, 1440, max(bike_share_rides$duration_min))

# Create a histogram of duration_min
ggplot(bike_share_rides, aes(duration_min)) +
  geom_histogram(breaks = breaks)

Replace the values of duration_min that are greater than 1440 minutes (24 hours) with 1440. Add this to bike_share_rides as a new column called duration_min_const.

# Create breaks
breaks <- c(min(bike_share_rides$duration_min), 0, 1440, max(bike_share_rides$duration_min))
# Create a histogram of duration_min
ggplot(bike_share_rides, aes(duration_min)) +
  geom_histogram(breaks = breaks)

# duration_min_const: replace vals of duration_min > 1440 with 1440
bike_share_rides <- bike_share_rides %>%
  mutate(duration_min_const = replace(duration_min, duration_min > 1440, 1440))

Assert that all values of duration_min_const are between 0 and 1440.

# Make sure all values of duration_min_const are between 0 and 1440
assert_all_are_in_closed_range(bike_share_rides$duration_min_const, lower = 0, upper = 1440)
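
As mentioned above, capping is not the only option; a quick sketch of the alternative of replacing out-of-range durations with NA instead:

# Alternative: mark out-of-range durations as missing instead of capping them
bike_share_rides %>%
  mutate(duration_min_na = replace(duration_min, duration_min > 1440, NA)) %>%
  # Count how many rides were flagged as missing
  summarize(n_over_limit = sum(is.na(duration_min_na)))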

Back to the future

Something has gone wrong and it looks like you have data with dates from the future, which is way outside of the date range you expected to be working with. To fix this, you’ll need to remove any rides from the dataset that have a date in the future. Before you can do this, the date column needs to be converted from a character to a Date. Having these as Date objects will make it much easier to figure out which rides are from the future, since R makes it easy to check if one Date object is before (<) or after (>) another.

dplyr and assertive are loaded and bike_share_rides is available.

EXERCISEs:

Convert the date column of bike_share_rides from character to the Date data type.

library(lubridate)
# Convert date to Date type
bike_share_rides <- bike_share_rides %>%
  mutate(date = as.Date(date))

Assert that all values in the date column happened sometime in the past and not in the future.

# Make sure all dates are in the past
assert_all_are_in_past(bike_share_rides$date)
## Warning: Coercing bike_share_rides$date to class 'POSIXct'.

Filter bike_share_rides to get only the rides from the past or today, and save this as bike_share_rides_past.

# Filter for rides that occurred before or on today's date
bike_share_rides_past <- bike_share_rides %>%
  filter(date <= today())

Assert that the dates in bike_share_rides_past occurred only in the past.

# Make sure all dates from bike_share_rides_past are in the past
assert_all_are_in_past(bike_share_rides_past$date)
## Warning: Coercing bike_share_rides_past$date to class 'POSIXct'.

Uniqueness constraints

Full duplicates

You’ve been notified that an update has been made to the bike sharing data pipeline to make it more efficient, but that duplicates are more likely to be generated as a result. To make sure that you can continue using the same scripts to run your weekly analyses about ride statistics, you’ll need to ensure that any duplicates in the dataset are removed first.

When multiple rows of a data frame share the same values for all columns, they’re full duplicates of each other. Removing duplicates like this is important, since having the same value repeated multiple times can alter summary statistics like the mean and median. Each ride, including its ride_id should be unique.

dplyr is loaded and bike_share_rides is available.

EXERCISES:

Get the total number of full duplicates in bike_share_rides.

# Count the number of full duplicates
sum(duplicated(bike_share_rides))
## [1] 0

Remove all full duplicates from bike_share_rides and save the new data frame as bike_share_rides_unique.

# Remove duplicates
bike_share_rides_unique <- distinct(bike_share_rides)

Get the total number of full duplicates in the new bike_share_rides_unique data frame.

# Count the full duplicates in bike_share_rides_unique
sum(duplicated(bike_share_rides_unique))
## [1] 0

Removing partial duplicates

Now that you’ve identified and removed the full duplicates, it’s time to check for partial duplicates. Partial duplicates are a bit trickier to deal with than full duplicates. In this exercise, you’ll first identify any partial duplicates and then practice the most common technique to deal with them, which involves dropping all partial duplicates, keeping only the first.

dplyr is loaded and bike_share_rides is available.

EXERCISES:

Count the number of occurrences of each ride_id.

# Find duplicated ride_ids
bike_share_rides %>% 
  # Count the number of occurrences of each ride_id
  count(ride_id) %>% 
  # Filter for rows with a count > 1
  filter(n>1)
## # A tibble: 0 x 2
## # ... with 2 variables: ride_id <int>, n <int>

Remove full and partial duplicates from bike_share_rides based on ride_id only, keeping all columns.

# Find duplicated ride_ids
bike_share_rides %>% 
  count(ride_id) %>% 
  filter(n > 1)
## # A tibble: 0 x 2
## # ... with 2 variables: ride_id <int>, n <int>

Store this as bike_share_rides_unique.

# Remove full and partial duplicates
bike_share_rides_unique <- bike_share_rides %>%
  # Only based on ride_id instead of all cols
  distinct(ride_id, .keep_all = TRUE)

Aggregating partial duplicates

Another way of handling partial duplicates is to compute a summary statistic of the values that differ between partial duplicates, such as mean, median, maximum, or minimum. This can come in handy when you’re not sure how your data was collected and want an average, or if based on domain knowledge, you’d rather have too high of an estimate than too low of an estimate (or vice versa).

dplyr is loaded and bike_share_rides is available.

EXERCISES.

Group bike_share_rides by ride_id and date.

Add a column called duration_min_avg that contains the mean ride duration for the row’s ride_id and date.

Remove duplicates based on ride_id and date, keeping all columns of the data frame.

Remove the duration_min column.

asdf <- bike_share_rides %>%
  # Group by ride_id and date
  group_by(ride_id, date) %>%
  # Add duration_min_avg column
  mutate(duration_min_avg = mean(duration_min)) %>%
  # Remove duplicates based on ride_id and date, keep all cols
  distinct(ride_id, date, .keep_all = TRUE) %>%
  # Remove duration_min column
  select(-duration_min) %>% top_n(n=10)
## Selecting by duration_min_avg
head(asdf)
## # A tibble: 6 x 14
## # Groups:   ride_id, date [6]
##   ride_id date       duration station_A_id station_A_name station_B_id
##     <int> <date>     <chr>           <dbl> <chr>                 <dbl>
## 1   52797 2017-04-15 1316.15~           67 San Francisco~           89
## 2   54540 2017-04-19 8.13 mi~           21 Montgomery St~           64
## 3   87695 2017-04-14 24.85 m~           16 Steuart St at~          355
## 4   45619 2017-04-03 6.35 mi~           58 Market St at ~          368
## 5   70832 2017-04-10 9.8 min~           16 Steuart St at~           81
## 6   96135 2017-04-18 17.47 m~            6 The Embarcade~           66
## # ... with 8 more variables: station_B_name <chr>, bike_id <dbl>,
## #   user_gender <chr>, user_birth_year <dbl>, user_birth_year_fct <fct>,
## #   duration_trimmed <chr>, duration_min_const <dbl>, duration_min_avg <dbl>

Categorical and Text Data

Checking membership

Not a member

Now that you’ve practiced identifying membership constraint problems, it’s time to fix these problems in a new dataset. Throughout this chapter, you’ll be working with a dataset called sfo_survey, containing survey responses from passengers taking flights from San Francisco International Airport (SFO). Participants were asked questions about the airport’s cleanliness, wait times, safety, and their overall satisfaction.

There were a few issues during data collection that resulted in some inconsistencies in the dataset. In this exercise, you’ll be working with the dest_size column, which categorizes the size of the destination airport that the passengers were flying to. A data frame called dest_sizes is available that contains all the possible destination sizes. Your mission is to find rows with invalid dest_sizes and remove them from the data frame.

dplyr has been loaded and sfo_survey and dest_sizes are available.

sfo_survey <- readRDS("./Data/sfo_survey_ch2_1.rds")
EXERCISES:

Count the number of occurrences of each dest_size in sfo_survey.

# Count the number of occurrences of dest_size
sfo_survey %>%
  count(dest_size)
##   dest_size    n
## 1   Small      1
## 2       Hub    1
## 3       Hub 1756
## 4     Large  143
## 5   Large      1
## 6    Medium  682
## 7     Small  225
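
The removal step described in the exercise text is not shown in the output above; here is a hedged sketch, assuming dest_sizes has a dest_size column listing the four valid values:

# Keep only rows whose dest_size appears in dest_sizes
sfo_survey %>%
  semi_join(dest_sizes, by = "dest_size") %>%
  count(dest_size)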

Categorical data problems

Identifying inconsistency

In the video exercise, you learned about different kinds of inconsistencies that can occur within categories, making it look like a variable has more categories than it should.

In this exercise, you’ll continue working with the sfo_survey dataset. You’ll examine the dest_size column again as well as the cleanliness column and determine what kind of issues, if any, these two categorical variables face.

dplyr is loaded and sfo_survey is available.

EXERCISES:

Count the number of occurrences of each category of the dest_size variable of sfo_survey.

# Count dest_size
sfo_survey %>%
  count(dest_size)
##   dest_size    n
## 1   Small      1
## 2       Hub    1
## 3       Hub 1756
## 4     Large  143
## 5   Large      1
## 6    Medium  682
## 7     Small  225

Count the number of occurrences of each category of the cleanliness variable of sfo_survey.

# Count cleanliness
sfo_survey %>%
  count(cleanliness)
##      cleanliness    n
## 1        Average  433
## 2          Clean  970
## 3          Dirty    2
## 4 Somewhat clean 1254
## 5 Somewhat dirty   30
## 6           <NA>  120

Correcting inconsistency

Now that you’ve identified that dest_size has whitespace inconsistencies and cleanliness has capitalization inconsistencies, you’ll use the new tools at your disposal to fix the inconsistent values in sfo_survey instead of removing the data points entirely, which could add bias to your dataset if more than 5% of the data points need to be dropped.

dplyr and stringr are loaded and sfo_survey is available.

EXERCISES:

Add a column to sfo_survey called dest_size_trimmed that contains the values in the dest_size column with all leading and trailing whitespace removed.

  # Add new columns to sfo_survey
sfo_survey <- sfo_survey %>%
  # dest_size_trimmed: dest_size without whitespace
  mutate(dest_size_trimmed = str_trim(dest_size),
         # cleanliness_lower: cleanliness converted to lowercase
         cleanliness_lower = str_to_lower(cleanliness))

Add another column called cleanliness_lower that contains the values in the cleanliness column converted to all lowercase.

# Count values of dest_size_trimmed
sfo_survey %>%
  count(dest_size_trimmed)
##   dest_size_trimmed    n
## 1               Hub 1757
## 2             Large  144
## 3            Medium  682
## 4             Small  226

Count the number of occurrences of each category in dest_size_trimmed. Count the number of occurrences of each category in cleanliness_lower.

# Count values of cleanliness_lower
sfo_survey %>%
  count(cleanliness_lower)
##   cleanliness_lower    n
## 1           average  433
## 2             clean  970
## 3             dirty    2
## 4    somewhat clean 1254
## 5    somewhat dirty   30
## 6              <NA>  120

Collapsing categories

One of the tablets that participants filled out the sfo_survey on was not properly configured, allowing the response for dest_region to be free text instead of a dropdown menu. This resulted in some inconsistencies in the dest_region variable that you’ll need to correct in this exercise to ensure that the numbers you report to your boss are as accurate as possible.

dplyr and forcats are loaded and sfo_survey is available.

EXERCISES.

Count the categories of dest_region.

# Count categories of dest_region
sfo_survey %>%
  count(dest_region) %>% top_n(n=10)
## Selecting by n
##             dest_region   n
## 1                  Asia 260
## 2 Australia/New Zealand  66
## 3         Canada/Mexico 220
## 4 Central/South America  29
## 5               East US 498
## 6                Europe 401
## 7           Middle East  79
## 8            Midwest US 281
## 9               West US 975
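
The collapsing step itself is not shown above, and the truncated count hides the inconsistent spellings, so the variants in this sketch ("EU", "eur", "Europ") are hypothetical placeholders for whatever free-text values appear in your data:

library(forcats)

# Collapse the hypothetical misspellings into the Europe category
sfo_survey %>%
  mutate(dest_region = fct_collapse(dest_region,
                                    Europe = c("EU", "eur", "Europ"))) %>%
  count(dest_region)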

Cleaning text data

Detecting inconsistent text data

You’ve recently received some news that the customer support team wants to ask the SFO survey participants some follow-up questions. However, the auto-dialer that the call center uses isn’t able to parse all of the phone numbers since they’re all in different formats. After some investigation, you found that some phone numbers are written with hyphens (-) and some are written with parentheses ((,)). In this exercise, you’ll figure out which phone numbers have these issues so that you know which ones need fixing.

dplyr and stringr are loaded, and sfo_survey is available.

EXERCISES:

Filter for rows with phone numbers that contain “-”s.

x <- read.csv("./Data/x1.csv")
x <- x[,c(-1)]
y <- read.csv("./Data/y.csv") 
y <- y[,c(-1)]
z <- read.csv("./Data/z.csv")
z <- z[,c(-1)]
sfo_survey <- bind_cols(x,y,z)
## New names:
## * NA -> ...1
## * NA -> ...2
## * NA -> ...3
write.csv(sfo_survey, "./Data/sfo_survey.csv")
sfo_survey <- read.csv("./Data/sfo_survey.csv", sep = ",", col.names = c("id", "airline", "destination", "phone"))
sfo_survey <- sfo_survey %>% select(-id)
sfo_survey <- bind_cols(x,y,z)
## New names:
## * NA -> ...1
## * NA -> ...2
## * NA -> ...3
colnames(sfo_survey)[3] <- "phone"
# Filter for rows with "-" in the phone column
sfo_survey %>%
  filter(str_detect(sfo_survey$phone,"-")) %>% top_n(n=10)
## Selecting by phone
## # A tibble: 11 x 3
##    ...1       ...2                 phone       
##    <chr>      <chr>                <chr>       
##  1 AIR CANADA CALGARY              962-918-6117
##  2 UNITED     CHICAGO-O'HARE       998-931-4783
##  3 AMERICAN   DALLAS-FT. WORTH     962-918-6117
##  4 ALASKA     WASHINGTON DC-DULLES 998-931-4783
##  5 DELTA      MINNEAPOLIS-ST. PAUL 962-918-6117
##  6 AIR CANADA VANCOUVER            998-931-4783
##  7 UNITED     AUSTIN               962-918-6117
##  8 ALASKA     PORTLAND             998-931-4783
##  9 FRONTIER   DENVER               962-918-6117
## 10 SOUTHWEST  LAS VEGAS            998-931-4783
## 11 SOUTHWEST  DENVER               962-918-6117

Filter for rows with phone numbers that contain “(”, or “)”. Remember to use fixed() when searching for parentheses.

# Filter for rows with "(" or ")" in the phone column
sfo_survey %>%
  filter(str_detect(phone, fixed("(")) | str_detect(phone, fixed(")"))) %>% top_n(n=10)
## Selecting by phone
## # A tibble: 11 x 3
##    ...1        ...2        phone         
##    <chr>       <chr>       <chr>         
##  1 UNITED      PHOENIX     (998) 692-1900
##  2 AMERICAN    LOS ANGELES (994) 688-3259
##  3 INTERJET    GUADALAJARA (998) 692-1900
##  4 UNITED INTL HONG KONG   (994) 688-3259
##  5 AIR CANADA  TORONTO     (998) 692-1900
##  6 UNITED      NEWARK      (994) 688-3259
##  7 JETBLUE     BOSTON      (998) 692-1900
##  8 JETBLUE     LONG BEACH  (994) 688-3259
##  9 ALASKA      LAS VEGAS   (998) 692-1900
## 10 INTERJET    GUADALAJARA (994) 688-3259
## 11 QANTAS      SYDNEY      (998) 692-1900

Replacing and removing

In the last exercise, you saw that the phone column of sfo_survey is plagued with unnecessary parentheses and hyphens. The customer support team has requested that all phone numbers be in the format “123 456 7890”. In this exercise, you’ll use your new stringr skills to fulfill this request.

dplyr and stringr are loaded and sfo_survey is available.

EXERCISES.

Remove opening and closing parentheses from the phone column. Store this as a variable called phone_no_parens. Remember to use fixed()!

# Remove parentheses from phone column
phone_no_parens <- sfo_survey$phone %>%
  # Remove "("s
  str_remove_all(fixed("(")) %>%
  # Remove ")"s
  str_remove_all(fixed(")"))

Add a new column to sfo_survey called phone_no_parens that contains the contents of phone_no_parens.

# Remove parentheses from phone column
phone_no_parens <- sfo_survey$phone %>%
  # Remove "("s
  str_remove_all(fixed("(")) %>%
  # Remove ")"s
  str_remove_all(fixed(")"))
# Add phone_no_parens as column
sfo_survey %>%
  mutate(phone_no_parens = phone_no_parens) %>% top_n(n=10)
## Selecting by phone_no_parens
## # A tibble: 11 x 4
##    ...1       ...2                 phone          phone_no_parens
##    <chr>      <chr>                <chr>          <chr>          
##  1 UNITED     PHOENIX              (998) 692-1900 998 692-1900   
##  2 UNITED     CHICAGO-O'HARE       998-931-4783   998-931-4783   
##  3 INTERJET   GUADALAJARA          (998) 692-1900 998 692-1900   
##  4 ALASKA     WASHINGTON DC-DULLES 998-931-4783   998-931-4783   
##  5 AIR CANADA TORONTO              (998) 692-1900 998 692-1900   
##  6 AIR CANADA VANCOUVER            998-931-4783   998-931-4783   
##  7 JETBLUE    BOSTON               (998) 692-1900 998 692-1900   
##  8 ALASKA     PORTLAND             998-931-4783   998-931-4783   
##  9 ALASKA     LAS VEGAS            (998) 692-1900 998 692-1900   
## 10 SOUTHWEST  LAS VEGAS            998-931-4783   998-931-4783   
## 11 QANTAS     SYDNEY               (998) 692-1900 998 692-1900

Create a new column of sfo_survey called phone_clean containing the values of phone_no_parens with all hyphens replaced with spaces.

# Remove parentheses from phone column
phone_no_parens <- sfo_survey$phone %>%
  # Remove "("s
  str_remove_all(fixed("(")) %>%
  # Remove ")"s
  str_remove_all(fixed(")"))
# Add phone_no_parens as column
sfo_survey %>%
  mutate(phone_no_parens = phone_no_parens,
  # Replace all hyphens in phone_no_parens with spaces
         phone_clean = str_replace_all(phone_no_parens, "-", " ")) %>% top_n(n=10)
## Selecting by phone_clean
## # A tibble: 11 x 5
##    ...1       ...2                 phone          phone_no_parens phone_clean 
##    <chr>      <chr>                <chr>          <chr>           <chr>       
##  1 UNITED     PHOENIX              (998) 692-1900 998 692-1900    998 692 1900
##  2 UNITED     CHICAGO-O'HARE       998-931-4783   998-931-4783    998 931 4783
##  3 INTERJET   GUADALAJARA          (998) 692-1900 998 692-1900    998 692 1900
##  4 ALASKA     WASHINGTON DC-DULLES 998-931-4783   998-931-4783    998 931 4783
##  5 AIR CANADA TORONTO              (998) 692-1900 998 692-1900    998 692 1900
##  6 AIR CANADA VANCOUVER            998-931-4783   998-931-4783    998 931 4783
##  7 JETBLUE    BOSTON               (998) 692-1900 998 692-1900    998 692 1900
##  8 ALASKA     PORTLAND             998-931-4783   998-931-4783    998 931 4783
##  9 ALASKA     LAS VEGAS            (998) 692-1900 998 692-1900    998 692 1900
## 10 SOUTHWEST  LAS VEGAS            998-931-4783   998-931-4783    998 931 4783
## 11 QANTAS     SYDNEY               (998) 692-1900 998 692-1900    998 692 1900

Invalid phone numbers

The customer support team is grateful for your work so far, but during their first day of calling participants, they ran into some phone numbers that were invalid. In this exercise, you’ll remove any rows with invalid phone numbers so that these faulty numbers don’t keep slowing the team down.

dplyr and stringr are loaded and sfo_survey is available.

EXERCISES:

Examine the invalid phone numbers by filtering for numbers whose length is not equal to 12.

# Check out the invalid numbers
sfo_survey %>%
  filter(str_length(phone) != 12) %>% top_n(n=10)
## Selecting by phone
## # A tibble: 11 x 3
##    ...1       ...2                 phone         
##    <chr>      <chr>                <chr>         
##  1 UNITED     PHOENIX              (998) 692-1900
##  2 UNITED     BAKERSFIELD          0244-5        
##  3 INTERJET   GUADALAJARA          (998) 692-1900
##  4 COPA       PANAMA CITY          925 8846      
##  5 AIR CANADA TORONTO              (998) 692-1900
##  6 SOUTHWEST  PHOENIX              1623          
##  7 DELTA      MINNEAPOLIS-ST. PAUL 665-803       
##  8 JETBLUE    BOSTON               (998) 692-1900
##  9 ALASKA     LAS VEGAS            (998) 692-1900
## 10 ALASKA     SAN JOSE DEL CABO    38515         
## 11 QANTAS     SYDNEY               (998) 692-1900

Remove the rows with invalid numbers by filtering for numbers with a length of exactly 12.

# Remove rows with invalid numbers
sfo_survey %>%
   filter (str_length(phone) == 12) %>% top_n(n=10)
## Selecting by phone
## # A tibble: 11 x 3
##    ...1        ...2                 phone       
##    <chr>       <chr>                <chr>       
##  1 UNITED      LOS ANGELES          994 923 6634
##  2 UNITED      CHICAGO-O'HARE       998-931-4783
##  3 AIR CANADA  VANCOUVER            994 923 6634
##  4 ALASKA      WASHINGTON DC-DULLES 998-931-4783
##  5 UNITED      SAN DIEGO            994 923 6634
##  6 AIR CANADA  VANCOUVER            998-931-4783
##  7 AER LINGUS  DUBLIN               994 923 6634
##  8 ALASKA      PORTLAND             998-931-4783
##  9 UNITED INTL SHANGHAI             994 923 6634
## 10 SOUTHWEST   LAS VEGAS            998-931-4783
## 11 UNITED      BALTIMORE            994 923 6634

Advanced Data Problems

Uniformity

Date uniformity

In this chapter, you work at an asset management company and you’ll be working with the accounts dataset, which contains information about each customer, the amount in their account, and the date their account was opened. Your boss has asked you to calculate some summary statistics about the average value of each account and whether the age of the account is associated with a higher or lower account value. Before you can do this, you need to make sure that the accounts dataset you’ve been given doesn’t contain any uniformity problems. In this exercise, you’ll investigate the date_opened column and clean it up so that all the dates are in the same format.

dplyr and lubridate are loaded and accounts is available.

accounts <- readRDS("./Data/ch3_1_accounts.rds")
EXERCISES:

Take a look at the head of accounts to get a sense of the data you’re working with.

# Check out the accounts data frame
head(accounts)
##         id      date_opened    total
## 1 A880C79F       2003-10-19   169305
## 2 BE8222DF October 05, 2018   107460
## 3 19F9E113       2008-07-29 15297152
## 4 A2FE52A3       2005-06-09 14897272
## 5 F6DC2C08       2012-03-31   124568
## 6 D2E55799       2007-06-20 13635752

Convert the dates in the date_opened column to the same format using the formats vector and store this as a new column called date_opened_clean.

# Define the date formats
formats <- c("%Y-%m-%d", "%B %d, %Y")

# Convert dates to the same format
accounts %>%
  mutate(date_opened_clean =  parse_date_time(date_opened, formats)) %>% top_n(n=10)
## Warning: Problem with `mutate()` input `date_opened_clean`.
## i  36 failed to parse.
## i Input `date_opened_clean` is `parse_date_time(date_opened, formats)`.
## Warning: 36 failed to parse.
## Selecting by date_opened_clean
##          id date_opened    total date_opened_clean
## 1  3E97F253  2019-06-03 14515800        2019-06-03
## 2  2322DFB4  2018-04-07   189524        2018-04-07
## 3  645335B2  2018-11-16   154001        2018-11-16
## 4  A7BFAA72  2019-11-12   133790        2019-11-12
## 5  236A1D51  2019-10-01 18486936        2019-10-01
## 6  BE411172  2017-02-24 17689984        2017-02-24
## 7  14A2DDB7  2019-03-06 12808952        2019-03-06
## 8  305EEAA8  2018-09-01 14417728        2018-09-01
## 9  0E5B69F5  2018-05-07 18650632        2018-05-07
## 10 5275B518  2017-11-23    71665        2017-11-23

Currency uniformity

Now that your dates are in order, you’ll need to correct any unit differences. When you first plot the data, you’ll notice that there’s a group of very high values, and a group of relatively lower values. The bank has two different offices - one in New York, and one in Tokyo, so you suspect that the accounts managed by the Tokyo office are in Japanese yen instead of U.S. dollars. Luckily, you have a data frame called account_offices that indicates which office manages each customer’s account, so you can use this information to figure out which totals need to be converted from yen to dollars.

The formula to convert yen to dollars is USD = JPY / 104.

dplyr and ggplot2 are loaded and the accounts and account_offices data frames are available.

EXERCISES.

Create a scatter plot with date_opened on the x-axis and total on the y-axis.

  # Scatter plot of opening date and total amount
accounts %>%
  ggplot(aes(x = date_opened, y = total)) +
  geom_point()

Left join accounts and account_offices by their id columns.

# Scatter plot of opening date and total amount
accounts %>%
  ggplot(aes(x = date_opened, y = total)) +
  geom_point()

k <- read.csv("./Data/k.csv")
l <- read.csv("./Data/l.csv")
account_offices <- bind_cols(k,l)
## New names:
## * X -> X...1
## * x -> x...2
## * X -> X...3
## * x -> x...4
account_offices <- account_offices[,c(-1,-3)]
account_offices <- bind_cols(k,l)
## New names:
## * X -> X...1
## * x -> x...2
## * X -> X...3
## * x -> x...4
colnames(account_offices)[1] <- "id"
colnames(account_offices)[2] <- "office"
account_offices <- read.delim("./Data/account_offices.txt", sep = "")
head(account_offices)
##             id office
## A880C79F   New   York
## BE8222DF   New   York
## 19F9E113 Tokyo       
## A2FE52A3 Tokyo       
## F6DC2C08   New   York
## D2E55799 Tokyo
# Left join accounts and account_offices by id
accounts %>%
  left_join(account_offices, by = "id") %>% top_n(n=10)
## Selecting by office
## [1] id          date_opened total       office     
## <0 rows> (or 0-length row.names)

Create a scatter plot of your new uniform data using date_opened on the x-axis and total_usd on the y-axis.

# Scatter plot of opening date and total amount
accounts %>%
  ggplot(aes(x = date_opened, y = total)) +
  geom_point()

# Left join accounts to account_offices by id
accounts %>%
  left_join(account_offices, by = "id") %>%
  # Convert totals from the Tokyo office to USD
  mutate(total_usd = ifelse(office == "Tokyo", total / 104, total)) %>%
  # Scatter plot of opening date vs total_usd
  ggplot(aes(x = date_opened, y = total_usd)) +
    geom_point()

Completeness

Visualizing missing data

Dealing with missing data is one of the most common tasks in data science. There are a variety of types of missingness, as well as a variety of types of solutions to missing data.

You just received a new version of the accounts data frame containing data on the amount held and amount invested for new and existing customers. However, there are rows with missing inv_amount values.

You know for a fact that most customers below 25 do not have investment accounts yet, and suspect it could be driving the missingness. The dplyr and visdat packages have been loaded and accounts is available.

EXERCISES:

Visualize the missing values in accounts by column using a function from the visdat package.

# Visualize the missing values by column
library(visdat)
vis_miss(accounts)

Add a logical column to accounts called missing_inv that indicates whether each row is missing the inv_amount or not.

# Visualize the missing values by column
vis_miss(accounts)
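
The missing_inv step itself is not shown above; here is a sketch, assuming this new version of accounts contains the inv_amount and age columns described in the text:

accounts %>%
  # Flag rows with a missing inv_amount
  mutate(missing_inv = is.na(inv_amount)) %>%
  # Compare the average age of customers with and without inv_amount
  group_by(missing_inv) %>%
  summarize(avg_age = mean(age))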

Record Linkage

Comparing strings

Small distance, small difference

In the video exercise, you learned that there are multiple ways to calculate how similar or different two strings are. Now you’ll practice using the stringdist package to compute string distances using various methods. It’s important to be familiar with different methods, as some methods work better on certain datasets, while others work better on other datasets.

The stringdist package has been loaded for you.

EXERCISES:

Calculate the Damerau-Levenshtein distance between “las angelos” and “los angeles”.

library(stringdist)
# Calculate Damerau-Levenshtein distance
stringdist("las angelos", "los angeles", method = "dl")
## [1] 2

Calculate the Longest Common Substring (LCS) distance between “las angelos” and “los angeles”.

# Calculate LCS distance
stringdist("las angelos", "los angeles", method = "lcs")
## [1] 4

Calculate the Jaccard distance between “las angelos” and “los angeles”.

stringdist("las angelos", "los angeles", method = "jaccard")
## [1] 0

The Jaccard distance is 0 here because, with the default q = 1, the two strings contain exactly the same set of characters.

Fixing typos with string distance

In this chapter, one of the datasets you’ll be working with, zagat, is a set of restaurants in New York, Los Angeles, Atlanta, San Francisco, and Las Vegas. The data is from Zagat, a company that collects restaurant reviews, and includes the restaurant names, addresses, phone numbers, as well as other restaurant information.

The city column contains the name of the city that the restaurant is located in. However, there are a number of typos throughout the column. Your task is to map each city to one of the five correctly-spelled cities contained in the cities data frame.

dplyr and fuzzyjoin are loaded, and zagat and cities are available.

zagat <- readRDS("./Data/zagat.rds")

EXERCISES:

Count the number of each variation of city name in zagat.

# Count the number of each city variation
zagat %>%
  count(city)
##            city  n
## 1       atlanta 64
## 2   los angeles 72
## 3      new york 98
## 4     las vegas 26
## 5 san francisco 50

Left join zagat and cities based on string distance using the city and city_actual columns. Select the name, city, and city_actual columns.

cities <- readxl::read_xlsx("./Data/cities.xlsx")
head(cities)
## # A tibble: 5 x 1
##   city_actual  
##   <chr>        
## 1 atlanta      
## 2 los angeles  
## 3 las vegas    
## 4 new york     
## 5 san francisco
library(fuzzyjoin)
# Join zagat and cities and look at results
UI <- zagat %>%
  # Left join based on stringdist using city and city_actual cols
  stringdist_left_join(cities, by = c("city" = "city_actual")) %>%
  # Select the name, city, and city_actual cols
  select(name, city, city_actual) 
head(UI)
##              name        city city_actual
## 1   apple pan the los angeles los angeles
## 2     asahi ramen los angeles los angeles
## 3      baja fresh los angeles los angeles
## 4   belvedere the los angeles los angeles
## 5 benita's frites los angeles los angeles
## 6       bernard's los angeles los angeles

Generating and comparing pairs

Pair blocking

Zagat and Fodor’s are both companies that gather restaurant reviews. The zagat and fodors datasets both contain information about various restaurants, including addresses, phone numbers, and cuisine types. Some restaurants appear in both datasets, but don’t necessarily have the same exact name or phone number written down. In this chapter, you’ll work towards figuring out which restaurants appear in both datasets.

The first step towards this goal is to generate pairs of records so that you can compare them. In this exercise, you’ll first generate all possible pairs, and then use your newly-cleaned city column as a blocking variable.

zagat and fodors are available.

fodors <- readRDS("./Data/fodors.rds")
EXERCISES:

Load the reclin package and generate all possible pairs of records between zagat and fodors.

# Load reclin
library(reclin)

# Generate all possible pairs
pair_blocking(zagat, fodors) 
## Simple blocking
##   No blocking used.
##   First data set:  310 records
##   Second data set: 533 records
##   Total number of pairs: 165 230 pairs
## 
## ldat with 165 230 rows and 2 columns
##          x   y
## 1        1   1
## 2        2   1
## 3        3   1
## 4        4   1
## 5        5   1
## 6        6   1
## 7        7   1
## 8        8   1
## 9        9   1
## 10      10   1
## :        :   :
## 165221 301 533
## 165222 302 533
## 165223 303 533
## 165224 304 533
## 165225 305 533
## 165226 306 533
## 165227 307 533
## 165228 308 533
## 165229 309 533
## 165230 310 533

Use pair blocking to generate only pairs that have matching values in the city column.

# Load reclin
library(reclin)

# Generate pairs with same city
pair_blocking(zagat, fodors, blocking_var = "city")
## Simple blocking
##   Blocking variable(s): city
##   First data set:  310 records
##   Second data set: 533 records
##   Total number of pairs: 40 532 pairs
## 
## ldat with 40 532 rows and 2 columns
##         x   y
## 1       1   1
## 2       1   2
## 3       1   3
## 4       1   4
## 5       1   5
## 6       1   6
## 7       1   7
## 8       1   8
## 9       1   9
## 10      1  10
## :       :   :
## 40523 310 414
## 40524 310 415
## 40525 310 416
## 40526 310 417
## 40527 310 418
## 40528 310 419
## 40529 310 420
## 40530 310 421
## 40531 310 422
## 40532 310 423

Comparing pairs

Now that you’ve generated the pairs of restaurants, it’s time to compare them. You can easily customize how you perform your comparisons using the by and default_comparator arguments. There’s no right answer as to what each should be set to, so in this exercise, you’ll try a couple options out.

dplyr and reclin are loaded and zagat and fodors are available.

EXERCISES:

Compare pairs by name using lcs() distance.

# Generate pairs
ux <- pair_blocking(zagat, fodors, blocking_var = "city") %>%
  # Compare pairs by name using lcs()
  compare_pairs(by = "name",
      default_comparator = lcs())
head(ux)
## ldat with 6 rows and 3 columns
##   x y      name
## 1 1 1 0.3157895
## 2 1 2 0.3225806
## 3 1 3 0.2307692
## 4 1 4 0.2608696
## 5 1 5 0.4545455
## 6 1 6 0.2142857

Compare pairs by name, phone, and addr using jaro_winkler().

 # Generate pairs

 uil <- pair_blocking(zagat, fodors, blocking_var = "city") %>%
  # Compare pairs by name, phone, addr
  compare_pairs(by = c("name", "phone", "addr"),
      default_comparator = jaro_winkler())
head(uil)
## ldat with 6 rows and 5 columns
##   x y      name     phone      addr
## 1 1 1 0.4871062 0.6746032 0.5703661
## 2 1 2 0.5234025 0.5555556 0.6140351
## 3 1 3 0.4564103 0.7222222 0.5486355
## 4 1 4 0.5102564 0.6746032 0.6842105
## 5 1 5 0.5982906 0.5793651 0.5515351
## 6 1 6 0.3581197 0.6746032 0.4825911

Scoring and linking

Putting it together

During this chapter, you’ve cleaned up the city column of zagat using string similarity, as well as generated and compared pairs of restaurants from zagat and fodors. The end is near - all that’s left to do is score and select pairs and link the data together, and you’ll be able to begin your analysis in no time!

reclin and dplyr are loaded and zagat and fodors are available.

EXERCISES.

Score the pairs of records probabilistically.

# Create pairs
 um <-pair_blocking(zagat, fodors, blocking_var = "city") %>%
  # Compare pairs
  compare_pairs(by = "name", default_comparator = jaro_winkler()) %>%
  # Score pairs
  score_problink()
## Warning: `group_by_()` is deprecated as of dplyr 0.7.0.
## Please use `group_by()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
head(um)
## ldat with 6 rows and 4 columns
##   x y      name      weight
## 1 1 1 0.4871062 -0.01805476
## 2 1 2 0.5234025  0.03434922
## 3 1 3 0.4564103 -0.05877132
## 4 1 4 0.5102564  0.01479485
## 5 1 5 0.5982906  0.16049721
## 6 1 6 0.3581197 -0.17121520

Select the pairs that are considered matches.

# Create pairs
uk <- pair_blocking(zagat, fodors, blocking_var = "city") %>%
  # Compare pairs
  compare_pairs(by = "name", default_comparator = jaro_winkler()) %>%
  # Score pairs
  score_problink() %>%
  # Select pairs
  select_n_to_m()

head(uk)
## ldat with 6 rows and 5 columns
##   x y      name      weight select
## 1 1 1 0.4871062 -0.01805476  FALSE
## 2 1 2 0.5234025  0.03434922  FALSE
## 3 1 3 0.4564103 -0.05877132  FALSE
## 4 1 4 0.5102564  0.01479485  FALSE
## 5 1 5 0.5982906  0.16049721  FALSE
## 6 1 6 0.3581197 -0.17121520  FALSE

Link the two data frames together.

# Create pairs
uh <- pair_blocking(zagat, fodors, blocking_var = "city") %>%
  # Compare pairs
  compare_pairs(by = "name", default_comparator = jaro_winkler()) %>%
  # Score pairs
  score_problink() %>%
  # Select pairs
  select_n_to_m() %>%
  # Link data 
  link()

head(uh)
##   id.x          name.x                         addr.x      city.x      phone.x
## 1    0   apple pan the            10801 w. pico blvd. los angeles 310-475-3585
## 2    1     asahi ramen            2027 sawtelle blvd. los angeles 310-479-2231
## 3    2      baja fresh                3345 kimber dr. los angeles 805-498-4049
## 4    3   belvedere the 9882 little santa monica blvd. los angeles 310-788-2306
## 5    4 benita's frites       1433 third st. promenade los angeles 310-458-2889
## 6    5       bernard's               515 s. olive st. los angeles 213-612-1580
##             type.x class.x id.y                   name.y
## 1         american     534  124 california pizza kitchen
## 2     noodle shops     535  128                chan dara
## 3          mexican     536  121               ca  ` brea
## 4 pacific new wave     537  131                   dive !
## 5        fast food     538  149       louise's trattoria
## 6      continental     539  172             trader vic's
##                     addr.y      city.y      phone.y        type.y class.y
## 1       207 s. beverly dr. los angeles 310-275-1101   californian     121
## 2   310 n. larchmont blvd. los angeles 213-467-1052         asian     125
## 3      346 s. la brea ave. los angeles 213-938-2863       italian     118
## 4 10250 santa monica blvd. los angeles     310-788- dive american     128
## 5     4500 los feliz blvd. los angeles 213-667-0777       italian     146
## 6      9876 wilshire blvd. los angeles 310-276-6345         asian     169

11.- Working with Dates and Times in R

Dates and Times in R

Dates and Times in R

Specifying dates

As you saw in the video, R doesn’t know something is a date unless you tell it. If you have a character string that represents a date in the ISO 8601 standard you can turn it into a Date using the as.Date() function. Just pass the character string (or a vector of character strings) as the first argument.

In this exercise you’ll convert a character string representation of a date to a Date object.

EXERCISES.

We’ve stored the string “2013-04-03” in a variable called x.

# The date R 3.0.0 was released
x <- "2013-04-03"

Use str() to look at the structure of x and confirm it’s just a character string.

# Examine structure of x
str(x)
##  chr "2013-04-03"

Convert x to a date using as.Date().

# Use as.Date() to interpret x as a date
x_date <- as.Date(x)

Use str() to look at the structure of x_date and confirm it’s a Date.

# Examine structure of x_date
str(x_date)
##  Date[1:1], format: "2013-04-03"

Now use as.Date() to store the date April 10, 2014.

# Store April 10 2014 as a Date
april_10_2014 <- as.Date("2014-04-10")

Automatic import

Sometimes you’ll need to input a couple of dates by hand using as.Date() but it’s much more common to have a column of dates in a data file.

Some functions that read in data will automatically recognize and parse dates in a variety of formats. In particular the import functions, like read_csv(), in the readr package will recognize dates in a few common formats.

There is also the anytime() function in the anytime package whose sole goal is to automatically parse strings as dates regardless of the format.

Try them both out in this exercise.

EXERCISES:

Use read_csv() to read in the CSV file rversions.csv as releases.

# Load the readr package
library(readr)
# Import rversions (here a local .txt copy read with read.delim(), so the
# date column stays character; read_csv() would parse it as a Date)
releases <- read.delim("./Data/rversions.txt", sep = ",")
head(releases)
##   major minor patch       date             datetime            time  type
## 1     0    60    NA 1997-12-04 1997-12-04T08:47:58Z 08:47:58.000000 patch
## 2     0    61    NA 1997-12-21 1997-12-21T13:09:22Z 13:09:22.000000 minor
## 3     0    61     1 1998-01-10 1998-01-10T00:31:55Z 00:31:55.000000 patch
## 4     0    61     2 1998-03-14 1998-03-14T19:25:55Z 19:25:55.000000 patch
## 5     0    61     3 1998-05-02 1998-05-02T07:58:17Z 07:58:17.000000 patch
## 6     0    62    NA 1998-06-14 1998-06-14T12:56:20Z 12:56:20.000000 minor
# Examine the structure of the date column
str(releases$date)
##  chr [1:105] "1997-12-04" "1997-12-21" "1998-01-10" "1998-03-14" ...

Use str() to examine the structure of the date column. With read_csv() it would already be a Date object; because the file was read with read.delim() here, it is still a character vector.

# Load the anytime package
library(anytime)
# Various ways of writing Sep 10 2009
sep_10_2009 <- c("September 10 2009", "2009-09-10", "10 Sep 2009", "09-10-2009")

We’ve loaded anytime and created an object called sep_10_2009. Use the anytime() function to parse sep_10_2009.

# Use anytime() to parse sep_10_2009
anytime(sep_10_2009)
## [1] "2009-09-10 CEST" "2009-09-10 CEST" "2009-09-10 CEST" "2009-09-10 CEST"

Why use dates?

Plotting

If you plot a Date on the axis of a plot, you expect the dates to be in calendar order, and that’s exactly what happens with plot() or ggplot().

In this exercise you’ll make some plots with the R version releases data from the previous exercises using ggplot2. There are two big differences when a Date is on an axis:

If you specify limits they must be Date objects.

To control the behavior of the scale you use the scale_x_date() function.

Have a go in this exercise where you explore how often R releases occur.

EXERCISES.
library(ggplot2)

Make a plot of releases over time by setting the x argument of the aes() function to the date column.

# Set the x axis to the date column
ggplot(releases, aes(x = date, y = type)) +
  geom_line(aes(group = 1, color = factor(major)))

Zoom in to the period from 2010 to 2014 by specifying limits from “2010-01-01” to “2014-01-01”. Notice these strings need to be wrapped in as.Date() to be interpreted as Date objects.

# Limit the axis to between 2010-01-01 and 2014-01-01
# (commented out: date is still a character column here, so the Date limits
#  don't apply; convert with as.Date() first for this to run)
# ggplot(releases, aes(x = date, y = type)) +
#   geom_line(aes(group = 1, color = factor(major))) +
#   xlim(as.Date("2010-01-01"), as.Date("2014-01-01"))

Adjust the axis labeling by specifying date_breaks of “10 years” and date_labels of “%Y”.

# Specify breaks every ten years and labels with "%Y"
# (commented out for the same reason: scale_x_date() needs a Date column)
# ggplot(releases, aes(x = date, y = type)) +
#   geom_line(aes(group = 1, color = factor(major))) +
#   scale_x_date(date_breaks = "10 years", date_labels = "%Y")

Arithmetic and logical operators

Since Date objects are internally represented as the number of days since 1970-01-01 you can do basic math and comparisons with dates. You can compare dates with the usual logical operators (<, ==, > etc.), find extremes with min() and max(), and even subtract two dates to find out the time between them.
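For instance, here is a minimal sketch of these operations on two hand-picked release dates (illustrative values, not part of the exercise):

# Two release dates as Date objects
r_3_0_0 <- as.Date("2013-04-03")
r_3_1_0 <- as.Date("2014-04-10")
r_3_1_0 > r_3_0_0          # TRUE: the second date comes later
max(c(r_3_0_0, r_3_1_0))   # the more recent of the two: "2014-04-10"
r_3_1_0 - r_3_0_0          # Time difference of 372 days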

In this exercise you'll see how these operations work by exploring the last R release. You'll see Sys.Date() in the code; it simply returns today's date.

EXERCISES:

Find the date of the most recent release by calling max() on the date column in releases.

# Find the largest date
last_release_date <- max(releases$date)

Find the rows in releases that have the most recent date, by specifying the comparison date == last_release_date in filter()

# Filter row for last release
last_release <- filter(releases, date == last_release_date)

Print last_release to see which release this was.

# Print last_release
head(last_release)
##   major minor patch       date             datetime            time  type
## 1     3     4     1 2017-06-30 2017-06-30T07:04:11Z 07:04:11.824142 patch

Calculate how long it has been since the most recent release by subtracting last_release_date from Sys.Date().

# How long since last release?
# (date was read in as character here, so convert it first)
# Sys.Date() - as.Date(last_release_date)

What about times?

Getting datetimes into R

Just like dates without times, if you want R to recognize a string as a datetime you need to convert it, although now you use as.POSIXct(). as.POSIXct() expects strings to be in the format YYYY-MM-DD HH:MM:SS.

The only tricky thing is that times will be interpreted in local time based on your machine's setup. You can check your timezone with Sys.timezone(). If you want the time to be interpreted in a different timezone, you just set the tz argument of as.POSIXct(). You'll learn more about time zones in Chapter 4.

In this exercise you’ll input a couple of datetimes by hand and then see that read_csv() also handles datetimes automatically in a lot of cases.

EXERCISES:

Use as.POSIXct() and an appropriate string to input the datetime corresponding to Oct 1st 2010 at 12:12:00.

# Use as.POSIXct to enter the datetime 
as.POSIXct("2010-10-01 12:12:00")
## [1] "2010-10-01 12:12:00 CEST"

Enter the same datetime again, but now specify the timezone as “America/Los_Angeles”.

# Use as.POSIXct again but set the timezone to `"America/Los_Angeles"`
as.POSIXct("2010-10-01 12:12:00", tz = "America/Los_Angeles")
## [1] "2010-10-01 12:12:00 PDT"

Use read_csv() to read in rversions.csv again.

# Use read_csv to import rversions.csv
#releases <- read_csv("rversions.csv")

Examine the structure of the datetime column. With read_csv() it would be parsed as a datetime; because releases was imported with read.delim() here, it is still a character vector.

# Examine structure of datetime column
str(releases$datetime)
##  chr [1:105] "1997-12-04T08:47:58Z" "1997-12-21T13:09:22Z" ...

Datetimes behave nicely too

Just like Date objects, you can plot and do math with POSIXct objects.

As an example, in this exercise you’ll see how quickly people download new versions of R, by examining the download logs from the RStudio CRAN mirror.

R 3.2.0 was released at “2015-04-16 07:13:33” so cran-logs_2015-04-17.csv contains a random sample of downloads on the 16th, 17th and 18th.

EXERCISES:

Use read_csv() to import cran-logs_2015-04-17.csv.

# Import "cran-logs_2015-04-17.csv" with read_csv()
logs <- read_csv("./Data/cran-logs_2015-04-17.csv")
## 
## -- Column specification --------------------------------------------------------
## cols(
##   datetime = col_datetime(format = ""),
##   r_version = col_character(),
##   country = col_character()
## )

Print logs to see the information we have on each download.

# Print logs
head(logs)
## # A tibble: 6 x 3
##   datetime            r_version country
##   <dttm>              <chr>     <chr>  
## 1 2015-04-16 22:40:19 3.1.3     CO     
## 2 2015-04-16 09:11:04 3.1.3     GB     
## 3 2015-04-16 17:12:37 3.1.3     DE     
## 4 2015-04-18 12:34:43 3.2.0     GB     
## 5 2015-04-16 04:49:18 3.1.3     PE     
## 6 2015-04-16 06:40:44 3.1.3     TW

Store the R 3.2.0 release time as a POSIXct object.

# Store the release time as a POSIXct object
release_time <- as.POSIXct("2015-04-16 07:13:33", tz = "UTC")

Find out when the first request for 3.2.0 was made by filtering for values in the datetime column that are greater than release_time.

# When is the first download of 3.2.0?
logs %>% 
  filter(datetime > release_time,
    r_version == "3.2.0")
## # A tibble: 35,826 x 3
##    datetime            r_version country
##    <dttm>              <chr>     <chr>  
##  1 2015-04-18 12:34:43 3.2.0     GB     
##  2 2015-04-18 15:41:32 3.2.0     CA     
##  3 2015-04-18 14:58:41 3.2.0     IE     
##  4 2015-04-18 16:44:45 3.2.0     US     
##  5 2015-04-18 04:34:35 3.2.0     US     
##  6 2015-04-18 22:29:45 3.2.0     CH     
##  7 2015-04-17 16:21:06 3.2.0     US     
##  8 2015-04-18 20:34:57 3.2.0     AT     
##  9 2015-04-17 18:23:19 3.2.0     US     
## 10 2015-04-18 03:00:31 3.2.0     US     
## # ... with 35,816 more rows

Finally see how downloads increase by creating histograms of download time for 3.2.0 and the previous version 3.1.3. We’ve provided most of the code, you just need to specify the x aesthetic to be the datetime column.

# Examine histograms of downloads by version
ggplot(logs, aes(x = datetime)) +
  geom_histogram() +
  geom_vline(aes(xintercept = as.numeric(release_time)))+
  facet_wrap(~ r_version, ncol = 1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Parsing and Manipulating Dates and Times with lubridate

Parsing dates with lubridate

Selecting the right parsing function

lubridate provides a set of functions for parsing dates of a known order. For example, ymd() will parse dates with year first, followed by month and then day. The parsing is flexible; for example, it will parse the month whether it is numeric (e.g. 9 or 09), a full month name (e.g. September), or an abbreviated month name (e.g. Sep).

All the functions with y, m and d in any order exist. If your dates have times as well, you can use the functions that start with ymd, dmy, mdy or ydm and are followed by any of _h, _hm or _hms.

To see all the functions available look at ymd() for dates and ymd_hms() for datetimes.
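For example, two other members of the family (a small sketch with made-up strings, not part of the exercise):

library(lubridate)
ydm("2010-20-09")                # year, day, month: "2010-09-20"
dmy_hms("20/09/2010 15:30:00")   # day, month, year plus a time: "2010-09-20 15:30:00 UTC"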

Here are some challenges. In each case we’ve provided a date, your job is to choose the correct function to parse it.

EXERCISES.

For each date the ISO 8601 format is displayed as a comment after it, to help you check your work

Choose the correct function to parse x.

# Parse x 
x <- "2010 September 20th" # 2010-09-20
ymd(x)
## [1] "2010-09-20"

Choose the correct function to parse y.

# Parse y 
y <- "02.01.2010"  # 2010-01-02
dmy(y)
## [1] "2010-01-02"

Choose the correct function to parse z.

# Parse z 
z <- "Sep, 12th 2010 14:00"  # 2010-09-12T14:00
mdy_hm(z)
## [1] "2010-09-12 14:00:00 UTC"

Specifying an order with parse_date_time()

What about if you have something in a really weird order like dym_msh? There’s no named function just for that order, but that is where parse_date_time() comes in. parse_date_time() takes an additional argument, orders, where you can specify the order of the components in the date.

For example, to parse “2010 September 20th” you could say parse_date_time(“2010 September 20th”, orders = “ymd”) and that would be equivalent to using the ymd() function from the previous exercise.

One advantage of parse_date_time() is that you can use more format characters. For example, you can specify weekday names with A, I for 12 hour time, am/pm indicators with p and many others. You can see a whole list on the help page ?parse_date_time.

Another big advantage is that you can specify a vector of orders, and that allows parsing of dates where multiple formats might be used.

You’ll try it out in this exercise.

EXERCISES:

x is a trickier datetime. Use the clues in the instructions to parse x.

# Specify an order string to parse x
x <- "Monday June 1st 2010 at 4pm"
#parse_date_time(x, orders = "ABdyIp")

two_orders has two different orders, parse both by specifying the order to be c(“mdy”, “dmy”).

# Specify order to include both "mdy" and "dmy"
two_orders <- c("October 7, 2001", "October 13, 2002", "April 13, 2003", 
  "17 April 2005", "23 April 2017")
parse_date_time(two_orders, orders = c("mdy", "dmy"))
## [1] "2001-10-07 UTC" "2002-10-13 UTC" "2003-04-13 UTC" "2005-04-17 UTC"
## [5] "2017-04-23 UTC"

Parse short_dates with orders = c(“dOmY”, “OmY”, “Y”). What happens to the dates that don’t have months or days specified?

# Specify order to include "dOmY", "OmY" and "Y"
short_dates <- c("11 December 1282", "May 1372", "1253")
parse_date_time(short_dates, orders = c("dOmY", "OmY", "Y"))
## [1] "1282-12-11 UTC" "1372-05-01 UTC" "1253-01-01 UTC"

Weather in Auckland

Import daily weather data

In practice you won’t be parsing isolated dates and times, they’ll be part of a larger dataset. Throughout the chapter after you’ve mastered a skill with a simpler example (the release times of R for example), you’ll practice your lubridate skills in context by working with weather data from Auckland NZ.

There are two data sets: akl_weather_daily.csv, a set of once-daily summaries for 10 years, and akl_weather_hourly_2016.csv, observations every half hour for 2016. You'll import the daily data in this exercise and the hourly weather in the next exercise.

You’ll be using functions from dplyr, so if you are feeling rusty, you might want to review filter(), select() and mutate().

EXERCISES.
library(lubridate)
library(readr)
library(dplyr)
library(ggplot2)

Import the daily data, “akl_weather_daily.csv” with read_csv().

# Import CSV with read_csv()
akl_daily_raw <- read_csv("./Data/akl_weather_daily.csv")
## 
## -- Column specification --------------------------------------------------------
## cols(
##   date = col_character(),
##   max_temp = col_double(),
##   min_temp = col_double(),
##   mean_temp = col_double(),
##   mean_rh = col_double(),
##   events = col_character(),
##   cloud_cover = col_double()
## )

Print akl_daily_raw to confirm the date column hasn’t been interpreted as a date. Can you see why?

# Print akl_daily_raw
head(akl_daily_raw)
## # A tibble: 6 x 7
##   date     max_temp min_temp mean_temp mean_rh events cloud_cover
##   <chr>       <dbl>    <dbl>     <dbl>   <dbl> <chr>        <dbl>
## 1 2007-9-1       60       51        56      75 <NA>             4
## 2 2007-9-2       60       53        56      82 Rain             4
## 3 2007-9-3       57       51        54      78 <NA>             6
## 4 2007-9-4       64       50        57      80 Rain             6
## 5 2007-9-5       53       48        50      90 Rain             7
## 6 2007-9-6       57       42        50      69 <NA>             1

Using mutate() overwrite the column date with a parsed version of date. You need to specify the parsing function. Hint: the first date should be September 1.

# Parse date 
akl_daily <- akl_daily_raw %>%
  mutate(date = ymd(date))

Print akl_daily to verify the date column is now a Date.

# Print akl_daily
head(akl_daily)
## # A tibble: 6 x 7
##   date       max_temp min_temp mean_temp mean_rh events cloud_cover
##   <date>        <dbl>    <dbl>     <dbl>   <dbl> <chr>        <dbl>
## 1 2007-09-01       60       51        56      75 <NA>             4
## 2 2007-09-02       60       53        56      82 Rain             4
## 3 2007-09-03       57       51        54      78 <NA>             6
## 4 2007-09-04       64       50        57      80 Rain             6
## 5 2007-09-05       53       48        50      90 Rain             7
## 6 2007-09-06       57       42        50      69 <NA>             1

Take a look at the data by plotting date on the x-axis and max_temp on the y-axis.

# Plot to check work
ggplot(akl_daily, aes(x = date, y = max_temp)) +
  geom_line()
## Warning: Removed 1 row(s) containing missing values (geom_path).

Import hourly weather data

The hourly data is a little different. The date information is spread over three columns year, month and mday, so you’ll need to use make_date() to combine them.

Then the time information is in a separate column again, time. It’s quite common to find date and time split across different variables. One way to construct the datetimes is to paste the date and time together and then parse them. You’ll do that in this exercise.

EXERCISES:
library(lubridate)
library(readr)
library(dplyr)
library(ggplot2)

Import the hourly data, “akl_weather_hourly_2016.csv” with read_csv(), then print akl_hourly_raw to confirm the date is spread over year, month and mday.

# Import "akl_weather_hourly_2016.csv"
akl_hourly_raw <- read_csv("./Data/akl_weather_hourly_2016.csv")
## 
## -- Column specification --------------------------------------------------------
## cols(
##   year = col_double(),
##   month = col_double(),
##   mday = col_double(),
##   time = col_time(format = ""),
##   temperature = col_double(),
##   weather = col_character(),
##   conditions = col_character(),
##   events = col_character(),
##   humidity = col_double(),
##   date_utc = col_datetime(format = "")
## )
# Print akl_hourly_raw
head(akl_hourly_raw)
## # A tibble: 6 x 10
##    year month  mday time  temperature weather conditions events humidity
##   <dbl> <dbl> <dbl> <tim>       <dbl> <chr>   <chr>      <chr>     <dbl>
## 1  2016     1     1 00:00          68 Clear   Clear      <NA>         68
## 2  2016     1     1 00:30          68 Clear   Clear      <NA>         68
## 3  2016     1     1 01:00          68 Clear   Clear      <NA>         73
## 4  2016     1     1 01:30          68 Clear   Clear      <NA>         68
## 5  2016     1     1 02:00          68 Clear   Clear      <NA>         68
## 6  2016     1     1 02:30          68 Clear   Clear      <NA>         68
## # ... with 1 more variable: date_utc <dttm>

Using mutate(), create the column date using make_date().

# Use make_date() to combine year, month and mday 
akl_hourly  <- akl_hourly_raw  %>% 
  mutate(date = make_date(year = year, month = month, day = mday))

Paste together the date and time columns into datetime_string, then create datetime by parsing datetime_string with ymd_hms().

# Parse datetime_string 
akl_hourly <- akl_hourly  %>% 
  mutate(
    datetime_string = paste(date, time, sep = "T"),
    datetime = ymd_hms(datetime_string)
  )

Take a look at the date, time and datetime columns to verify they match up.

# Print date, time and datetime columns of akl_hourly
akl_hourly %>% select(date, time, datetime) %>% top_n(n=10)
## Selecting by datetime
## # A tibble: 10 x 3
##    date       time   datetime           
##    <date>     <time> <dttm>             
##  1 2016-12-31 18:30  2016-12-31 18:30:00
##  2 2016-12-31 19:00  2016-12-31 19:00:00
##  3 2016-12-31 19:30  2016-12-31 19:30:00
##  4 2016-12-31 20:00  2016-12-31 20:00:00
##  5 2016-12-31 20:30  2016-12-31 20:30:00
##  6 2016-12-31 21:00  2016-12-31 21:00:00
##  7 2016-12-31 21:30  2016-12-31 21:30:00
##  8 2016-12-31 22:00  2016-12-31 22:00:00
##  9 2016-12-31 22:30  2016-12-31 22:30:00
## 10 2016-12-31 23:30  2016-12-31 23:30:00

Take a look at the data by plotting datetime on the x-axis and temperature on the y-axis.

# Plot to check work
ggplot(akl_hourly, aes(x = datetime, y = temperature)) +
  geom_line()

Extracting parts of a datetime

What can you extract?

As you saw in the video, components of a datetime can be extracted by lubridate functions with the same name, like year(), month(), day(), hour(), minute() and second(). They all work the same way: just pass in a datetime or vector of datetimes.

There are also a few useful functions that return other aspects of a datetime, like whether it occurs in the morning (am()), during daylight saving (dst()), in a leap year (leap_year()), or which quarter() or semester() it occurs in.
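As a quick sketch of those extra helpers on an arbitrary datetime (not part of the exercise):

x <- ymd_hms("2016-02-29 09:00:00")
am(x)          # TRUE: before noon
leap_year(x)   # TRUE: 2016 is a leap year
quarter(x)     # 1: February is in the first quarter
semester(x)    # 1: and in the first half of the year
dst(x)         # FALSE: no daylight saving for a UTC datetime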

Try them out by exploring the release times of R versions using the data from Chapter 1.

EXERCISES:

We’ve put release_time, the datetime column of the releases dataset from Chapter 1, in your workspace.

release_time <- releases$datetime

Examine the head() of release_time to verify this is a vector of datetimes.

# Examine the head() of release_time
head(release_time)
## [1] "1997-12-04T08:47:58Z" "1997-12-21T13:09:22Z" "1998-01-10T00:31:55Z"
## [4] "1998-03-14T19:25:55Z" "1998-05-02T07:58:17Z" "1998-06-14T12:56:20Z"

Extract the month from release_time and examine the first few with head().

# Examine the head() of the months of release_time
head(month(release_time))
## [1] 12 12  1  3  5  6

To see which months have the most releases, extract the month then pipe to table().

# Extract the month of releases 
month(release_time) %>% table()
## .
##  1  2  3  4  5  6  7  8  9 10 11 12 
##  5  6  8 18  5 16  4  7  2 15  6 13

Repeat, to see which years have the most releases.

# Extract the year of releases
year(release_time) %>% table()
## .
## 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 
##    2   10    9    6    6    5    5    4    4    4    4    6    5    4    6    4 
## 2013 2014 2015 2016 2017 
##    4    4    5    5    3

Do releases happen in the morning (UTC)? Find out if the hour of a release is less than 12 and summarise with mean().

# How often is the hour before 12 (noon)?
mean(hour(release_time) < 12)
## [1] 1

Alternatively, use am() to find out how often releases happen in the morning. (Both results come out as 1 here because datetime was read in as a character column and the time of day is dropped when those strings are coerced; with a properly parsed datetime column the proportion would be less than 1.)

# How often is the release in am?
mean(am(release_time))
## [1] 1

Adding useful labels

In the previous exercise you found the month of releases:

head(month(release_time))

and received numeric months in return. Sometimes it's nicer (especially for plotting or tables) to have named months. Both the month() and wday() (day of the week) functions have additional arguments label and abbr to achieve just that. Set label = TRUE to have the output labelled with month (or weekday) names, and abbr = FALSE for those names to be written in full rather than abbreviated.

For example, try running:

head(month(release_time, label = TRUE, abbr = FALSE))

Practice by examining the popular days of the week for R releases.

EXERCISES.

releases is now a data frame with a column called datetime with the release time.

First, see what wday() does without labeling, by calling it on the datetime column of releases and tabulating the result. Do you know if 1 is Sunday or Monday?

library(ggplot2)

# Use wday() to tabulate release by day of the week
wday(releases$datetime) %>% table()
## .
##  1  2  3  4  5  6  7 
##  3 29  9 12 18 31  3

Repeat above, but now use labels by specifying the label argument. Better, right?

# Add label = TRUE to make table more readable
wday(releases$datetime, label = TRUE) %>% table()
## .
## do\\. lu\\. ma\\. mi\\. ju\\. vi\\. sá\\. 
##     3    29     9    12    18    31     3

Now store the labelled weekdays in a new column called wday.

# Create column wday to hold labelled week days
releases$wday <- wday(releases$datetime, label = TRUE)

Create a barchart of releases by weekday, faceted by the type of release.

# Plot barchart of weekday by type of release
ggplot(releases, aes(wday)) +
  geom_bar() +
  facet_wrap(~ type, ncol = 1, scales = "free_y")

Extracting for plotting

Extracting components from a datetime is particularly useful when exploring data. Earlier in the chapter you imported daily data for weather in Auckland, and created a time series plot of ten years of daily maximum temperature. While that plot gives you a good overview of the whole ten years, it’s hard to see the annual pattern.

In this exercise you’ll use components of the dates to help explore the pattern of maximum temperature over the year. The first step is to create some new columns to hold the extracted pieces, then you’ll use them in a couple of plots.

EXERCISES.

Use mutate() to create three new columns: year, yday and month that respectively hold the same components of the date column. Don’t forget to label the months with their names.

library(ggplot2)
library(dplyr)
library(ggridges)

# Add columns for year, yday and month
akl_daily <- akl_daily %>%
  mutate(
    year = year(date),
    yday = yday(date),
    month = month(date, label = TRUE))

Create a plot of yday on the x-axis and max_temp on the y-axis, where lines are grouped by year. Each year is a line on this plot, with the x-axis running from Jan 1 to Dec 31.

# Plot max_temp by yday for all years
ggplot(akl_daily, aes(x = yday, y = max_temp)) +
  geom_line(aes(group = year), alpha = 0.5)
## Warning: Removed 1 row(s) containing missing values (geom_path).

To take an alternate look, create a ridgeline plot (formerly known as a joyplot) with max_temp on the x-axis and month on the y-axis, using geom_density_ridges() from the ggridges package.

# Examine distribution of max_temp by month
ggplot(akl_daily, aes(x = max_temp, y = month, height = ..density..)) +
  geom_density_ridges(stat = "density")
## Warning: Removed 10 rows containing non-finite values (stat_density).

Extracting for filtering and summarizing

Another reason to extract components is to help with filtering observations or creating summaries. For example, if you are only interested in observations made on weekdays (i.e. not on weekends) you could extract the weekdays then filter out weekends, e.g. wday(date) %in% 2:6.
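For example, a minimal sketch of that weekday filter on a short run of dates (illustrative only):

one_week <- ymd("2016-01-01") + days(0:6)   # a Friday through the following Thursday
one_week[wday(one_week) %in% 2:6]           # keeps only the Monday-to-Friday dates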

In the last exercise you saw that January, February and March were great times to visit Auckland for warm temperatures, but will you need a raincoat?

In this exercise you’ll find out! You’ll use the hourly data to calculate how many days in each month there was any rain during the day.

EXERCISES.

Create new columns for the hour and month of the observation from datetime. Make sure you label the month.

# Create new columns hour, month and rainy
akl_hourly <- akl_hourly %>%
  mutate(
    hour = hour(datetime),
    month = month(datetime, label = TRUE),
    rainy = weather == "Precipitation"
  )

Filter to just daytime observations, where the hour is greater than or equal to 8 and less than or equal to 22.

# Filter for hours between 8am and 10pm (inclusive)
akl_day <- akl_hourly %>% 
  filter(hour >= 8, hour <= 22)

Group the observations first by month, then by date, and summarise by using any() on the rainy column. This results in one value per day.

# Summarise for each date if there is any rain
rainy_days <- akl_day %>% 
  group_by(month, date) %>%
  summarise(
    any_rain = any(rainy)
  )
## `summarise()` regrouping output by 'month' (override with `.groups` argument)

Summarise again by summing any_rain. This results in one value per month.

# Summarise for each month, the number of days with rain
rainy_days %>% 
  summarise(
    days_rainy = sum(any_rain)
  )
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 12 x 2
##    month days_rainy
##    <ord>      <int>
##  1 ene           15
##  2 feb           13
##  3 mar           12
##  4 abr           15
##  5 may           21
##  6 jun           19
##  7 jul           22
##  8 ago           16
##  9 sep           25
## 10 oct           20
## 11 nov           19
## 12 dic           11

Rounding datetimes

Practice rounding

As you saw in the video, round_date() rounds a date to the nearest value, floor_date() rounds down, and ceiling_date() rounds up.

All three take a unit argument which specifies the resolution of rounding. You can specify “second”, “minute”, “hour”, “day”, “week”, “month”, “bimonth”, “quarter”, “halfyear”, or “year”. Or, you can specify any multiple of those units, e.g. “5 years”, “3 minutes” etc.
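For instance, a small sketch using a multiple of a unit and a coarser unit (separate from the exercise below):

x <- ymd_hms("2016-05-03 07:13:28")
round_date(x, unit = "15 minutes")   # "2016-05-03 07:15:00 UTC"
floor_date(x, unit = "quarter")      # "2016-04-01 UTC"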

Try them out with the release datetime of R 3.4.1.

EXERCISES

Choose the right function and units to round r_3_4_1 down to the nearest day.

r_3_4_1 <- ymd_hms("2016-05-03 07:13:28 UTC")

# Round down to day
floor_date(r_3_4_1, unit = "day")
## [1] "2016-05-03 UTC"

Choose the right function and units to round r_3_4_1 to the nearest 5 minutes.

# Round to nearest 5 minutes
round_date(r_3_4_1, unit = "5 minutes")
## [1] "2016-05-03 07:15:00 UTC"

Choose the right function and units to round r_3_4_1 up to the nearest week.

# Round up to week 
ceiling_date(r_3_4_1, unit = "week")
## [1] "2016-05-08 UTC"

Find the time elapsed on the day of release at the time of release by subtracting r_3_4_1 rounded down to the day from r_3_4_1.

# Subtract r_3_4_1 rounded down to day
r_3_4_1 - floor_date(r_3_4_1, unit = "day")
## Time difference of 7.224444 hours

Rounding with the weather data

When is rounding useful? In a lot of the same situations where extracting date components is useful. The advantage of rounding over extracting is that it maintains the context of the unit. For example, extracting the hour gives you the hour the datetime occurred, but you lose the day that hour occurred on (unless you extract that too). Rounding to the nearest hour, on the other hand, maintains the day, month and year.
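A quick sketch of that difference on an arbitrary datetime:

x <- ymd_hms("2016-01-01 02:30:00")
hour(x)                        # 2: just the hour; the day it happened on is lost
floor_date(x, unit = "hour")   # "2016-01-01 02:00:00 UTC": the hour, still in context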

As an example you’ll explore how many observations per hour there really are in the hourly Auckland weather data.

Exercises:

Create a new column called day_hour that is datetime rounded down to the nearest hour.

# Create day_hour, datetime rounded down to hour
akl_hourly <- akl_hourly %>%
  mutate(
    day_hour = floor_date(datetime, unit = "hour")
  )

Use count() on day_hour to count how many observations there are in each hour. What looks like the most common value?

# Count observations per hour  
Count <- akl_hourly %>% 
  count(day_hour) 
head(Count)
## # A tibble: 6 x 2
##   day_hour                n
##   <dttm>              <int>
## 1 2016-01-01 00:00:00     2
## 2 2016-01-01 01:00:00     2
## 3 2016-01-01 02:00:00     2
## 4 2016-01-01 03:00:00     2
## 5 2016-01-01 04:00:00     2
## 6 2016-01-01 05:00:00     2

Extend the pipeline, so that after counting, you filter for observations where n is not equal to 2.

# Find day_hours with n != 2  
count <- akl_hourly %>% 
  count(day_hour) %>%
  filter(n != 2) %>% 
  arrange(desc(n)) %>% top_n(n=10)
## Selecting by n
head(count)
## # A tibble: 6 x 2
##   day_hour                n
##   <dttm>              <int>
## 1 2016-04-03 02:00:00     4
## 2 2016-09-25 00:00:00     4
## 3 2016-06-26 09:00:00     1
## 4 2016-09-01 23:00:00     1
## 5 2016-09-02 01:00:00     1
## 6 2016-09-04 11:00:00     1

Arithmetic with Dates and Times

Taking differences of datetimes

How long has it been?

To get finer control over a difference between datetimes use the base function difftime(). For example instead of time1 - time2, you use difftime(time1, time2).

difftime() takes an argument units which specifies the units for the difference. Your options are “secs”, “mins”, “hours”, “days”, or “weeks”.

To practice you’ll find the time since the first man stepped on the moon. You’ll also see the lubridate functions today() and now() which when called with no arguments return the current date and time in your system’s timezone.

EXERCISES

Apollo 11 landed on July 20, 1969. Use difftime() to find the number of days between today() and date_landing.

# The date of landing and moment of step
date_landing <- mdy("July 20, 1969")
moment_step <- mdy_hms("July 20, 1969, 02:56:15", tz = "UTC")
# How many days since the first man on the moon?
difftime(today(), date_landing, units = "days")
## Time difference of 18823 days

Neil Armstrong stepped onto the surface at 02:56:15 UTC. Use difftime() to find the number of seconds between now() and moment_step.

# How many seconds since the first man on the moon?
difftime(now(), moment_step, units = "secs")
## Time difference of 1626335541 secs

How many seconds are in a day?

How many seconds are in a day? There are 24 hours in a day, 60 minutes in an hour, and 60 seconds in a minute, so there should be 24 * 60 * 60 = 86400 seconds, right?

Not always! In this exercise you'll see a counterexample; can you figure out what is going on?

EXERCISES:

We’ve put code to define three times in your script - noon on March 11th, March 12th, and March 13th in 2017 in the US Pacific timezone.

# Three dates
mar_11 <- ymd_hms("2017-03-11 12:00:00", 
  tz = "America/Los_Angeles")
mar_12 <- ymd_hms("2017-03-12 12:00:00", 
  tz = "America/Los_Angeles")
mar_13 <- ymd_hms("2017-03-13 12:00:00", 
  tz = "America/Los_Angeles")

Find the difference in time between mar_13 and mar_12 in seconds. This should match your intuition.

# Difference between mar_13 and mar_12 in seconds
difftime(mar_13, mar_12, units = "secs")
## Time difference of 86400 secs

Now, find the difference in time between mar_12 and mar_11 in seconds. Surprised?

# Difference between mar_12 and mar_11 in seconds
difftime(mar_12, mar_11, units = "secs")
## Time difference of 82800 secs

Time spans.

Adding or subtracting a time span to a datetime

A common use of time spans is to add or subtract them from a moment in time. For example, to calculate the time one day in the future from mar_11 (from the previous exercises), you could do either of:

mar_11 + days(1)
mar_11 + ddays(1)

Try them in the console: you get different results! But which one is the right one? It depends on your intent. If you want to account for the fact that time units, in this case days, can have different lengths (e.g. due to daylight saving), you want a period, days(). If you want the time exactly 86400 seconds in the future, you use a duration, ddays().
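A minimal sketch of the difference, reusing the US Pacific datetime from the previous exercise (the clocks spring forward overnight, so a calendar day and 86400 seconds disagree):

mar_11 <- ymd_hms("2017-03-11 12:00:00", tz = "America/Los_Angeles")
mar_11 + days(1)    # "2017-03-12 12:00:00 PDT": same clock time, one calendar day later
mar_11 + ddays(1)   # "2017-03-12 13:00:00 PDT": exactly 86400 seconds later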

In this exercise you’ll add and subtract timespans from dates and datetimes.

EXERCISES:

It’s Monday Aug 27th 2018 at 2pm and you want to remind yourself this time next week to send an email. Add a period of one week to mon_2pm.

# Add a period of one week to mon_2pm
mon_2pm <- dmy_hm("27 Aug 2018 14:00")
mon_2pm + weeks(1)
## [1] "2018-09-03 14:00:00 UTC"

It’s Tuesday Aug 28th 2018 at 9am and you are starting some code that usually takes about 81 hours to run. When will it finish? Add a duration of 81 hours to tue_9am.

# Add a duration of 81 hours to tue_9am
tue_9am <- dmy_hm("28 Aug 2018 9:00")
tue_9am + hours(81)
## [1] "2018-08-31 18:00:00 UTC"

What were you doing five years ago? Subtract a period of 5 years from today().

# Subtract a period of five years from today()
today() - years(5)
## [1] "2016-01-31"

Subtract a duration of 5 years from today(). Will this give a different date?

# Subtract a duration of five years from today()
today() - dyears(5)
## [1] "2016-01-31 18:00:00 UTC"

Arithmetic with timespans

You can add and subtract timespans to create different length timespans, and even multiply them by numbers. For example, to create a duration of three days and three hours you could do: ddays(3) + dhours(3), or 3 * ddays(1) + 3 * dhours(1), or even 3 * (ddays(1) + dhours(1)).

There was an eclipse over North America on 2017-08-21 at 18:26:40. It’s possible to predict the next eclipse with similar geometry by calculating the time and date one Saros in the future. A Saros is a length of time that corresponds to 223 Synodic months, a Synodic month being the period of the Moon’s phases, a duration of 29 days, 12 hours, 44 minutes and 3 seconds.

Do just that in this exercise!

EXERCISES:

Create a duration corresponding to one Synodic Month: 29 days, 12 hours, 44 minutes and 3 seconds.

# Time of North American Eclipse 2017
eclipse_2017 <- ymd_hms("2017-08-21 18:26:40")
# Duration of 29 days, 12 hours, 44 mins and 3 secs
synodic <- ddays(29) + dhours(12) + dminutes(44) + dseconds(3)

Create a duration corresponding to one Saros by multiplying synodic by 223.

# 223 synodic months
saros <- 223*synodic

Add saros to eclipse_2017 to predict the next eclipse.

# Add saros to eclipse_2017
eclipse_2017 + saros
## [1] "2035-09-02 02:09:49 UTC"

Generating sequences of datetimes

By combining addition and multiplication with sequences you can generate sequences of datetimes. For example, you can generate a sequence of periods from 1 day up to 10 days with:

1:10 * days(1)

Then, by adding this sequence to a specific datetime, you can construct a sequence of datetimes from 1 day up to 10 days into the future:

today() + 1:10 * days(1)

You had a meeting this morning at 8am and you'd like to have that meeting at the same time and day every two weeks for a year. Generate the sequence of datetimes for those meetings.

EXERCISES:

Create today_8am by adding a period of 8 hours to today().

# Add a period of 8 hours to today
today_8am <- today() + hours(8) 
# Sequence of two weeks from 1 to 26
every_two_weeks <- 1:26 * weeks(2)

Create a sequence of periods from one period of two weeks, up to 26 periods of two weeks. Add every_two_weeks to today_8am.

# Create datetime for every two weeks for a year
today_8am + every_two_weeks
##  [1] "2021-02-14 08:00:00 UTC" "2021-02-28 08:00:00 UTC"
##  [3] "2021-03-14 08:00:00 UTC" "2021-03-28 08:00:00 UTC"
##  [5] "2021-04-11 08:00:00 UTC" "2021-04-25 08:00:00 UTC"
##  [7] "2021-05-09 08:00:00 UTC" "2021-05-23 08:00:00 UTC"
##  [9] "2021-06-06 08:00:00 UTC" "2021-06-20 08:00:00 UTC"
## [11] "2021-07-04 08:00:00 UTC" "2021-07-18 08:00:00 UTC"
## [13] "2021-08-01 08:00:00 UTC" "2021-08-15 08:00:00 UTC"
## [15] "2021-08-29 08:00:00 UTC" "2021-09-12 08:00:00 UTC"
## [17] "2021-09-26 08:00:00 UTC" "2021-10-10 08:00:00 UTC"
## [19] "2021-10-24 08:00:00 UTC" "2021-11-07 08:00:00 UTC"
## [21] "2021-11-21 08:00:00 UTC" "2021-12-05 08:00:00 UTC"
## [23] "2021-12-19 08:00:00 UTC" "2022-01-02 08:00:00 UTC"
## [25] "2022-01-16 08:00:00 UTC" "2022-01-30 08:00:00 UTC"

Problems in practice

Time zones

Setting the timezone

If you import a datetime and it has the wrong timezone, you can set it with force_tz(). Pass in the datetime as the first argument and the appropriate timezone to the tzone argument. Remember the timezone needs to be one from OlsonNames().

I wanted to watch New Zealand in the Women’s World Cup Soccer games in 2015, but the times listed on the FIFA website were all in times local to the venues. In this exercise you’ll help me set the timezones, then in the next exercise you’ll help me figure out what time I needed to tune in to watch them.

EXERCISES:

I’ve put the times as listed on the FIFA website for games 2 and 3 in the group stage for New Zealand in your code.

Game 2 was played in Edmonton. Use force_tz() to set the timezone of game 2 to “America/Edmonton”.

# Game2: CAN vs NZL in Edmonton
game2 <- mdy_hm("June 11 2015 19:00")

Game 3 was played in Winnipeg. Use force_tz() to set the timezone of game 3 to “America/Winnipeg”

# Game3: CHN vs NZL in Winnipeg
game3 <- mdy_hm("June 15 2015 18:30")

Find out how long the team had to rest between the two games, by using as.period() on the interval between game2_local and game3_local.

# Set the timezone to "America/Edmonton"
game2_local <- force_tz(game2, tzone = "America/Edmonton")
game2_local
## [1] "2015-06-11 19:00:00 MDT"
# Set the timezone to "America/Winnipeg"
game3_local <- force_tz(game3, tzone = "America/Winnipeg")
game3_local
## [1] "2015-06-15 18:30:00 CDT"
# How long does the team have to rest?
as.period(game2_local %--% game3_local)
## [1] "3d 22H 30M 0S"

Viewing in a timezone

To view a datetime in another timezone use with_tz(). The syntax of with_tz() is the same as force_tz(): pass a datetime and set the tzone argument to the desired timezone. Unlike force_tz(), with_tz() doesn't change the underlying moment in time, just how it is displayed.

For example, the difference between now() displayed in the "America/Los_Angeles" timezone and the "Pacific/Auckland" timezone is 0:

now <- now()
with_tz(now, "America/Los_Angeles") - with_tz(now, "Pacific/Auckland")

Help me figure out when to tune in to the games from the previous exercise.
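A small sketch of the contrast between the two functions (arbitrary UTC datetime):

x <- ymd_hms("2015-06-11 19:00:00", tz = "UTC")
force_tz(x, tzone = "America/Edmonton")   # "2015-06-11 19:00:00 MDT": a new moment, same clock time
with_tz(x, tzone = "America/Edmonton")    # "2015-06-11 13:00:00 MDT": the same moment, new clock time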

EXERCISES:

Most fans will tune in from New Zealand. Use with_tz() to display game2_local in New Zealand time. New Zealand is in the “Pacific/Auckland” timezone.

# What time is game2_local in NZ?
with_tz(game2_local, tzone = "Pacific/Auckland")
## [1] "2015-06-12 13:00:00 NZST"

I’ll be in Corvallis, Oregon. Use with_tz() to display game2_local my time. Corvallis is in the “America/Los_Angeles” timezone.

# What time is game2_local in Corvallis, Oregon?
with_tz(game2_local, tzone = "America/Los_Angeles")
## [1] "2015-06-11 18:00:00 PDT"

Finally, use with_tz() to display game3_local in New Zealand time.

# What time is game3_local in NZ?
with_tz(game3_local, tzone = "Pacific/Auckland")
## [1] "2015-06-16 11:30:00 NZST"

Timezones in the weather data

Did you ever notice that in the hourly Auckland weather data there was another datetime column, date_utc? Take a look:

tibble::glimpse(akl_hourly)

The datetime column you created represents local time in Auckland, NZ. I suspect this additional column, date_utc, represents the observation time in UTC (the name seems a big clue). But does it really?

Use your new timezone skills to find out.

EXERCISES:

The data is available in the akl_hourly data frame.

What timezone are datetime and date_utc currently in? Examine the head of the datetime and date_utc columns to find out.

# Examine datetime and date_utc columns
head(akl_hourly$datetime)
## [1] "2016-01-01 00:00:00 UTC" "2016-01-01 00:30:00 UTC"
## [3] "2016-01-01 01:00:00 UTC" "2016-01-01 01:30:00 UTC"
## [5] "2016-01-01 02:00:00 UTC" "2016-01-01 02:30:00 UTC"
head(akl_hourly$date_utc)
## [1] "2015-12-31 11:00:00 UTC" "2015-12-31 11:30:00 UTC"
## [3] "2015-12-31 12:00:00 UTC" "2015-12-31 12:30:00 UTC"
## [5] "2015-12-31 13:00:00 UTC" "2015-12-31 13:30:00 UTC"

Fix datetime to have the timezone for “Pacific/Auckland”.

# Force datetime to Pacific/Auckland
akl_hourly <- akl_hourly %>%
  mutate(
    datetime = force_tz(datetime, tzone = "Pacific/Auckland"))

Reexamine the head of the datetime column to check the times have the same clocktime, but are now in the right timezone.

# Reexamine datetime
head(akl_hourly$datetime)
## [1] "2016-01-01 00:00:00 NZDT" "2016-01-01 00:30:00 NZDT"
## [3] "2016-01-01 01:00:00 NZDT" "2016-01-01 01:30:00 NZDT"
## [5] "2016-01-01 02:00:00 NZDT" "2016-01-01 02:30:00 NZDT"

Now tabulate the difference between the datetime and date_utc columns. It should be zero if our hypothesis was correct.

# Are datetime and date_utc the same moments
table(akl_hourly$datetime - akl_hourly$date_utc)
## 
## -82800      0   3600 
##      2  17450      2

Times without dates

For this entire course, if you’ve ever had a time, it’s always had an accompanying date, i.e. a datetime. But sometimes you just have a time without a date.

If you find yourself in this situation, the hms package provides an hms class of object for holding times without dates, and the best place to start would be with as.hms().

In fact, you’ve already seen an object of the hms class, but I didn’t point it out to you. Take a look in this exercise.

EXERCISES:

Earlier you used read_csv() to read in "akl_weather_hourly_2016.csv". readr knows about the hms class, so when it comes across something that looks like a time it will use it. In this case the time column has been parsed as a time without a date. Examine the structure of the time column to verify it has the class hms.

# Examine structure of time column
str(akl_hourly$time)
##  'hms' num [1:17454] 00:00:00 00:30:00 01:00:00 01:30:00 ...
##  - attr(*, "units")= chr "secs"

hms objects print like times should. Take a look by examining the head of the time column.

# Examine head of time column
head(akl_hourly$time)
## 00:00:00
## 00:30:00
## 01:00:00
## 01:30:00
## 02:00:00
## 02:30:00

You can use hms objects in plots too. Create a plot with time on the x-axis and temperature on the y-axis, with lines grouped by date.

# A plot using just time
ggplot(akl_hourly, aes(x = time, y = temperature)) +
  geom_line(aes(group = make_date(year, month, mday)), alpha = 0.2)

More on importing and exporting datetimes

Fast parsing with fasttime

The fasttime package provides a single function fastPOSIXct(), designed to read in datetimes formatted according to ISO 8601. Because it only reads in one format, and doesn’t have to guess a format, it is really fast!

You'll see how fast in this exercise by comparing how fast it reads in the dates from the Auckland hourly weather data (over 17,000 dates) to lubridate's ymd_hms().

To compare run times you'll use the microbenchmark() function from the package of the same name. You pass in as many arguments as you want, each being an expression to time.

EXERCISES:
library(microbenchmark)
library(fasttime)

In the original exercise the datetimes from the Auckland hourly data are provided as strings in a vector called dates. Here we take them from the date_utc column instead, which read_csv() has already parsed as POSIXct.

# Examine structure of dates
dates <- akl_hourly$date_utc
str(dates)
##  POSIXct[1:17454], format: "2015-12-31 11:00:00" "2015-12-31 11:30:00" "2015-12-31 12:00:00" ...

Examine the structure of dates. In the original exercise it is a character vector in ISO 8601 format; here str() shows it is already POSIXct.

# Use fastPOSIXct() to parse dates
fastPOSIXct(dates) %>% str()
##  POSIXct[1:17454], format: "2015-12-31 12:00:00" "2015-12-31 12:30:00" "2015-12-31 13:00:00" ...

Parse dates with fasttime and pipe to str() to check the result. (The values above are shifted by an hour because dates is already POSIXct rather than the character vector the exercise expects.) Now, to compare timing, call microbenchmark() where the first argument uses ymd_hms() to parse dates and the second uses fastPOSIXct().

# Compare speed of fastPOSIXct() to ymd_hms()
microbenchmark(
  ymd_hms = ymd_hms(dates),
  fasttime = fastPOSIXct(dates),
  times = 20)
## Unit: milliseconds
##      expr      min       lq     mean   median       uq      max neval
##   ymd_hms 162.9936 170.6140 184.9791 179.3906 197.4423 230.9287    20
##  fasttime 125.3810 131.3792 138.1465 137.1554 139.7645 175.1961    20

Fast parsing with lubridate::fast_strptime

lubridate provides its own fast datetime parser: fast_strptime(). Instead of taking an orders argument like parse_date_time(), it takes a format argument, and the format must comply with the strptime() style.

As you saw in the video, that means any character that represents a datetime component must be prefixed with a %, and any non-whitespace characters must be explicitly included.
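For instance, a minimal runnable sketch on a single hand-written ISO 8601 string, using the same format string as the commented code further down:

fast_strptime("2015-12-31T11:00:00Z", format = "%Y-%m-%dT%H:%M:%SZ")
# "2015-12-31 11:00:00 UTC" (returned as a POSIXlt)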

Try parsing dates with fast_strptime() and then compare its speed to the other methods you’ve seen.

EXERCISES:

dates is in your workspace again.

# Head of dates
head(dates)
## [1] "2015-12-31 11:00:00 UTC" "2015-12-31 11:30:00 UTC"
## [3] "2015-12-31 12:00:00 UTC" "2015-12-31 12:30:00 UTC"
## [5] "2015-12-31 13:00:00 UTC" "2015-12-31 13:30:00 UTC"

Examine the head of dates. What components are present? What separators are used?

# Parse dates with fast_strptime
# (commented out: dates is already POSIXct here, not the "YYYY-MM-DDTHH:MM:SSZ"
#  strings the exercise expects, so this format string would not match)
# fast_strptime(dates,
#   format = "%Y-%m-%dT%H:%M:%SZ") %>% str()

Parse dates with fast_strptime() by specifying the appropriate format string. Compare the timing of fast_strptime() to fasttime and ymd_hms().

# Compare speed to ymd_hms() and fasttime
# microbenchmark(
#   ymd_hms = ymd_hms(dates),
#   fasttime = fastPOSIXct(dates),
#   fast_strptime = fast_strptime(dates,
#     format = "%Y-%m-%dT%H:%M:%SZ"),
#   times = 20)

Outputting pretty dates and times

An easy way to output dates is to use the stamp() function in lubridate. stamp() takes a string which should be an example of how the date should be formatted, and returns a function that can be used to format dates.

In this exercise you’ll practice outputting today() in a nice way.

EXERCISES:

Create a stamp() based on the string “Saturday, Jan 1, 2000”.

# Create a stamp based on "Saturday, Jan 1, 2000"
date_stamp <- stamp("Saturday, Jan 1, 2000")
## Multiple formats matched: "Saturday, Jan %Om, %Y"(1), "Saturday, %Om %d, %Y"(1), "Saturday, Jan %m, %Y"(1)
## Using: "Saturday, %Om %d, %Y"

Print date_stamp. Notice it is a function.

# Print date_stamp
head(date_stamp)
##                                                             
## 1 function (x, locale = "Spanish_Spain.1252")               
## 2 {                                                         
## 3     {                                                     
## 4         old_lc_time <- Sys.getlocale("LC_TIME")           
## 5         if (old_lc_time != locale) {                      
## 6             on.exit(Sys.setlocale("LC_TIME", old_lc_time))

Pass today() to date_stamp to format today’s date.

# Call date_stamp on today()
date_stamp(today())
## [1] "Saturday, 01 31, 2021"

Now output today’s date in American style MM/DD/YYYY.

# Create and call a stamp based on "12/31/1999"
stamp("12/31/1999")(today())
## Multiple formats matched: "%Om/%d/%Y"(1), "%m/%d/%Y"(1)
## Using: "%Om/%d/%Y"
## [1] "01/31/2021"

12.- Introduction to Writing Functions in R

How to write a function

Calling functions

One way to make your code more readable is to be careful about the order you pass arguments when you call functions, and whether you pass the arguments by position or by name.

gold_medals, a numeric vector of the number of gold medals won by each country in the 2016 Summer Olympics, is provided.

For convenience, the arguments of median() and rank() are displayed using args(). Setting rank()’s na.last argument to “keep” means “keep the rank of NA values as NA”.

Best practice for calling functions is to include them in the order shown by args(), and to only name rare arguments.

gold_medals <- read.csv("./Data/gold_medals.csv", sep = ";")

EXERCISES:

Look at the gold medals data and note the arguments to median().

# Look at the gold medals data

head(gold_medals)
##   ï..USA GBR CH RUS GER JPN
## 1     46  27 26  19  17  12
# The first column name was garbled by a byte-order mark on import; rename it
colnames(gold_medals)[1] <- "USA"
# Note the arguments to median()
args(median)
## function (x, na.rm = FALSE, ...) 
## NULL

Rewrite the call to rank(), following best practices.

# Note the arguments to rank()
args(rank)
## function (x, na.last = TRUE, ties.method = c("average", "first", 
##     "last", "random", "max", "min")) 
## NULL
# Rewrite this function call, following best practices
rank(-gold_medals, na.last = "keep", ties.method = "min")
## USA GBR  CH RUS GER JPN 
##   1   2   3   4   5   6

Converting scripts into functions

Your first function: tossing a coin

Time to write your first function! It’s a really good idea when writing functions to start simple. You can always make a function more complicated later if it’s really necessary, so let’s not worry about arguments for now.

EXERCISES:

Simulate a single coin toss by using sample() to sample from coin_sides once.

coin_sides <- c("head", "tail")

# Sample from coin_sides once
sample(coin_sides, size = 1)
## [1] "tail"

Write a template for your function, naming it toss_coin. The function should take no arguments. Don’t include the body of the function yet.

# Write a template for your function, toss_coin()
toss_coin <- function() {
  # (Leave the contents of the body for later)
# Add punctuation to finish the body
} 

Copy your script, and paste it into the function body.

# Your script, from a previous step
coin_sides <- c("head", "tail")
sample(coin_sides, 1)
## [1] "head"
# Paste your script into the function body
toss_coin <- function() {
  sample(coin_sides, 1)
  
}

Call your function.

# Your functions, from previous steps
toss_coin <- function() {
  coin_sides <- c("head", "tail")
  sample(coin_sides, 1)
}
# Call your function
toss_coin()
## [1] "tail"

Inputs to functions

Most functions require some sort of input to determine what to compute. The inputs to functions are called arguments. You specify them inside the parentheses after the word “function.”

As mentioned in the video, the following exercises assume that you are using sample() to do random sampling.

EXERCISES:

Sample from coin_sides n_flips times with replacement.

coin_sides <- c("head", "tail")
n_flips <- 10
# Sample from coin_sides n_flips times with replacement
sample(coin_sides, n_flips, replace = TRUE)
##  [1] "tail" "tail" "head" "tail" "tail" "head" "tail" "head" "head" "tail"

Update the definition of toss_coin() to accept a single argument, n_flips. The function should sample coin_sides n_flips times with replacement. Remember to change the signature and the body.

# Update the function to return n_flips coin tosses
toss_coin <- function(n_flips) {
  coin_sides <- c("head", "tail")
  sample(coin_sides, n_flips, replace = TRUE)
}

Generate 10 coin flips.

# Generate 10 coin tosses
toss_coin(10)
##  [1] "tail" "tail" "head" "head" "tail" "head" "head" "head" "tail" "tail"

Multiple inputs to functions

If a function should have more than one argument, list them in the function signature, separated by commas.

To solve this exercise, you need to know how to specify sampling weights to sample(). Set the prob argument to a numeric vector with the same length as x. Each value of prob is the probability of sampling the corresponding element of x, so their values add up to one. In the following example, each sample has a 20% chance of “bat”, a 30% chance of “cat” and a 50% chance of “rat”.

sample(c(“bat”, “cat”, “rat”), 10, replace = TRUE, prob = c(0.2, 0.3, 0.5))

EXERCISE:

Update the definition of toss_coin() so it accepts an argument, p_head, and weights the samples using the code you wrote in the previous step.

# Update the function so heads have probability p_head
toss_coin <- function(n_flips, p_head) {
  coin_sides <- c("head", "tail")
  # Define a vector of weights
  weights <- c(p_head, 1 - p_head)
  # Modify the sampling to be weighted 
  sample(coin_sides, n_flips, replace = TRUE, prob = weights)
}

Generate 10 coin tosses with an 80% chance of heads on each toss.

toss_coin(10, p_head = 0.8)
##  [1] "tail" "head" "head" "head" "head" "head" "head" "head" "head" "head"

Y kant I reed ur code?

Renaming GLM

R’s generalized linear regression function, glm(), suffers the same usability problems as lm(): its name is an acronym, and its formula and data arguments are in the wrong order.

To solve this exercise, you need to know two things about generalized linear regression:

glm() formulas are specified like lm() formulas: response is on the left, and explanatory variables are added on the right. To model count data, set glm()’s family argument to poisson, making it a Poisson regression. Here you’ll use data on the number of yearly visits to Snake River at Jackson Hole, Wyoming, snake_river_visits.

snake_river_visits <- readRDS("./Data/snake_river_visits.rds")

EXERCISE:

Run a generalized linear regression by calling glm(). Model n_visits vs. gender, income, and travel on the snake_river_visits dataset, setting the family to poisson.

# Run a generalized linear regression 
glm(
  # Model no. of visits vs. gender, income, travel
  n_visits ~ gender + income + travel, 
  # Use the snake_river_visits dataset
  data = snake_river_visits, 
  # Make it a Poisson regression
  family = "poisson"
)
## 
## Call:  glm(formula = n_visits ~ gender + income + travel, family = "poisson", 
##     data = snake_river_visits)
## 
## Coefficients:
##       (Intercept)       genderfemale  income($25k,$55k]  income($55k,$95k]  
##            4.0864             0.3740            -0.0199            -0.5807  
## income($95k,$Inf)   travel(0.25h,4h]    travel(4h,Infh)  
##           -0.5782            -0.6271            -2.4230  
## 
## Degrees of Freedom: 345 Total (i.e. Null);  339 Residual
##   (64 observations deleted due to missingness)
## Null Deviance:       18850 
## Residual Deviance: 11530     AIC: 12860

Define a function, run_poisson_regression(), to run a Poisson regression. This should take two arguments: data and formula, and call glm(), passing those arguments and setting family to poisson.

# Write a function to run a Poisson regression
run_poisson_regression <- function(data, formula){
  glm(formula, data, family = poisson)
}

Recreate the Poisson regression model from the first step, this time by calling your run_poisson_regression() function

# From previous step
run_poisson_regression <- function(data, formula) {
  glm(formula, data, family = poisson)
}

# Re-run the Poisson regression, using your function
model <- snake_river_visits %>%
  run_poisson_regression(n_visits ~ gender + income + travel)

All about arguments

Default arguments

Numeric defaults

cut_by_quantile() converts a numeric vector into a categorical variable where quantiles define the cut points. This is a useful function, but at the moment you have to specify five arguments to make it work. This is too much thinking and typing.

By specifying default arguments, you can make it easier to use. Let’s start with n, which specifies how many categories to cut x into.
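As a tiny illustration of the idea with a made-up function (not part of the exercise):

# greet() is hypothetical; greeting has a default, so callers can omit it
greet <- function(name, greeting = "Hello") {
  paste(greeting, name)
}
greet("Ada")         # "Hello Ada"
greet("Ada", "Hi")   # "Hi Ada"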

In the original exercise, a numeric vector of the number of visits to Snake River is provided as n_visits; here the anvisits column of the loomis data is used instead.

EXERCISES:

Update the definition of cut_by_quantile() so that the n argument defaults to 5. Remove the n argument from the call to cut_by_quantile().

loomis <- read.csv("./Data/loomis.csv", sep = ",")
# quantile function
cut_by_quantile <- function(x, n, na.rm, labels, interval_type) {
  probs <- seq(0, 1, length.out = n + 1)
  qtiles <- quantile(x, probs, na.rm = na.rm, names = FALSE)
  right <- switch(interval_type, "(lo, hi]" = TRUE, "[lo, hi)" = FALSE)
  cut(x, qtiles, labels = labels, right = right, include.lowest = TRUE)
}
loomis$anvisits[is.na(loomis$anvisits)] <- 0
loomis$anvisits
##   [1]   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##  [19]   0  12 100  35   1   6   2   1   1   1   1   1   1 100  80 104  55 350
##  [37]  20  60 250 100  50  40   9 200 200 100   8   6   2  15  12  30 120  52
##  [55]  35  30  75  10 250  15   4  25  50 114  50 100  15  30 120  30   0 160
##  [73]  12  25   3  15  14  15   8 125  96 260  25  30  30   1  50   6  12  72
##  [91]  20  25  50  30   1   1   5   1   3   6  50  10   9   4  12   2  15  50
## [109]   7 100  10  50   2  50   2 100  30   1   1   1   1   2   1   1   1   1
## [127]   1   1   4   1   1   2   2   1   1   2   1   1   1   2   2  10   3   3
## [145]   4   5   1   1   2   1   2   6   1   1   1   1 200  13 150  25  10  40
## [163]  10   1  30   6  35  24 100  17  20  40  52  15  60  30  20   6  70  35
## [181]  30  24 300 100   0  30  50  26  17   6 100  30  50  20 150  20  70  20
## [199] 100   1  20 100  50   0   1  10  60  10   2   7  24   6   0   4  90  20
## [217]  75  12  70  22   2  20   3  52   5  25  25  30   1  12  20  20  50  25
## [235]   3   3   1   1   1   1   7   2   1   0   1   3  15   1   1   3   1   1
## [253]   1   1   1   1   1   1   1   1 150   5   0 150   3  40   5   2  50   6
## [271]   6   6   2  25   6  25  50   3 150  60   3   4  50  14   4  60  30   3
## [289]   2   1  10   1   1   1   1   1   2   1   1   1   1   1   1   1   1   2
## [307]   1   1   1   1   2   1   1   2   1   1   1   1   2   1   1   1   1   1
## [325]   2   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1
## [343]   1   1   1   8   1   1   1   1   1   1   1   1   1   1   1   1   2   1
## [361]   1   1   1   1   1   1   1   1   1   1   2   1   0   1  50  30  40 208
## [379]  50  20 150  50  80  75   6  10   6  26  60  30  30  15  12  30  20  20
## [397] 120  15  75   4  35   2  30  76   2   1   3   1   3   2
# Set the default for n to 5
cut_by_quantile <- function(x, n = 5, na.rm, labels, interval_type) {
  probs <- seq(0, 1, length.out = n + 1)
  qtiles <- quantile(x, probs, na.rm = na.rm, names = FALSE)
  right <- switch(interval_type, "(lo, hi]" = TRUE, "[lo, hi)" = FALSE)
  cut(x, qtiles, labels = labels, right = right, include.lowest = TRUE)
}
# Remove the n argument from the call
cut_by_quantile(
  loomis$anvisits, 
  na.rm = FALSE, 
  labels = c("very low", "low", "medium", "high", "very high"),
  interval_type = "(lo, hi]"
)
##   [1] very low  very low  very low  very low  very low  very low  very low 
##   [8] very low  very low  very low  very low  very low  very low  very low 
##  [15] very low  very low  very low  very low  very low  high      very high
##  [22] high      very low  medium    low       very low  very low  very low 
##  [29] very low  very low  very low  very high very high very high very high
##  [36] very high high      very high very high very high very high very high
##  [43] medium    very high very high very high medium    medium    low      
##  [50] high      high      high      very high very high high      high     
##  [57] very high medium    very high high      medium    high      very high
##  [64] very high very high very high high      high      very high high     
##  [71] very low  very high high      high      medium    high      high     
##  [78] high      medium    very high very high very high high      high     
##  [85] high      very low  very high medium    high      very high high     
##  [92] high      very high high      very low  very low  medium    very low 
##  [99] medium    medium    very high medium    medium    medium    high     
## [106] low       high      very high medium    very high medium    very high
## [113] low       very high low       very high high      very low  very low 
## [120] very low  very low  low       very low  very low  very low  very low 
## [127] very low  very low  medium    very low  very low  low       low      
## [134] very low  very low  low       very low  very low  very low  low      
## [141] low       medium    medium    medium    medium    medium    very low 
## [148] very low  low       very low  low       medium    very low  very low 
## [155] very low  very low  very high high      very high high      medium   
## [162] very high medium    very low  high      medium    high      high     
## [169] very high high      high      very high very high high      very high
## [176] high      high      medium    very high high      high      high     
## [183] very high very high very low  high      very high high      high     
## [190] medium    very high high      very high high      very high high     
## [197] very high high      very high very low  high      very high very high
## [204] very low  very low  medium    very high medium    low       medium   
## [211] high      medium    very low  medium    very high high      very high
## [218] high      very high high      low       high      medium    very high
## [225] medium    high      high      high      very low  high      high     
## [232] high      very high high      medium    medium    very low  very low 
## [239] very low  very low  medium    low       very low  very low  very low 
## [246] medium    high      very low  very low  medium    very low  very low 
## [253] very low  very low  very low  very low  very low  very low  very low 
## [260] very low  very high medium    very low  very high medium    very high
## [267] medium    low       very high medium    medium    medium    low      
## [274] high      medium    high      very high medium    very high very high
## [281] medium    medium    very high high      medium    very high high     
## [288] medium    low       very low  medium    very low  very low  very low 
## [295] very low  very low  low       very low  very low  very low  very low 
## [302] very low  very low  very low  very low  low       very low  very low 
## [309] very low  very low  low       very low  very low  low       very low 
## [316] very low  very low  very low  low       very low  very low  very low 
## [323] very low  very low  low       very low  very low  very low  very low 
## [330] very low  very low  very low  very low  very low  very low  very low 
## [337] very low  very low  very low  very low  very low  very low  very low 
## [344] very low  very low  medium    very low  very low  very low  very low 
## [351] very low  very low  very low  very low  very low  very low  very low 
## [358] very low  low       very low  very low  very low  very low  very low 
## [365] very low  very low  very low  very low  very low  very low  low      
## [372] very low  very low  very low  very high high      very high very high
## [379] very high high      very high very high very high very high medium   
## [386] medium    medium    high      very high high      high      high     
## [393] high      high      high      high      very high high      very high
## [400] medium    high      low       high      very high low       very low 
## [407] medium    very low  medium    low      
## Levels: very low low medium high very high

Logical defaults

cut_by_quantile() is now slightly easier to use, but you still always have to specify the na.rm argument. This removes missing values – it behaves the same as the na.rm argument to mean() or sd().

Where functions have an argument for removing missing values, the best practice is to not remove them by default (in case you hadn’t spotted that you had missing values). That means that the default for na.rm should be FALSE.

EXERCISES:

Update the definition of cut_by_quantile() so that the na.rm argument defaults to FALSE.

# Set the default for na.rm to FALSE
cut_by_quantile <- function(x, n = 5, na.rm = FALSE, labels, interval_type) {
  probs <- seq(0, 1, length.out = n + 1)
  qtiles <- quantile(x, probs, na.rm = na.rm, names = FALSE)
  right <- switch(interval_type, "(lo, hi]" = TRUE, "[lo, hi)" = FALSE)
  cut(x, qtiles, labels = labels, right = right, include.lowest = TRUE)
}

Remove the na.rm argument from the call to cut_by_quantile().

# Remove the na.rm argument from the call
cut_by_quantile(
  loomis$anvisits, 
  labels = c("very low", "low", "medium", "high", "very high"),
  interval_type = "(lo, hi]"
)
##   [1] very low  very low  very low  very low  very low  very low  very low 
##   [8] very low  very low  very low  very low  very low  very low  very low 
##  [15] very low  very low  very low  very low  very low  high      very high
##  [22] high      very low  medium    low       very low  very low  very low 
##  [29] very low  very low  very low  very high very high very high very high
##  [36] very high high      very high very high very high very high very high
##  [43] medium    very high very high very high medium    medium    low      
##  [50] high      high      high      very high very high high      high     
##  [57] very high medium    very high high      medium    high      very high
##  [64] very high very high very high high      high      very high high     
##  [71] very low  very high high      high      medium    high      high     
##  [78] high      medium    very high very high very high high      high     
##  [85] high      very low  very high medium    high      very high high     
##  [92] high      very high high      very low  very low  medium    very low 
##  [99] medium    medium    very high medium    medium    medium    high     
## [106] low       high      very high medium    very high medium    very high
## [113] low       very high low       very high high      very low  very low 
## [120] very low  very low  low       very low  very low  very low  very low 
## [127] very low  very low  medium    very low  very low  low       low      
## [134] very low  very low  low       very low  very low  very low  low      
## [141] low       medium    medium    medium    medium    medium    very low 
## [148] very low  low       very low  low       medium    very low  very low 
## [155] very low  very low  very high high      very high high      medium   
## [162] very high medium    very low  high      medium    high      high     
## [169] very high high      high      very high very high high      very high
## [176] high      high      medium    very high high      high      high     
## [183] very high very high very low  high      very high high      high     
## [190] medium    very high high      very high high      very high high     
## [197] very high high      very high very low  high      very high very high
## [204] very low  very low  medium    very high medium    low       medium   
## [211] high      medium    very low  medium    very high high      very high
## [218] high      very high high      low       high      medium    very high
## [225] medium    high      high      high      very low  high      high     
## [232] high      very high high      medium    medium    very low  very low 
## [239] very low  very low  medium    low       very low  very low  very low 
## [246] medium    high      very low  very low  medium    very low  very low 
## [253] very low  very low  very low  very low  very low  very low  very low 
## [260] very low  very high medium    very low  very high medium    very high
## [267] medium    low       very high medium    medium    medium    low      
## [274] high      medium    high      very high medium    very high very high
## [281] medium    medium    very high high      medium    very high high     
## [288] medium    low       very low  medium    very low  very low  very low 
## [295] very low  very low  low       very low  very low  very low  very low 
## [302] very low  very low  very low  very low  low       very low  very low 
## [309] very low  very low  low       very low  very low  low       very low 
## [316] very low  very low  very low  low       very low  very low  very low 
## [323] very low  very low  low       very low  very low  very low  very low 
## [330] very low  very low  very low  very low  very low  very low  very low 
## [337] very low  very low  very low  very low  very low  very low  very low 
## [344] very low  very low  medium    very low  very low  very low  very low 
## [351] very low  very low  very low  very low  very low  very low  very low 
## [358] very low  low       very low  very low  very low  very low  very low 
## [365] very low  very low  very low  very low  very low  very low  low      
## [372] very low  very low  very low  very high high      very high very high
## [379] very high high      very high very high very high very high medium   
## [386] medium    medium    high      very high high      high      high     
## [393] high      high      high      high      very high high      very high
## [400] medium    high      low       high      very high low       very low 
## [407] medium    very low  medium    low      
## Levels: very low low medium high very high

NULL defaults

The cut() function used by cut_by_quantile() can automatically provide sensible labels for each category. The code to generate these labels is pretty complicated, so rather than appearing in the function signature directly, its labels argument defaults to NULL, and the calculation details are shown on the ?cut help page.

EXERCISES:

Update the definition of cut_by_quantile() so that the labels argument defaults to NULL.

# Set the default for labels to NULL
cut_by_quantile <- function(x, n = 5, na.rm = FALSE, labels = NULL, interval_type) {
  probs <- seq(0, 1, length.out = n + 1)
  qtiles <- quantile(x, probs, na.rm = na.rm, names = FALSE)
  right <- switch(interval_type, "(lo, hi]" = TRUE, "[lo, hi)" = FALSE)
  cut(x, qtiles, labels = labels, right = right, include.lowest = TRUE)
}

Remove the labels argument from the call to cut_by_quantile().

# Remove the labels argument from the call
cut_by_quantile(
  loomis$anvisits,
  interval_type = "(lo, hi]"
)
##   [1] [0,1]    [0,1]    [0,1]    [0,1]    [0,1]    [0,1]    [0,1]    [0,1]   
##   [9] [0,1]    [0,1]    [0,1]    [0,1]    [0,1]    [0,1]    [0,1]    [0,1]   
##  [17] [0,1]    [0,1]    [0,1]    (10,35]  (35,350] (10,35]  [0,1]    (2,10]  
##  [25] (1,2]    [0,1]    [0,1]    [0,1]    [0,1]    [0,1]    [0,1]    (35,350]
##  [33] (35,350] (35,350] (35,350] (35,350] (10,35]  (35,350] (35,350] (35,350]
##  [41] (35,350] (35,350] (2,10]   (35,350] (35,350] (35,350] (2,10]   (2,10]  
##  [49] (1,2]    (10,35]  (10,35]  (10,35]  (35,350] (35,350] (10,35]  (10,35] 
##  [57] (35,350] (2,10]   (35,350] (10,35]  (2,10]   (10,35]  (35,350] (35,350]
##  [65] (35,350] (35,350] (10,35]  (10,35]  (35,350] (10,35]  [0,1]    (35,350]
##  [73] (10,35]  (10,35]  (2,10]   (10,35]  (10,35]  (10,35]  (2,10]   (35,350]
##  [81] (35,350] (35,350] (10,35]  (10,35]  (10,35]  [0,1]    (35,350] (2,10]  
##  [89] (10,35]  (35,350] (10,35]  (10,35]  (35,350] (10,35]  [0,1]    [0,1]   
##  [97] (2,10]   [0,1]    (2,10]   (2,10]   (35,350] (2,10]   (2,10]   (2,10]  
## [105] (10,35]  (1,2]    (10,35]  (35,350] (2,10]   (35,350] (2,10]   (35,350]
## [113] (1,2]    (35,350] (1,2]    (35,350] (10,35]  [0,1]    [0,1]    [0,1]   
## [121] [0,1]    (1,2]    [0,1]    [0,1]    [0,1]    [0,1]    [0,1]    [0,1]   
## [129] (2,10]   [0,1]    [0,1]    (1,2]    (1,2]    [0,1]    [0,1]    (1,2]   
## [137] [0,1]    [0,1]    [0,1]    (1,2]    (1,2]    (2,10]   (2,10]   (2,10]  
## [145] (2,10]   (2,10]   [0,1]    [0,1]    (1,2]    [0,1]    (1,2]    (2,10]  
## [153] [0,1]    [0,1]    [0,1]    [0,1]    (35,350] (10,35]  (35,350] (10,35] 
## [161] (2,10]   (35,350] (2,10]   [0,1]    (10,35]  (2,10]   (10,35]  (10,35] 
## [169] (35,350] (10,35]  (10,35]  (35,350] (35,350] (10,35]  (35,350] (10,35] 
## [177] (10,35]  (2,10]   (35,350] (10,35]  (10,35]  (10,35]  (35,350] (35,350]
## [185] [0,1]    (10,35]  (35,350] (10,35]  (10,35]  (2,10]   (35,350] (10,35] 
## [193] (35,350] (10,35]  (35,350] (10,35]  (35,350] (10,35]  (35,350] [0,1]   
## [201] (10,35]  (35,350] (35,350] [0,1]    [0,1]    (2,10]   (35,350] (2,10]  
## [209] (1,2]    (2,10]   (10,35]  (2,10]   [0,1]    (2,10]   (35,350] (10,35] 
## [217] (35,350] (10,35]  (35,350] (10,35]  (1,2]    (10,35]  (2,10]   (35,350]
## [225] (2,10]   (10,35]  (10,35]  (10,35]  [0,1]    (10,35]  (10,35]  (10,35] 
## [233] (35,350] (10,35]  (2,10]   (2,10]   [0,1]    [0,1]    [0,1]    [0,1]   
## [241] (2,10]   (1,2]    [0,1]    [0,1]    [0,1]    (2,10]   (10,35]  [0,1]   
## [249] [0,1]    (2,10]   [0,1]    [0,1]    [0,1]    [0,1]    [0,1]    [0,1]   
## [257] [0,1]    [0,1]    [0,1]    [0,1]    (35,350] (2,10]   [0,1]    (35,350]
## [265] (2,10]   (35,350] (2,10]   (1,2]    (35,350] (2,10]   (2,10]   (2,10]  
## [273] (1,2]    (10,35]  (2,10]   (10,35]  (35,350] (2,10]   (35,350] (35,350]
## [281] (2,10]   (2,10]   (35,350] (10,35]  (2,10]   (35,350] (10,35]  (2,10]  
## [289] (1,2]    [0,1]    (2,10]   [0,1]    [0,1]    [0,1]    [0,1]    [0,1]   
## [297] (1,2]    [0,1]    [0,1]    [0,1]    [0,1]    [0,1]    [0,1]    [0,1]   
## [305] [0,1]    (1,2]    [0,1]    [0,1]    [0,1]    [0,1]    (1,2]    [0,1]   
## [313] [0,1]    (1,2]    [0,1]    [0,1]    [0,1]    [0,1]    (1,2]    [0,1]   
## [321] [0,1]    [0,1]    [0,1]    [0,1]    (1,2]    [0,1]    [0,1]    [0,1]   
## [329] [0,1]    [0,1]    [0,1]    [0,1]    [0,1]    [0,1]    [0,1]    [0,1]   
## [337] [0,1]    [0,1]    [0,1]    [0,1]    [0,1]    [0,1]    [0,1]    [0,1]   
## [345] [0,1]    (2,10]   [0,1]    [0,1]    [0,1]    [0,1]    [0,1]    [0,1]   
## [353] [0,1]    [0,1]    [0,1]    [0,1]    [0,1]    [0,1]    (1,2]    [0,1]   
## [361] [0,1]    [0,1]    [0,1]    [0,1]    [0,1]    [0,1]    [0,1]    [0,1]   
## [369] [0,1]    [0,1]    (1,2]    [0,1]    [0,1]    [0,1]    (35,350] (10,35] 
## [377] (35,350] (35,350] (35,350] (10,35]  (35,350] (35,350] (35,350] (35,350]
## [385] (2,10]   (2,10]   (2,10]   (10,35]  (35,350] (10,35]  (10,35]  (10,35] 
## [393] (10,35]  (10,35]  (10,35]  (10,35]  (35,350] (10,35]  (35,350] (2,10]  
## [401] (10,35]  (1,2]    (10,35]  (35,350] (1,2]    [0,1]    (2,10]   [0,1]   
## [409] (2,10]   (1,2]   
## Levels: [0,1] (1,2] (2,10] (10,35] (35,350]

Categorical defaults

When cutting up a numeric vector, you need to worry about what happens if a value lands exactly on a boundary. You can either put this value into a category of the lower interval or the higher interval. That is, you can choose your intervals to include values at the top boundary but not the bottom (in mathematical terminology, “open on the left, closed on the right”, or (lo, hi]). Or you can choose the opposite (“closed on the left, open on the right”, or [lo, hi)). cut_by_quantile() should allow these two choices.

The pattern for categorical defaults is:

function(cat_arg = c("choice1", "choice2")) {
  cat_arg <- match.arg(cat_arg)
}

Free hint: in the console, type head(rank) to see the start of rank()'s definition, and look at the ties.method argument.
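As a standalone sketch of the pattern (a toy function, not part of the exercise), match.arg() picks the first choice when the argument is missing and validates it otherwise:

# Toy illustration of categorical defaults with match.arg()
greet <- function(style = c("formal", "casual")) {
  style <- match.arg(style)
  switch(style, formal = "Good day", casual = "Hi")
}
greet()          # no argument: defaults to the first choice, "formal"
greet("casual")  # an explicit, valid choice
greet("cas")     # partial matching is also accepted
# greet("rude")  # anything else throws an informative error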

EXERCISES:

Update the signature of cut_by_quantile() so that the interval_type argument can be “(lo, hi]” or “[lo, hi)”. Note the space after each comma.

# Set the categories for interval_type to "(lo, hi]" and "[lo, hi)"
cut_by_quantile <- function(x, n = 5, na.rm = FALSE, labels = NULL, 
                            interval_type = c("(lo, hi]", "[lo, hi)")) {
  # Match the interval_type argument
  interval_type <- match.arg(interval_type)
  probs <- seq(0, 1, length.out = n + 1)
  qtiles <- quantile(x, probs, na.rm = na.rm, names = FALSE)
  right <- switch(interval_type, "(lo, hi]" = TRUE, "[lo, hi)" = FALSE)
  cut(x, qtiles, labels = labels, right = right, include.lowest = TRUE)
}

Update the body of cut_by_quantile() to match the interval_type argument.

Remove the interval_type argument from the call to cut_by_quantile().

# Remove the interval_type argument from the call
cut_by_quantile(loomis$anvisits)
##   [1] [0,1]    [0,1]    [0,1]    [0,1]    [0,1]    [0,1]    [0,1]    [0,1]   
##   [9] [0,1]    [0,1]    [0,1]    [0,1]    [0,1]    [0,1]    [0,1]    [0,1]   
##  [17] [0,1]    [0,1]    [0,1]    (10,35]  (35,350] (10,35]  [0,1]    (2,10]  
##  [25] (1,2]    [0,1]    [0,1]    [0,1]    [0,1]    [0,1]    [0,1]    (35,350]
##  [33] (35,350] (35,350] (35,350] (35,350] (10,35]  (35,350] (35,350] (35,350]
##  [41] (35,350] (35,350] (2,10]   (35,350] (35,350] (35,350] (2,10]   (2,10]  
##  [49] (1,2]    (10,35]  (10,35]  (10,35]  (35,350] (35,350] (10,35]  (10,35] 
##  [57] (35,350] (2,10]   (35,350] (10,35]  (2,10]   (10,35]  (35,350] (35,350]
##  [65] (35,350] (35,350] (10,35]  (10,35]  (35,350] (10,35]  [0,1]    (35,350]
##  [73] (10,35]  (10,35]  (2,10]   (10,35]  (10,35]  (10,35]  (2,10]   (35,350]
##  [81] (35,350] (35,350] (10,35]  (10,35]  (10,35]  [0,1]    (35,350] (2,10]  
##  [89] (10,35]  (35,350] (10,35]  (10,35]  (35,350] (10,35]  [0,1]    [0,1]   
##  [97] (2,10]   [0,1]    (2,10]   (2,10]   (35,350] (2,10]   (2,10]   (2,10]  
## [105] (10,35]  (1,2]    (10,35]  (35,350] (2,10]   (35,350] (2,10]   (35,350]
## [113] (1,2]    (35,350] (1,2]    (35,350] (10,35]  [0,1]    [0,1]    [0,1]   
## [121] [0,1]    (1,2]    [0,1]    [0,1]    [0,1]    [0,1]    [0,1]    [0,1]   
## [129] (2,10]   [0,1]    [0,1]    (1,2]    (1,2]    [0,1]    [0,1]    (1,2]   
## [137] [0,1]    [0,1]    [0,1]    (1,2]    (1,2]    (2,10]   (2,10]   (2,10]  
## [145] (2,10]   (2,10]   [0,1]    [0,1]    (1,2]    [0,1]    (1,2]    (2,10]  
## [153] [0,1]    [0,1]    [0,1]    [0,1]    (35,350] (10,35]  (35,350] (10,35] 
## [161] (2,10]   (35,350] (2,10]   [0,1]    (10,35]  (2,10]   (10,35]  (10,35] 
## [169] (35,350] (10,35]  (10,35]  (35,350] (35,350] (10,35]  (35,350] (10,35] 
## [177] (10,35]  (2,10]   (35,350] (10,35]  (10,35]  (10,35]  (35,350] (35,350]
## [185] [0,1]    (10,35]  (35,350] (10,35]  (10,35]  (2,10]   (35,350] (10,35] 
## [193] (35,350] (10,35]  (35,350] (10,35]  (35,350] (10,35]  (35,350] [0,1]   
## [201] (10,35]  (35,350] (35,350] [0,1]    [0,1]    (2,10]   (35,350] (2,10]  
## [209] (1,2]    (2,10]   (10,35]  (2,10]   [0,1]    (2,10]   (35,350] (10,35] 
## [217] (35,350] (10,35]  (35,350] (10,35]  (1,2]    (10,35]  (2,10]   (35,350]
## [225] (2,10]   (10,35]  (10,35]  (10,35]  [0,1]    (10,35]  (10,35]  (10,35] 
## [233] (35,350] (10,35]  (2,10]   (2,10]   [0,1]    [0,1]    [0,1]    [0,1]   
## [241] (2,10]   (1,2]    [0,1]    [0,1]    [0,1]    (2,10]   (10,35]  [0,1]   
## [249] [0,1]    (2,10]   [0,1]    [0,1]    [0,1]    [0,1]    [0,1]    [0,1]   
## [257] [0,1]    [0,1]    [0,1]    [0,1]    (35,350] (2,10]   [0,1]    (35,350]
## [265] (2,10]   (35,350] (2,10]   (1,2]    (35,350] (2,10]   (2,10]   (2,10]  
## [273] (1,2]    (10,35]  (2,10]   (10,35]  (35,350] (2,10]   (35,350] (35,350]
## [281] (2,10]   (2,10]   (35,350] (10,35]  (2,10]   (35,350] (10,35]  (2,10]  
## [289] (1,2]    [0,1]    (2,10]   [0,1]    [0,1]    [0,1]    [0,1]    [0,1]   
## [297] (1,2]    [0,1]    [0,1]    [0,1]    [0,1]    [0,1]    [0,1]    [0,1]   
## [305] [0,1]    (1,2]    [0,1]    [0,1]    [0,1]    [0,1]    (1,2]    [0,1]   
## [313] [0,1]    (1,2]    [0,1]    [0,1]    [0,1]    [0,1]    (1,2]    [0,1]   
## [321] [0,1]    [0,1]    [0,1]    [0,1]    (1,2]    [0,1]    [0,1]    [0,1]   
## [329] [0,1]    [0,1]    [0,1]    [0,1]    [0,1]    [0,1]    [0,1]    [0,1]   
## [337] [0,1]    [0,1]    [0,1]    [0,1]    [0,1]    [0,1]    [0,1]    [0,1]   
## [345] [0,1]    (2,10]   [0,1]    [0,1]    [0,1]    [0,1]    [0,1]    [0,1]   
## [353] [0,1]    [0,1]    [0,1]    [0,1]    [0,1]    [0,1]    (1,2]    [0,1]   
## [361] [0,1]    [0,1]    [0,1]    [0,1]    [0,1]    [0,1]    [0,1]    [0,1]   
## [369] [0,1]    [0,1]    (1,2]    [0,1]    [0,1]    [0,1]    (35,350] (10,35] 
## [377] (35,350] (35,350] (35,350] (10,35]  (35,350] (35,350] (35,350] (35,350]
## [385] (2,10]   (2,10]   (2,10]   (10,35]  (35,350] (10,35]  (10,35]  (10,35] 
## [393] (10,35]  (10,35]  (10,35]  (10,35]  (35,350] (10,35]  (35,350] (2,10]  
## [401] (10,35]  (1,2]    (10,35]  (35,350] (1,2]    [0,1]    (2,10]   [0,1]   
## [409] (2,10]   (1,2]   
## Levels: [0,1] (1,2] (2,10] (10,35] (35,350]

Passing arguments between functions

Harmonic mean

The harmonic mean is the reciprocal of the arithmetic mean of the reciprocals of the data. That is, harmonic_mean(x) = 1 / mean(1 / x), or equivalently n / sum(1 / x).

The harmonic mean is often used to average ratio data. You’ll be using it on the price/earnings ratio of stocks in the Standard and Poor’s 500 index, provided as std_and_poor500. Price/earnings ratio is a measure of how expensive a stock is.
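As a quick numeric illustration of the formula on a toy vector (not part of the S&P 500 data):

# Harmonic mean of a toy vector, written out from the definition
x <- c(1, 2, 4)
1 / mean(1 / x)          # reciprocal of the mean of the reciprocals: 12/7, about 1.71
length(x) / sum(1 / x)   # equivalent form: n over the sum of reciprocals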

std_and_poor500 <- read.csv("./Data/std_and_poor500.csv", sep = ",", col.names = c("x", "symbol", "company", "sector", "industry", "pe_ratio"))

The dplyr package is loaded.

EXERCISES:

Look at std_and_poor500 (you’ll need this later). Write a function, get_reciprocal, to get the reciprocal of an input x. Its only argument should be x, and it should return one over x.

# Drop the row-index column ("x") written out with the CSV
std_and_poor500 <- std_and_poor500[, -1]
# Look at the Standard and Poor 500 data
glimpse(std_and_poor500)
## Rows: 505
## Columns: 5
## $ symbol   <chr> "MMM", "ABT", "ABBV", "ABMD", "ACN", "ATVI", "ADBE", "AMD"...
## $ company  <chr> "3M Company", "Abbott Laboratories", "AbbVie Inc.", "ABIOM...
## $ sector   <chr> "Industrials", "Health Care", "Health Care", "Health Care"...
## $ industry <chr> "Industrial Conglomerates", "Health Care Equipment", "Phar...
## $ pe_ratio <dbl> 18.31678, 57.66621, 22.43805, 45.63993, 27.00233, 20.13596...
# Write a function to calculate the reciprocal
get_reciprocal <- function(x) {
  1/x
}

Write a function, calc_harmonic_mean(), that calculates the harmonic mean of its only input, x.

# Write a function to calculate the harmonic mean
calc_harmonic_mean <- function(x) {
  x %>%
    get_reciprocal() %>%
    mean() %>%
    get_reciprocal()
}

Using std_and_poor500, group by sector, and summarize to calculate the harmonic mean of the price/earning ratios in the pe_ratio column.

names(std_and_poor500)
## [1] "symbol"   "company"  "sector"   "industry" "pe_ratio"
std_and_poor500 %>% 
  # Group by sector
  group_by(sector) %>% 
  # Summarize, calculating harmonic mean of P/E ratio
  summarise(hmean_pe_ratio = calc_harmonic_mean(pe_ratio))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 11 x 2
##    sector                 hmean_pe_ratio
##    <chr>                           <dbl>
##  1 Communication Services           NA  
##  2 Consumer Discretionary           NA  
##  3 Consumer Staples                 NA  
##  4 Energy                           NA  
##  5 Financials                       NA  
##  6 Health Care                      NA  
##  7 Industrials                      NA  
##  8 Information Technology           NA  
##  9 Materials                        NA  
## 10 Real Estate                      32.5
## 11 Utilities                        NA

Dealing with missing values

In the last exercise, many sectors had an NA value for the harmonic mean. It would be useful for your function to be able to remove missing values before calculating.

Rather than writing your own code for this, you can outsource this functionality to mean().

The dplyr package is loaded.

EXERCISE:

Modify the signature and body of calc_harmonic_mean() so it has an na.rm argument, defaulting to FALSE, that gets passed to mean().

# Add an na.rm arg with a default, and pass it to mean()
calc_harmonic_mean <- function(x, na.rm = FALSE) {
  x %>%
    get_reciprocal() %>%
    mean(na.rm = na.rm) %>%
    get_reciprocal()
}
std_and_poor500 %>% 
  # Group by sector
  group_by(sector) %>% 
  # Summarize, calculating harmonic mean of P/E ratio
  summarize(hmean_pe_ratio = calc_harmonic_mean(pe_ratio, na.rm = TRUE))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 11 x 2
##    sector                 hmean_pe_ratio
##    <chr>                           <dbl>
##  1 Communication Services           17.5
##  2 Consumer Discretionary           15.2
##  3 Consumer Staples                 19.8
##  4 Energy                           13.7
##  5 Financials                       12.9
##  6 Health Care                      26.6
##  7 Industrials                      18.2
##  8 Information Technology           21.6
##  9 Materials                        16.3
## 10 Real Estate                      32.5
## 11 Utilities                        23.9

Passing arguments with ...

Rather than explicitly giving calc_harmonic_mean() an na.rm argument, you can use ... to simply pass any other arguments along to mean().

The dplyr package is loaded.

EXERCISE.

Replace the na.rm argument with … in the signature and body of calc_harmonic_mean().

# Swap na.rm arg for ... in signature and body
calc_harmonic_mean <- function(x, ...) {
  x %>%
    get_reciprocal() %>%
    mean(...) %>%
    get_reciprocal()
}
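Because any extra arguments now flow straight through to mean(), callers are no longer limited to na.rm; mean()'s trim argument, for instance, can also be forwarded (shown here purely as an illustration of the flexibility):

# Any argument that mean() understands can be forwarded via ...
calc_harmonic_mean(std_and_poor500$pe_ratio, na.rm = TRUE, trim = 0.1)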

Using std_and_poor500, group by sector, and summarize to calculate the harmonic mean of the price/earning ratios in the pe_ratio column, removing missing values.

std_and_poor500 %>% 
  # Group by sector
  group_by(sector) %>% 
  # Summarize, calculating harmonic mean of P/E ratio
  summarize(hmean_pe_ratio = calc_harmonic_mean(pe_ratio, na.rm = TRUE))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 11 x 2
##    sector                 hmean_pe_ratio
##    <chr>                           <dbl>
##  1 Communication Services           17.5
##  2 Consumer Discretionary           15.2
##  3 Consumer Staples                 19.8
##  4 Energy                           13.7
##  5 Financials                       12.9
##  6 Health Care                      26.6
##  7 Industrials                      18.2
##  8 Information Technology           21.6
##  9 Materials                        16.3
## 10 Real Estate                      32.5
## 11 Utilities                        23.9

Delightful use of dots! Did you notice that this code was the same as in the previous exercise? Using ... doesn't change how people use your function; it just means the function is more flexible. Whether flexible means better (or not) is up to you to decide.

Checking arguments

Throwing errors with bad arguments

If a user provides a bad input to a function, the best course of action is to throw an error letting them know. The two rules are:

Throw the error as soon as you realize there is a problem (typically at the start of the function), and make the error message easily understandable. You can use the assert_*() functions from the assertive package to check inputs and throw errors when they fail.

EXERCISE:

Add a line to the body of calc_harmonic_mean() to assert that x is numeric.

calc_harmonic_mean <- function(x, na.rm = FALSE) {
  # Assert that x is numeric
  assert_is_numeric(x)
  x %>%
    get_reciprocal() %>%
    mean(na.rm = na.rm) %>%
    get_reciprocal()
}

Look at what happens when you pass a character argument to calc_harmonic_mean().

# See what happens when you pass it strings
#calc_harmonic_mean(std_and_poor500$sector)
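To see the assertion fire without halting a knitted document, the call can be wrapped in try() (assuming the assertive package is loaded so that assert_is_numeric() is available):

# Trigger the assertion safely with a throwaway character vector
try(calc_harmonic_mean(c("a", "b", "c")))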

Amazing assertions! Providing human-readable error messages is important when your users are humans!

Custom error logic

Sometimes the assert_*() functions in assertive don't give the most informative error message. For example, the assertions that check if a number is in a numeric range will tell the user that a value is out of range, but they won't say why that's a problem. In that case, you can use the is_*() functions in conjunction with messages, warnings, or errors to define custom feedback.

The harmonic mean only makes sense when x has all positive values. (Try calculating the harmonic mean of one and minus one to see why.) Make sure your users know this!

EXERCISE:

If any values of x are non-positive (ignoring NAs) then throw an error.

calc_harmonic_mean <- function(x, na.rm = FALSE) {
  assert_is_numeric(x)
  # Check if any values of x are non-positive
  if(any(is_non_positive(x), na.rm = TRUE)) {
    # Throw an error
    stop("x contains non-positive values, so the harmonic mean makes no sense.")
  }
  x %>%
    get_reciprocal() %>%
    mean(na.rm = na.rm) %>%
    get_reciprocal()
}

Look at what happens when you pass non-positive values (the P/E ratios shifted down by 20) to calc_harmonic_mean().

# See what happens when you pass it negative numbers
#calc_harmonic_mean(std_and_poor500$pe_ratio - 20)
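The text's suggestion of trying one and minus one shows why the check matters: the reciprocals cancel, the mean of the reciprocals is zero, and the naive calculation blows up.

# Why non-positive values are rejected: 1 and -1 give a meaningless result
x <- c(1, -1)
1 / mean(1 / x)             # Inf, because mean(1/x) is 0
try(calc_harmonic_mean(x))  # now stops with the custom error message instead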

Cool custom logic! Explaining what went wrong is helpful to users. Explaining why it is wrong is even better!

Fixing function arguments

The harmonic mean function is almost complete. However, you still need to provide some checks on the na.rm argument. This time, rather than throwing errors when the input is in an incorrect form, you are going to try to fix it.

na.rm should be a logical vector with one element (that is, TRUE or FALSE).

The assertive package is loaded for you.

EXERCISE:

Update calc_harmonic_mean() to fix the na.rm argument. Use use_first() to select the first element, and coerce_to() to change it to logical.

# Update the function definition to fix the na.rm argument
calc_harmonic_mean <- function(x, na.rm = FALSE) {
  assert_is_numeric(x)
  if(any(is_non_positive(x), na.rm = TRUE)) {
    stop("x contains non-positive values, so the harmonic mean makes no sense.")
  }
  # Use the first value of na.rm, and coerce to logical
  na.rm <- coerce_to(use_first(na.rm), target_class = "logical")
  x %>%
    get_reciprocal() %>%
    mean(na.rm = na.rm) %>%
    get_reciprocal()
}
# See what happens when you pass it malformed na.rm
#calc_harmonic_mean(std_and_poor500$pe_ratio, na.rm = 1:5)
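As a standalone peek at the two fixer functions (assuming the assertive package is attached), each repairs its input and typically warns about the fix it made:

# use_first() keeps only the first element; coerce_to() changes the class
use_first(1:5)                           # keeps only the first element, 1
coerce_to(1, target_class = "logical")   # coerces 1 to TRUE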

Considerate argument correction! For small problems with inputs, it can be better to fix things rather than throwing an error.

Return values and scope

Returning values from functions

Returning early

Sometimes, you don’t need to run through the whole body of a function to get the answer. In that case you can return early from that function using return().

To check if x is divisible by n, you can use is_divisible_by(x, n) from assertive.

Alternatively, use the modulo operator, %%. x %% n gives the remainder when dividing x by n, so x %% n == 0 determines whether x is divisible by n. Try 1:10 %% 3 == 0 in the console.

To solve this exercise, you need to know that a leap year is every 400th year (like the year 2000) or every 4th year that isn’t a century (like 1904 but not 1900 or 1905).

assertive is loaded.

EXERCISE:

Complete the definition of is_leap_year(), checking for the cases of year being divisible by 400, then 100, then 4, returning early from the function in each case.

is_leap_year <- function(year) {
  # If year is divisible by 400, return TRUE
  if (year %% 400 == 0) {
    return(TRUE)
  }
  # If year is divisible by 100, return FALSE
  if (year %% 100 == 0) {
    return(FALSE)
  }
  # If year is divisible by 4, return TRUE
  if (year %% 4 == 0) {
    return(TRUE)
  }
  # Otherwise return FALSE
  FALSE
}
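A quick check against the years mentioned above confirms the early returns behave as described (these results follow directly from the definition of a leap year):

# Spot-check the leap year rules
is_leap_year(2000)  # TRUE  (divisible by 400)
is_leap_year(1900)  # FALSE (a century not divisible by 400)
is_leap_year(1904)  # TRUE  (divisible by 4, not a century)
is_leap_year(1905)  # FALSE (not divisible by 4)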

Returning invisibly

When the main purpose of a function is to generate output, like drawing a plot or printing something in the console, you may not want a return value to be printed as well. In that case, the value should be invisibly returned.

The base R plot function returns NULL, since its main purpose is to draw a plot. This isn’t helpful if you want to use it in piped code: instead it should invisibly return the plot data to be piped on to the next step.

Recall that plot() has a formula interface: instead of giving it vectors for x and y, you can specify a formula describing which columns of a data frame go on the x and y axes, and a data argument for the data frame. Note that, just like lm(), the arguments are the wrong way round for piping: the detail argument, formula, comes before the data argument.

plot(y ~ x, data = data)

EXERCISE:

cars <- read.csv("./Data/cars.csv", sep  = ";", col.names = c( "speed", "dist"))

Using the cars dataset and the formula interface to plot(), draw a scatter plot of dist versus speed, assigning the result to plt_dist_vs_speed.

# Using cars, draw a scatter plot of dist vs. speed
plt_dist_vs_speed <- plot(dist ~ speed, data = cars)

# Oh no! The plot object is NULL
#plt_dist_vs_speed

Give pipeable_plot() data and formula arguments (in that order). Make it draw the plot, then invisibly return data.

# Define a pipeable plot fn with data and formula args
pipeable_plot <- function(data, formula) {
   # Call plot() with the formula interface
  plot(formula, data)
  # Invisibly return the input dataset
  invisible(data)
}
# Draw the scatter plot of dist vs. speed again
plt_dist_vs_speed <- cars %>% 
  pipeable_plot(dist ~ speed)

# Now the plot object has a value
plt_dist_vs_speed
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10
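Since pipeable_plot() invisibly returns its input, the plot can sit in the middle of a longer pipeline; for example, the same data can be drawn and then summarized in one chain:

# Draw the plot, then keep piping the data onwards
cars %>%
  pipeable_plot(dist ~ speed) %>%
  summary()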

Returning multiple values from functions

Returning many things

Functions can only return one value. If you want to return multiple things, then you can store them all in a list.

If users want to have the list items as separate variables, they can assign each list element to its own variable using zeallot’s multi-assignment operator, %<-%.

glance(), tidy(), and augment() each take the model object as their only argument.

The Poisson regression model of Snake River visits is available as model. broom and zeallot are loaded.

EXERCISES:

Examine the structure of model.

str(model)
## List of 31
##  $ coefficients     : Named num [1:7] 4.0864 0.374 -0.0199 -0.5807 -0.5782 ...
##   ..- attr(*, "names")= chr [1:7] "(Intercept)" "genderfemale" "income($25k,$55k]" "income($55k,$95k]" ...
##  $ residuals        : Named num [1:346] -0.535 -0.768 -0.944 -0.662 -0.767 ...
##   ..- attr(*, "names")= chr [1:346] "25" "26" "27" "29" ...
##  $ fitted.values    : Named num [1:346] 4.3 4.3 17.83 2.96 4.29 ...
##   ..- attr(*, "names")= chr [1:346] "25" "26" "27" "29" ...
##  $ effects          : Named num [1:346] -360 -29.2 20.3 -10 23.4 ...
##   ..- attr(*, "names")= chr [1:346] "(Intercept)" "genderfemale" "income($25k,$55k]" "income($55k,$95k]" ...
##  $ R                : num [1:7, 1:7] -97.4 0 0 0 0 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:7] "(Intercept)" "genderfemale" "income($25k,$55k]" "income($55k,$95k]" ...
##   .. ..$ : chr [1:7] "(Intercept)" "genderfemale" "income($25k,$55k]" "income($55k,$95k]" ...
##  $ rank             : int 7
##  $ qr               :List of 5
##   ..$ qr   : num [1:346, 1:7] -97.3861 0.0213 0.0434 0.0177 0.0213 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr [1:346] "25" "26" "27" "29" ...
##   .. .. ..$ : chr [1:7] "(Intercept)" "genderfemale" "income($25k,$55k]" "income($55k,$95k]" ...
##   ..$ rank : int 7
##   ..$ qraux: num [1:7] 1.02 1.02 1.04 1.01 1 ...
##   ..$ pivot: int [1:7] 1 2 3 4 5 6 7
##   ..$ tol  : num 1e-11
##   ..- attr(*, "class")= chr "qr"
##  $ family           :List of 12
##   ..$ family    : chr "poisson"
##   ..$ link      : chr "log"
##   ..$ linkfun   :function (mu)  
##   ..$ linkinv   :function (eta)  
##   ..$ variance  :function (mu)  
##   ..$ dev.resids:function (y, mu, wt)  
##   ..$ aic       :function (y, n, mu, wt, dev)  
##   ..$ mu.eta    :function (eta)  
##   ..$ initialize:  expression({  if (any(y < 0))  stop("negative values not allowed for the 'Poisson' family")  n <- rep.int(1, nobs| __truncated__
##   ..$ validmu   :function (mu)  
##   ..$ valideta  :function (eta)  
##   ..$ simulate  :function (object, nsim)  
##   ..- attr(*, "class")= chr "family"
##  $ linear.predictors: Named num [1:346] 1.46 1.46 2.88 1.09 1.46 ...
##   ..- attr(*, "names")= chr [1:346] "25" "26" "27" "29" ...
##  $ deviance         : num 11529
##  $ aic              : num 12864
##  $ null.deviance    : num 18850
##  $ iter             : int 6
##  $ weights          : Named num [1:346] 4.3 4.3 17.83 2.96 4.29 ...
##   ..- attr(*, "names")= chr [1:346] "25" "26" "27" "29" ...
##  $ prior.weights    : Named num [1:346] 1 1 1 1 1 1 1 1 1 1 ...
##   ..- attr(*, "names")= chr [1:346] "25" "26" "27" "29" ...
##  $ df.residual      : int 339
##  $ df.null          : int 345
##  $ y                : Named num [1:346] 2 1 1 1 1 1 80 104 55 350 ...
##   ..- attr(*, "names")= chr [1:346] "25" "26" "27" "29" ...
##  $ converged        : logi TRUE
##  $ boundary         : logi FALSE
##  $ model            :'data.frame':   346 obs. of  4 variables:
##   ..$ n_visits: num [1:346] 2 1 1 1 1 1 80 104 55 350 ...
##   ..$ gender  : Factor w/ 2 levels "male","female": 2 2 1 1 2 1 2 2 1 2 ...
##   ..$ income  : Factor w/ 4 levels "[$0,$25k]","($25k,$55k]",..: 4 4 4 4 3 1 1 4 2 2 ...
##   ..$ travel  : Factor w/ 3 levels "[0h,0.25h]","(0.25h,4h]",..: 3 3 2 3 3 1 1 1 2 1 ...
##   ..- attr(*, "terms")=Classes 'terms', 'formula'  language n_visits ~ gender + income + travel
##   .. .. ..- attr(*, "variables")= language list(n_visits, gender, income, travel)
##   .. .. ..- attr(*, "factors")= int [1:4, 1:3] 0 1 0 0 0 0 1 0 0 0 ...
##   .. .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. .. ..$ : chr [1:4] "n_visits" "gender" "income" "travel"
##   .. .. .. .. ..$ : chr [1:3] "gender" "income" "travel"
##   .. .. ..- attr(*, "term.labels")= chr [1:3] "gender" "income" "travel"
##   .. .. ..- attr(*, "order")= int [1:3] 1 1 1
##   .. .. ..- attr(*, "intercept")= int 1
##   .. .. ..- attr(*, "response")= int 1
##   .. .. ..- attr(*, ".Environment")=<environment: 0x000000002a18d358> 
##   .. .. ..- attr(*, "predvars")= language list(n_visits, gender, income, travel)
##   .. .. ..- attr(*, "dataClasses")= Named chr [1:4] "numeric" "factor" "factor" "factor"
##   .. .. .. ..- attr(*, "names")= chr [1:4] "n_visits" "gender" "income" "travel"
##   ..- attr(*, "na.action")= 'omit' Named int [1:64] 1 2 3 4 5 6 7 8 9 10 ...
##   .. ..- attr(*, "names")= chr [1:64] "1" "2" "3" "4" ...
##  $ na.action        : 'omit' Named int [1:64] 1 2 3 4 5 6 7 8 9 10 ...
##   ..- attr(*, "names")= chr [1:64] "1" "2" "3" "4" ...
##  $ call             : language glm(formula = formula, family = poisson, data = data)
##  $ formula          :Class 'formula'  language n_visits ~ gender + income + travel
##   .. ..- attr(*, ".Environment")=<environment: 0x000000002a18d358> 
##  $ terms            :Classes 'terms', 'formula'  language n_visits ~ gender + income + travel
##   .. ..- attr(*, "variables")= language list(n_visits, gender, income, travel)
##   .. ..- attr(*, "factors")= int [1:4, 1:3] 0 1 0 0 0 0 1 0 0 0 ...
##   .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. ..$ : chr [1:4] "n_visits" "gender" "income" "travel"
##   .. .. .. ..$ : chr [1:3] "gender" "income" "travel"
##   .. ..- attr(*, "term.labels")= chr [1:3] "gender" "income" "travel"
##   .. ..- attr(*, "order")= int [1:3] 1 1 1
##   .. ..- attr(*, "intercept")= int 1
##   .. ..- attr(*, "response")= int 1
##   .. ..- attr(*, ".Environment")=<environment: 0x000000002a18d358> 
##   .. ..- attr(*, "predvars")= language list(n_visits, gender, income, travel)
##   .. ..- attr(*, "dataClasses")= Named chr [1:4] "numeric" "factor" "factor" "factor"
##   .. .. ..- attr(*, "names")= chr [1:4] "n_visits" "gender" "income" "travel"
##  $ data             :'data.frame':   410 obs. of  4 variables:
##   ..$ n_visits: num [1:410] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ gender  : Factor w/ 2 levels "male","female": 1 1 1 2 1 2 2 2 1 1 ...
##   ..$ income  : Factor w/ 4 levels "[$0,$25k]","($25k,$55k]",..: 4 2 4 2 4 2 4 4 4 4 ...
##   ..$ travel  : Factor w/ 3 levels "[0h,0.25h]","(0.25h,4h]",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ offset           : NULL
##  $ control          :List of 3
##   ..$ epsilon: num 1e-08
##   ..$ maxit  : num 25
##   ..$ trace  : logi FALSE
##  $ method           : chr "glm.fit"
##  $ contrasts        :List of 3
##   ..$ gender: chr "contr.treatment"
##   ..$ income: chr "contr.treatment"
##   ..$ travel: chr "contr.treatment"
##  $ xlevels          :List of 3
##   ..$ gender: chr [1:2] "male" "female"
##   ..$ income: chr [1:4] "[$0,$25k]" "($25k,$55k]" "($55k,$95k]" "($95k,$Inf)"
##   ..$ travel: chr [1:3] "[0h,0.25h]" "(0.25h,4h]" "(4h,Infh)"
##  - attr(*, "class")= chr [1:2] "glm" "lm"

Use broom functions to create a list containing the model-, coefficient-, and observation-level parts of model.

# Use broom tools to get a list of 3 data frames
list(
  # Get model-level values
  model = glance(model),
  # Get coefficient-level values
  coefficients = tidy(model),
  # Get observation-level values
  observations = augment(model)
)
## $model
## # A tibble: 1 x 8
##   null.deviance df.null logLik    AIC    BIC deviance df.residual  nobs
##           <dbl>   <int>  <dbl>  <dbl>  <dbl>    <dbl>       <int> <int>
## 1        18850.     345 -6425. 12864. 12891.   11529.         339   346
## 
## $coefficients
## # A tibble: 7 x 5
##   term              estimate std.error statistic   p.value
##   <chr>                <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)         4.09      0.0279   146.    0.       
## 2 genderfemale        0.374     0.0212    17.6   2.18e- 69
## 3 income($25k,$55k]  -0.0199    0.0267    -0.746 4.56e-  1
## 4 income($55k,$95k]  -0.581     0.0343   -16.9   3.28e- 64
## 5 income($95k,$Inf)  -0.578     0.0310   -18.7   6.88e- 78
## 6 travel(0.25h,4h]   -0.627     0.0217   -28.8   5.40e-183
## 7 travel(4h,Infh)    -2.42      0.0492   -49.3   0.       
## 
## $observations
## # A tibble: 346 x 11
##    .rownames n_visits gender income travel .fitted  .resid .std.resid    .hat
##    <chr>        <dbl> <fct>  <fct>  <fct>    <dbl>   <dbl>      <dbl>   <dbl>
##  1 25               2 female ($95k~ (4h,I~    1.46  -1.24      -1.25  0.0109 
##  2 26               1 female ($95k~ (4h,I~    1.46  -1.92      -1.93  0.0109 
##  3 27               1 male   ($95k~ (0.25~    2.88  -5.28      -5.32  0.0129 
##  4 29               1 male   ($95k~ (4h,I~    1.09  -1.32      -1.33  0.00711
##  5 30               1 female ($55k~ (4h,I~    1.46  -1.92      -1.93  0.0121 
##  6 31               1 male   [$0,$~ [0h,0~    4.09 -10.4      -10.7   0.0465 
##  7 33              80 female [$0,$~ [0h,0~    4.46  -0.710     -0.728 0.0479 
##  8 34             104 female ($95k~ [0h,0~    3.88   6.90       7.02  0.0332 
##  9 35              55 male   ($25k~ (0.25~    3.44   3.85       3.88  0.0153 
## 10 36             350 female ($25k~ [0h,0~    4.44  21.5       21.9   0.0360 
## # ... with 336 more rows, and 2 more variables: .sigma <dbl>, .cooksd <dbl>

Wrap the code into a function, groom_model(), that accepts model as its only argument.

# From previous step
groom_model <- function(model) {
  list(
    model = glance(model),
    coefficients = tidy(model),
    observations = augment(model)
  )
}

Call groom_model() on model, multi-assigning the result to three variables at once: mdl, cff, and obs.

# Call groom_model on model, assigning to 3 variables
#c(mdl, cff, obs) %<-% groom_model(model)
# See these individual variables
#mdl; cff; obs
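If zeallot is not available, plain list extraction achieves the same result, since groom_model() just returns a named list:

# Base-R alternative to multi-assignment
groomed <- groom_model(model)
mdl <- groomed$model
cff <- groomed$coefficients
obs <- groomed$observations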

Magnificent multi-assignment! Returning many values is as easy as collecting them into a list. The groomed model has data frames that are easy to program against.

Returning metadata

Sometimes you want to return multiple things from a function, but you want the result to have a particular class (for example, a data frame or a numeric vector), so returning a list isn't appropriate. This is common when you have a result plus metadata about the result. (Metadata is "data about the data". For example, it could be the file a dataset was loaded from, or the username of the person who created the variable, or the number of iterations for an algorithm to converge.)

In that case, you can store the metadata in attributes. Recall the syntax for assigning attributes is as follows.

attr(object, "attribute_name") <- attribute_value

EXERCISES:

Update pipeable_plot() so the result has an attribute named “formula” with the value of formula.

pipeable_plot <- function(data, formula) {
  plot(formula, data)
  # Add a "formula" attribute to data
  attr(data, "formula") <- formula
  invisible(data)
}

# From previous exercise
plt_dist_vs_speed <- cars %>% 
  pipeable_plot(dist ~ speed)

plt_dist_vs_speed, which you previously created, now carries the formula attribute. Examine its updated structure.

# Examine the structure of the result
str(plt_dist_vs_speed)
## 'data.frame':    6 obs. of  2 variables:
##  $ speed: int  4 4 7 7 8 9
##  $ dist : int  2 10 4 22 16 10
##  - attr(*, "formula")=Class 'formula'  language dist ~ speed
##   .. ..- attr(*, ".Environment")=<environment: 0x000000006525c5a8>
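The metadata can be pulled back out of the result at any time with attr():

# Retrieve the stored formula from the attribute
attr(plt_dist_vs_speed, "formula")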

Case study on grain yields

Converting areas to metric 1

In this chapter, you’ll be working with grain yield data from the United States Department of Agriculture, National Agricultural Statistics Service. Unfortunately, they report all areas in acres. So, the first thing you need to do is write some utility functions to convert areas in acres to areas in hectares.

To solve this exercise, you need to know the following:

There are 4840 square yards in an acre. There are 36 inches in a yard and one inch is 0.0254 meters. There are 10000 square meters in a hectare.

EXERCISES:

Write a function, acres_to_sq_yards(), to convert areas in acres to areas in square yards. This should take a single argument, acres.

# Write a function to convert acres to sq. yards
acres_to_sq_yards <- function(acres) {
  acres * 4840
}

Write a function, yards_to_meters(), to convert distances in yards to distances in meters. This should take a single argument, yards.

# Write a function to convert yards to meters
yards_to_meters <- function(yards) {
  (yards * 36) * 0.0254
}

Write a function, sq_meters_to_hectares(), to convert areas in square meters to areas in hectares. This should take a single argument, sq_meters.

# Write a function to convert sq. meters to hectares
sq_meters_to_hectares <- function(sq_meters) {
  sq_meters / 10000
}
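A few spot-checks against the facts listed above confirm the utility functions behave as intended:

# Sanity checks for the unit conversions
acres_to_sq_yards(1)          # 4840 square yards in one acre
yards_to_meters(1)            # 36 * 0.0254 = 0.9144 meters in one yard
sq_meters_to_hectares(10000)  # exactly 1 hectare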

Converting areas to metric 2

You’re almost there with creating a function to convert acres to hectares. You need another utility function to deal with getting from square yards to square meters. Then, you can bring everything together to write the overall acres-to-hectares conversion function. Finally, in the next exercise you’ll be calculating area conversions in the denominator of a ratio, so you’ll need a harmonic acre-to-hectare conversion function.

Free hints: magrittr’s raise_to_power() will be useful here. The last step is similar to Chapter 2’s Harmonic Mean.

The three utility functions from the last exercise (acres_to_sq_yards(), yards_to_meters(), and sq_meters_to_hectares()) are available, as is your get_reciprocal() from Chapter 2. magrittr is loaded.

EXERCISES:

Write a function to convert distance in square yards to square meters. It should take the square root of the input, then convert yards to meters, then square the result.

# Write a function to convert sq. yards to sq. meters
sq_yards_to_sq_meters <- function(sq_yards) {
  sq_yards %>%
    # Take the square root
    sqrt() %>%
    # Convert yards to meters
    yards_to_meters() %>%
    # Square it
    raise_to_power(2)
}

Write a function to convert areas in acres to hectares. The function should convert the input from acres to square yards, then to square meters, then to hectares.

# Write a function to convert acres to hectares
acres_to_hectares <- function(acres) {
  acres %>%
    # Convert acres to sq yards
    acres_to_sq_yards() %>%
    # Convert sq yards to sq meters
    sq_yards_to_sq_meters() %>%
    # Convert sq meters to hectares
    sq_meters_to_hectares()
}

Write a function to harmonically convert areas in acres to hectares. The function should get the reciprocal of the input, then convert from acres to hectares, then get the reciprocal again.

# reciprocal function
get_reciprocal <- function(x) {
 1/x
}

# Define a harmonic acres to hectares function
harmonic_acres_to_hectares <- function(acres) {
  acres %>% 
    # Get the reciprocal
    get_reciprocal() %>%
    # Convert acres to hectares
    acres_to_hectares() %>% 
    # Get the reciprocal again
    get_reciprocal()
}
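As a sanity check, converting a single acre should give roughly 0.405 hectares, the standard conversion factor:

# One acre is about 0.4047 hectares (4840 * 0.9144^2 / 10000)
acres_to_hectares(1)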

Amazing area conversion! By breaking down this conversion into lots of simple functions, you have easy to read code, which helps guard against bugs.

Converting yields to metric

The yields in the NASS corn data are also given in US units, namely bushels per acre. You’ll need to write some more utility functions to convert this unit to the metric unit of kg per hectare.

Bushels historically meant a volume of 8 gallons, but in the context of grain, they are now defined as masses. This mass differs for each grain! To solve this exercise, you need to know these facts.

One pound (lb) is 0.45359237 kilograms (kg). One bushel is 48 lbs of barley, 56 lbs of corn, or 60 lbs of wheat. magrittr is loaded.

EXERCISES:

Write a function to convert masses in lb to kg. This should take a single argument, lbs.

# Write a function to convert lb to kg
lbs_to_kgs <- function(lbs){
  lbs * 0.45359237
}

Write a function to convert masses in bushels to lbs. This should take two arguments, bushels and crop. It should define a lookup vector of scale factors for each crop (barley, corn, wheat), extract the scale factor for the crop, then multiply this by the number of bushels.

# Write a function to convert bushels to lbs
bushels_to_lbs <- function(bushels, crop) {
  # Define a lookup table of scale factors
  c(barley = 48, corn = 56, wheat = 60) %>%
    # Extract the value for the crop
    extract(crop) %>%
    # Multiply by the no. of bushels
    multiply_by(bushels)
}

Write a function to convert masses in bushels to kgs. This should take two arguments, bushels and crop. It should convert the mass in bushels to lbs then to kgs.

# Write a function to convert bushels to kg
bushels_to_kgs <- function(bushels, crop) {
  bushels %>%
    # Convert bushels to lbs for this crop
    bushels_to_lbs(crop) %>%
    # Convert lbs to kgs
    lbs_to_kgs()
}

Write a function to convert yields in bushels/acre to kg/ha. The arguments should be bushels_per_acre and crop. Three choices of crop should be allowed: “barley”, “corn”, and “wheat”. It should match the crop argument, then convert bushels to kgs, then convert harmonic acres to hectares.

bushels_per_acre_to_kgs_per_hectare <- function(bushels_per_acre, crop = c("barley", "corn", "wheat")) {
  # Match the crop argument
  crop <- match.arg(crop)
  bushels_per_acre %>%
    # Convert bushels to kgs for this crop
    bushels_to_kgs(crop) %>%
    # Convert harmonic acres to ha
    harmonic_acres_to_hectares()
}
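A quick check with one bushel of corn ties the pieces together: 56 lbs is about 25.4 kg, and dividing by roughly 0.405 hectares per acre gives about 62.8 kg/ha for each bushel/acre:

# Spot-check the full conversion chain for corn
bushels_to_kgs(1, "corn")                       # about 25.4 kg
bushels_per_acre_to_kgs_per_hectare(1, "corn")  # about 62.8 kg/ha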

Applying the unit conversion

Now that you've written some functions, it's time to apply them! The NASS corn dataset is available, and you can fortify it (jargon for "adding new columns") with metric areas and yields.

This fortification process can also be turned in to a function, so you’ll define a function for this, and test it on the NASS wheat dataset.

EXERCISES:

# Load the NASS grain datasets
corn <- readRDS("./Data/nass.corn.rds")
wheat <- readRDS("./Data/nass.wheat.rds")
barley <- readRDS("./Data/nass.barley.rds")
# View the corn dataset
glimpse(corn)
## Rows: 6,381
## Columns: 4
## $ year                   <int> 1866, 1866, 1866, 1866, 1866, 1866, 1866, 18...
## $ state                  <chr> "Alabama", "Arkansas", "California", "Connec...
## $ farmed_area_acres      <dbl> 1050000, 280000, 42000, 57000, 200000, 12500...
## $ yield_bushels_per_acre <dbl> 9.0, 18.0, 28.0, 34.0, 23.0, 9.0, 6.0, 29.0,...

Look at the corn dataset. Add two columns: farmed_area_ha should be farmed_area_acres converted to hectares; yield_kg_per_ha should be yield_bushels_per_acre converted to kilograms per hectare.

# corn %>%
#   # Add some columns
#   mutate(
#     # Convert farmed area from acres to ha
#     farmed_area_ha = acres_to_hectares(farmed_area_acres),
#     # Convert yield from bushels/acre to kg/ha
#     yield_kg_per_ha = bushels_per_acre_to_kgs_per_hectare(
#       yield_bushels_per_acre,
#       crop = "corn"
#     )
#   ) %>%
#   top_n(n = 10)

Wrap the mutation code into a function, fortify_with_metric_units. This should take two arguments, data and crop with no defaults. (In the function body, remember to swap corn for the data argument.)

# Wrap this code into a function
fortify_with_metric_units <- function(data, crop) {
  data %>%
    mutate(
      farmed_area_ha = acres_to_hectares(farmed_area_acres),
      yield_kg_per_ha = bushels_per_acre_to_kgs_per_hectare(
        yield_bushels_per_acre, 
        crop = crop
      )
    )
}

Use fortify_with_metric_units() on the wheat dataset.

# Try it on the wheat dataset
#fortify_with_metric_units(wheat, crop = "wheat") %>% top_n(n = 10)

Visualizing grain yields

Plotting yields over time

Now that the units have been dealt with, it’s time to explore the datasets. An obvious question to ask about each crop is, “how do the yields change over time in each US state?” Let’s draw a line plot to find out!

ggplot2 is loaded, and corn and wheat datasets are available with metric units.

EXERCISE:

Using the corn dataset, plot yield_kg_per_ha versus year. Add a line layer grouped by state and a smooth trend layer.

# Using corn, plot yield vs. year
# (the exercise asks for yield_kg_per_ha; since the fortification step above is
#  commented out, the unconverted yield_bushels_per_acre column is plotted here)
ggplot(corn, aes(x = year, y = yield_bushels_per_acre)) +
  # Add a line layer, grouped by state
  geom_line(aes(group = state)) +
  # Add a smooth trend layer
  geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Turn the plotting code into a function, plot_yield_vs_year(). This should accept a single argument, data.

# Wrap this plotting code into a function
plot_yield_vs_year <- function(data){
  ggplot(data, aes(year, yield_bushels_per_acre)) +
    geom_line(aes(group = state)) +
    geom_smooth()
}
# Test it on the wheat dataset
plot_yield_vs_year(wheat)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Running a model

The smooth trend line you saw in the plots of yield over time use a generalized additive model (GAM) to determine where the line should lie. This sort of model is ideal for fitting nonlinear curves. So we can make predictions about future yields, let’s explicitly run the model. The syntax for running this GAM takes the following form.

gam(response ~ s(explanatory_var1) + explanatory_var2, data = dataset)

Here, s() means “make the variable smooth”, where smooth very roughly means nonlinear.

mgcv and dplyr are loaded; the corn and wheat datasets are available.

EXERCISES.

Run a GAM of yield_kg_per_ha versus smoothed year and census region, using the corn dataset.

# yield vs. smoothed year and census region
# (not run here: yield_kg_per_ha and census_region are not in this copy of the corn data)
# gam(yield_kg_per_ha ~ s(year) + census_region, data = corn)
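Since census_region and the metric yield column are not available in this copy of the data, here is a minimal runnable sketch of the same idea using only the columns that are present, with a hypothetical prediction for a future year.

library(mgcv)
# Fit yield (still in bushels/acre here) as a smooth function of year
corn_model <- gam(yield_bushels_per_acre ~ s(year), data = corn)
# Predict the expected yield for a future year, e.g. 2050
predict(corn_model, newdata = data.frame(year = 2050))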

16 .- Correlation and Regression

Visualizing two variables

Scatterplots

Scatterplots are the most common and effective tools for visualizing the relationship between two numeric variables.

The ncbirths dataset is a random sample of 1,000 cases taken from a larger dataset collected in 2004. Each case describes the birth of a single child born in North Carolina, along with various characteristics of the child (e.g. birth weight, length of gestation, etc.), the child’s mother (e.g. age, weight gained during pregnancy, smoking habits, etc.) and the child’s father (e.g. age). You can view the help file for these data by running ?ncbirths in the console.

ncbirths <- read.csv("./Data/ncbirths.csv", sep = ",")

Exercises

Using the ncbirths dataset, make a scatterplot using ggplot() to illustrate how the birth weight of these babies varies according to the number of weeks of gestation.

glimpse(ncbirths)
## Rows: 1,000
## Columns: 13
## $ fage           <int> NA, NA, 19, 21, NA, NA, 18, 17, NA, 20, 30, NA, NA, ...
## $ mage           <int> 13, 14, 15, 15, 15, 15, 15, 15, 16, 16, 16, 16, 16, ...
## $ mature         <chr> "younger mom", "younger mom", "younger mom", "younge...
## $ weeks          <int> 39, 42, 37, 41, 39, 38, 37, 35, 38, 37, 45, 42, 40, ...
## $ premie         <chr> "full term", "full term", "full term", "full term", ...
## $ visits         <int> 10, 15, 11, 6, 9, 19, 12, 5, 9, 13, 9, 8, 4, 12, 15,...
## $ marital        <chr> "not married", "not married", "not married", "not ma...
## $ gained         <int> 38, 20, 38, 34, 27, 22, 76, 15, NA, 52, 28, 34, 12, ...
## $ weight         <dbl> 7.63, 7.88, 6.63, 8.00, 6.38, 5.38, 8.44, 4.69, 8.81...
## $ lowbirthweight <chr> "not low", "not low", "not low", "not low", "not low...
## $ gender         <chr> "male", "male", "female", "male", "female", "male", ...
## $ habit          <chr> "nonsmoker", "nonsmoker", "nonsmoker", "nonsmoker", ...
## $ whitemom       <chr> "not white", "not white", "white", "white", "not whi...
# Scatterplot of weight vs. weeks
ggplot(ncbirths, aes(weeks,weight)) + 
    geom_point()
## Warning: Removed 2 rows containing missing values (geom_point).

Boxplots as discretized/conditioned scatterplots

If it is helpful, you can think of boxplots as scatterplots for which the variable on the x-axis has been discretized.

The cut() function takes two arguments: the continuous variable you want to discretize and the number of breaks that you want to make in that continuous variable in order to discretize it.
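As a quick standalone illustration of cut() (a toy check, not part of the course exercise), you can tabulate the intervals it creates for the weeks variable:

# Which intervals does cut() create, and how many babies fall into each?
table(cut(ncbirths$weeks, breaks = 5))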

EXERCISES:

Using the ncbirths dataset again, make a boxplot illustrating how the birth weight of these babies varies according to the number of weeks of gestation. This time, use the cut() function to discretize the x-variable into six intervals (i.e. five breaks).

# Boxplot of weight vs. weeks
ggplot(data = ncbirths, 
       aes(x = cut(weeks, breaks = 5), y = weight)) + 
  geom_boxplot()

Characterizing bivariate relationships

Creating scatterplots

Creating scatterplots is simple and they are so useful that it is worthwhile to expose yourself to many examples. Over time, you will gain familiarity with the types of patterns that you see. You will begin to recognize how scatterplots can reveal the nature of the relationship between two variables.

In this exercise, and throughout this chapter, we will be using several datasets listed below. These data are available through the openintro package. Briefly:

The mammals dataset contains information about 62 different species of mammals, including their body weight, brain weight, gestation time, and a few other variables.

The mlbBat10 dataset contains batting statistics for 1,199 Major League Baseball players during the 2010 season.

The bdims dataset contains body girth and skeletal diameter measurements for 507 physically active individuals.

The smoking dataset contains information on the smoking habits of 1,691 citizens of the United Kingdom.

To see more thorough documentation, use the ? or help() functions.

EXERCISES.

Using the mammals dataset, create a scatterplot illustrating how the brain weight of a mammal varies as a function of its body weight.

# Mammals scatterplot
library(openintro)
## Warning: package 'openintro' was built under R version 4.0.3
## Loading required package: airports
## Warning: package 'airports' was built under R version 4.0.3
## Loading required package: cherryblossom
## Warning: package 'cherryblossom' was built under R version 4.0.3
## Loading required package: usdata
## Warning: package 'usdata' was built under R version 4.0.3
## 
## Attaching package: 'openintro'
## The following object is masked _by_ '.GlobalEnv':
## 
##     ncbirths
## The following objects are masked from 'package:lattice':
## 
##     ethanol, lsegments
## The following objects are masked from 'package:MASS':
## 
##     housing, mammals
head(mammals)
## # A tibble: 6 x 11
##   species body_wt brain_wt non_dreaming dreaming total_sleep life_span gestation
##   <fct>     <dbl>    <dbl>        <dbl>    <dbl>       <dbl>     <dbl>     <dbl>
## 1 Africa~ 6654      5712           NA       NA           3.3      38.6       645
## 2 Africa~    1         6.6          6.3      2           8.3       4.5        42
## 3 Arctic~    3.38     44.5         NA       NA          12.5      14          60
## 4 Arctic~    0.92      5.7         NA       NA          16.5      NA          25
## 5 Asiane~ 2547      4603            2.1      1.8         3.9      69         624
## 6 Baboon    10.6     180.           9.1      0.7         9.8      27         180
## # ... with 3 more variables: predation <int>, exposure <int>, danger <int>
names(mammals)
##  [1] "species"      "body_wt"      "brain_wt"     "non_dreaming" "dreaming"    
##  [6] "total_sleep"  "life_span"    "gestation"    "predation"    "exposure"    
## [11] "danger"
ggplot(mammals, aes(body_wt, brain_wt)) + 
    geom_point()

Using the mlbBat10 dataset, create a scatterplot illustrating how the slugging percentage (SLG) of a player varies as a function of his on-base percentage (OBP).

mlbbat10 <- read.csv("./Data/mlbbat10.csv", sep = ",")
# Baseball player scatterplot
head(mlbbat10)
##        name team position game at_bat run hit double triple home_run rbi
## 1  I Suzuki  SEA       OF  162    680  74 214     30      3        6  43
## 2   D Jeter  NYY       SS  157    663 111 179     30      3       10  67
## 3   M Young  TEX       3B  157    656  99 186     36      3       21  91
## 4  J Pierre  CWS       OF  160    651  96 179     18      3        1  47
## 5   R Weeks  MIL       2B  160    651 112 175     32      4       29  83
## 6 M Scutaro  BOS       SS  150    632  92 174     38      0       11  56
##   total_base walk strike_out stolen_base caught_stealing   obp   slg bat_avg
## 1        268   45         86          42               9 0.359 0.394   0.315
## 2        245   63        106          18               5 0.340 0.370   0.270
## 3        291   50        115           4               2 0.330 0.444   0.284
## 4        206   45         47          68              18 0.341 0.316   0.275
## 5        302   76        184          11               4 0.366 0.464   0.269
## 6        245   53         71           5               4 0.333 0.388   0.275
colnames(mlbbat10)[5] <- "AB"
colnames(mlbbat10)[17] <- "OBP"
colnames(mlbbat10)[18] <- "SLG"
names(mlbbat10)
##  [1] "name"            "team"            "position"        "game"           
##  [5] "AB"              "run"             "hit"             "double"         
##  [9] "triple"          "home_run"        "rbi"             "total_base"     
## [13] "walk"            "strike_out"      "stolen_base"     "caught_stealing"
## [17] "OBP"             "SLG"             "bat_avg"
mlbbat10 %>%
  ggplot(aes(OBP, SLG)) +
  geom_point()

Using the bdims dataset, create a scatterplot illustrating how a person’s weight varies as a function of their height. Use color to separate by sex, which you’ll need to coerce to a factor with factor().

bdims <- read.csv("./Data/bdims.csv", sep = ",")
# Body dimensions scatterplot
ggplot(data = bdims, aes(x = hgt, y = wgt, color = factor(sex))) +
  geom_point()

Using the smoking dataset, create a scatterplot illustrating how the amount that a person smokes on weekdays varies as a function of their age.

smoking <- read.csv("./Data/smoking.csv", sep = ",")
names(smoking)
##  [1] "gender"                "age"                   "marital_status"       
##  [4] "highest_qualification" "nationality"           "ethnicity"            
##  [7] "gross_income"          "region"                "smoke"                
## [10] "amt_weekends"          "amt_weekdays"          "type"
# Smoking scatterplot
ggplot(data = smoking, aes(x = age, y = amt_weekdays)) +
  geom_point()
## Warning: Removed 1270 rows containing missing values (geom_point).

Transformations

The relationship between two variables may not be linear. In these cases we can sometimes see strange and even inscrutable patterns in a scatterplot of the data. Sometimes there really is no meaningful relationship between the two variables. Other times, a careful transformation of one or both of the variables can reveal a clear relationship.

Recall the bizarre pattern that you saw in the scatterplot between brain weight and body weight among mammals in a previous exercise. Can we use transformations to clarify this relationship?

ggplot2 provides several different mechanisms for viewing transformed relationships. The coord_trans() function transforms the coordinates of the plot. Alternatively, the scale_x_log10() and scale_y_log10() functions perform a base-10 log transformation of each axis. Note the differences in the appearance of the axes.

The mammals dataset is available in your workspace.

EXERCISES:

Use coord_trans() to create a scatterplot showing how a mammal’s brain weight varies as a function of its body weight, where both the x and y axes are on a “log10” scale.

colnames(mammals)[2] <- "BodyWt"
colnames(mammals)[3] <- "brain_wt"
names(mammals)
##  [1] "species"      "BodyWt"       "brain_wt"     "non_dreaming" "dreaming"    
##  [6] "total_sleep"  "life_span"    "gestation"    "predation"    "exposure"    
## [11] "danger"
# Scatterplot with coord_trans()
ggplot(data = mammals, aes(x = BodyWt, y = brain_wt)) +
  geom_point() + 
  coord_trans(x = "log10", y = "log10")

Use scale_x_log10() and scale_y_log10() to achieve the same effect but with different axis labels and grid lines.

names(mammals)
##  [1] "species"      "BodyWt"       "brain_wt"     "non_dreaming" "dreaming"    
##  [6] "total_sleep"  "life_span"    "gestation"    "predation"    "exposure"    
## [11] "danger"
colnames(mammals)[2] <- "BodyWt"
colnames(mammals)[3] <- "brain_wt"
# Scatterplot with scale_x_log10() and scale_y_log10()
ggplot(data = mammals, aes(x = BodyWt, y = brain_wt)) +
  geom_point() +
  scale_x_log10() + 
  scale_y_log10()

Outliers

Identifying outliers

In Chapter 5, we will discuss how outliers can affect the results of a linear regression model and how we can deal with them. For now, it is enough to simply identify them and note how the relationship between two variables may change as a result of removing outliers.

Recall that in the baseball example earlier in the chapter, most of the points were clustered in the lower left corner of the plot, making it difficult to see the general pattern of the majority of the data. This difficulty was caused by a few outlying players whose on-base percentages (OBPs) were exceptionally high. These values are present in our dataset only because these players had very few batting opportunities.

Both OBP and SLG are known as rate statistics, since they measure the frequency of certain events (as opposed to their count). In order to compare these rates sensibly, it makes sense to include only players with a reasonable number of opportunities, so that these observed rates have the chance to approach their long-run frequencies.

In Major League Baseball, batters qualify for the batting title only if they have 3.1 plate appearances per game. This translates into roughly 502 plate appearances in a 162-game season. The mlbBat10 dataset does not include plate appearances as a variable, but we can use at-bats (AB) – which constitute a subset of plate appearances – as a proxy.

EXERCISES:

# Scatterplot of SLG vs. OBP
str(mlbbat10)
## 'data.frame':    1199 obs. of  19 variables:
##  $ name           : chr  "I Suzuki" "D Jeter" "M Young" "J Pierre" ...
##  $ team           : chr  "SEA" "NYY" "TEX" "CWS" ...
##  $ position       : chr  "OF" "SS" "3B" "OF" ...
##  $ game           : int  162 157 157 160 160 150 160 153 160 155 ...
##  $ AB             : int  680 663 656 651 651 632 629 629 626 626 ...
##  $ run            : int  74 111 99 96 112 92 79 85 103 100 ...
##  $ hit            : int  214 179 186 179 175 174 187 166 200 172 ...
##  $ double         : int  30 30 36 18 32 38 45 24 41 33 ...
##  $ triple         : int  3 3 3 3 4 0 3 10 3 5 ...
##  $ home_run       : int  6 10 21 1 29 11 12 3 29 18 ...
##  $ rbi            : int  43 67 91 47 83 56 60 58 109 59 ...
##  $ total_base     : int  268 245 291 206 302 245 274 219 334 269 ...
##  $ walk           : int  45 63 50 45 76 53 73 60 57 46 ...
##  $ strike_out     : int  86 106 115 47 184 71 93 74 77 83 ...
##  $ stolen_base    : int  42 18 4 68 11 5 7 26 3 16 ...
##  $ caught_stealing: int  9 5 2 18 4 4 2 4 2 12 ...
##  $ OBP            : num  0.359 0.34 0.33 0.341 0.366 0.333 0.37 0.331 0.381 0.332 ...
##  $ SLG            : num  0.394 0.37 0.444 0.316 0.464 0.388 0.436 0.348 0.534 0.43 ...
##  $ bat_avg        : num  0.315 0.27 0.284 0.275 0.269 0.275 0.297 0.264 0.319 0.275 ...

Use filter() to keep only players who had at least 200 at-bats, assigning to ab_gt_200.

names(mlbbat10)
##  [1] "name"            "team"            "position"        "game"           
##  [5] "AB"              "run"             "hit"             "double"         
##  [9] "triple"          "home_run"        "rbi"             "total_base"     
## [13] "walk"            "strike_out"      "stolen_base"     "caught_stealing"
## [17] "OBP"             "SLG"             "bat_avg"
# Filter for players with at least 200 at-bats
ab_gt_200 <- mlbbat10 %>%
  filter(AB >= 200)

Using ab_gt_200, create a scatterplot for SLG as a function of OBP.

# Scatterplot of SLG vs. OBP
ggplot(ab_gt_200, aes(x = OBP, y = SLG)) +
  geom_point()

Find the row of ab_gt_200 corresponding to the one player (with at least 200 at-bats) whose OBP was below 0.200.

# Filter for OBP below 0.200
ab_gt_200 %>%
  filter(OBP < 0.2)
##     name team position game  AB run hit double triple home_run rbi total_base
## 1 B Wood  LAA       3B   81 226  20  33      2      0        4  14         47
##   walk strike_out stolen_base caught_stealing   OBP   SLG bat_avg
## 1    6         71           1               0 0.174 0.208   0.146

Correlation

Quantifying the strength of bivariate relationships

The cor(x, y) function will compute the Pearson product-moment correlation between variables, x and y. Since this quantity is symmetric with respect to x and y, it doesn’t matter in which order you put the variables.

At the same time, the cor() function is very conservative when it encounters missing data (e.g. NAs). The use argument allows you to override the default behavior of returning NA whenever any of the values encountered is NA. Setting the use argument to “pairwise.complete.obs” allows cor() to compute the correlation coefficient for those observations where the values of x and y are both not missing.

EXERCISES:

Use cor() to compute the correlation between the birthweight of babies in the ncbirths dataset and their mother’s age. There is no missing data in either variable.

# Compute correlation
ncbirths %>%
  summarize(N = n(), r = cor(mage, weight))
##      N          r
## 1 1000 0.05506589

Compute the correlation between the birthweight and the number of weeks of gestation for all non-missing pairs.

# Compute correlation for all non-missing pairs
ncbirths %>%
  summarize(N = n(), r = cor(weeks, weight, use = "pairwise.complete.obs"))
##      N         r
## 1 1000 0.6701013

The Anscombe dataset

Anscombe’s quartet consists of four datasets with nearly identical summary statistics but very different patterns. This is an example of why we need to be careful when looking only at summary statistics.

Anscombe <- read.csv("./Data/anscombes.csv", sep = ",")
ggplot(Anscombe, aes(x,y)) + 
  geom_point() + 
  facet_wrap(~ dataset)

# Compute properties of Anscombe
head(Anscombe)
##   id dataset  x    y
## 1  0       I 10 8.04
## 2  1       I  8 6.95
## 3  2       I 13 7.58
## 4  3       I  9 8.81
## 5  4       I 11 8.33
## 6  5       I 14 9.96
Anscombe %>%
  group_by(dataset) %>%
  summarize(
    N = n(), 
    mean_of_x = mean(x), 
    std_dev_of_x = sd(x), 
    mean_of_y = mean(y), 
    std_dev_of_y = sd(y), 
    correlation_between_x_and_y = cor(x,y)
  )
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 4 x 7
##   dataset     N mean_of_x std_dev_of_x mean_of_y std_dev_of_y correlation_betwe~
##   <chr>   <int>     <dbl>        <dbl>     <dbl>        <dbl>              <dbl>
## 1 I          11         9         3.32      7.50         2.03              0.816
## 2 II         11         9         3.32      7.50         2.03              0.816
## 3 III        11         9         3.32      7.5          2.03              0.816
## 4 IV         11         9         3.32      7.50         2.03              0.817

Perception of correlation (2)

Estimating the value of the correlation coefficient between two quantities from their scatterplot can be tricky. Statisticians have shown that people’s perception of the strength of these relationships can be influenced by design choices like the x and y scales.

Nevertheless, with some practice your perception of correlation will improve. Toggle through the four scatterplots in the plotting window, each of which you’ve seen in a previous exercise. Jot down your best estimate of the value of the correlation coefficient between each pair of variables. Then, compare these values to the actual values you compute in this exercise.

If you’re having trouble recalling variable names, it may help to preview a dataset in the console with str() or glimpse().

colnames(mlbbat10)[5] <- "AB"
colnames(mlbbat10)[17] <- "OBP"
colnames(mlbbat10)[18] <- "SLG"

EXERCISES.

Draw the plot then calculate the correlation between OBP and SLG for all players in the mlbBat10 dataset.

# Run this and look at the plot
ggplot(data = mlbbat10, aes(x = OBP, y = SLG)) +
  geom_point()

# Correlation for all baseball players
mlbbat10 %>%
  summarize(N = n(), r = cor(OBP, SLG))
##      N         r
## 1 1199 0.8145628

Draw the plot then calculate the correlation between OBP and SLG for all players in the mlbBat10 dataset with at least 200 at-bats.

# Run this and look at the plot
mlbbat10 %>% 
    filter(AB >= 200) %>%
    ggplot(aes(x = OBP, y = SLG)) + 
    geom_point()

# Correlation for all players with at least 200 ABs
mlbbat10 %>%
  filter(AB >= 200) %>%
  summarize(N = n(), r = cor(OBP, SLG))
##     N         r
## 1 329 0.6855364

Draw the plot then calculate the correlation between height and weight for each sex in the bdims dataset.

# Run this and look at the plot
ggplot(data = bdims, aes(x = hgt, y = wgt, color = factor(sex))) +
  geom_point() 

# Correlation of body dimensions
head(bdims)
##   bia_di bii_di bit_di che_de che_di elb_di wri_di kne_di ank_di sho_gi che_gi
## 1   42.9   26.0   31.5   17.7   28.0   13.1   10.4   18.8   14.1  106.2   89.5
## 2   43.7   28.5   33.5   16.9   30.8   14.0   11.8   20.6   15.1  110.5   97.0
## 3   40.1   28.2   33.3   20.9   31.7   13.9   10.9   19.7   14.1  115.1   97.5
## 4   44.3   29.9   34.0   18.4   28.2   13.9   11.2   20.9   15.0  104.5   97.0
## 5   42.5   29.9   34.0   21.5   29.4   15.2   11.6   20.7   14.9  107.5   97.5
## 6   43.3   27.0   31.5   19.6   31.3   14.0   11.5   18.8   13.9  119.8   99.9
##   wai_gi nav_gi hip_gi thi_gi bic_gi for_gi kne_gi cal_gi ank_gi wri_gi age
## 1   71.5   74.5   93.5   51.5   32.5   26.0   34.5   36.5   23.5   16.5  21
## 2   79.0   86.5   94.8   51.5   34.4   28.0   36.5   37.5   24.5   17.0  23
## 3   83.2   82.9   95.0   57.3   33.4   28.8   37.0   37.3   21.9   16.9  28
## 4   77.8   78.8   94.0   53.0   31.0   26.2   37.0   34.8   23.0   16.6  23
## 5   80.0   82.5   98.5   55.4   32.0   28.4   37.7   38.6   24.4   18.0  22
## 6   82.5   80.1   95.3   57.5   33.0   28.0   36.6   36.1   23.5   16.9  21
##    wgt   hgt sex
## 1 65.6 174.0   1
## 2 71.8 175.3   1
## 3 80.7 193.5   1
## 4 72.6 186.5   1
## 5 78.8 187.2   1
## 6 74.8 181.5   1
bdims %>%
  group_by(sex) %>%
  summarize(N = n(), r = cor(wgt, hgt))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 3
##     sex     N     r
##   <int> <int> <dbl>
## 1     0   260 0.431
## 2     1   247 0.535

Draw the plot then calculate the correlation between body weight and brain weight for all species of mammals. Alongside this computation, compute the correlation between the same two quantities after taking their natural logarithms.

colnames(mammals)[2] <- "BodyWt"
colnames(mammals)[3] <- "brain_wt"
# Run this and look at the plot
ggplot(data = mammals, aes(x = BodyWt, y = brain_wt)) +
  geom_point() + scale_x_log10() + scale_y_log10()

colnames(mammals)[2] <- "BodyWt"
colnames(mammals)[3] <- "brain_wt"
# Correlation among mammals, with and without log

mammals %>%
  summarize(N = n(), 
            r = cor(BodyWt, brain_wt), 
            r_log = cor(log(BodyWt), log(brain_wt)))
## # A tibble: 1 x 3
##       N     r r_log
##   <int> <dbl> <dbl>
## 1    62 0.934 0.960

Spurious correlations

Spurious correlation in random data

Statisticians must always be skeptical of potentially spurious correlations. Human beings are very good at seeing patterns in data, sometimes when the patterns themselves are actually just random noise. To illustrate how easy it can be to fall into this trap, we will look for patterns in truly random data.

The noise dataset contains 20 sets of x and y variables drawn at random from a standard normal distribution. Each set, denoted as z, has 50 observations of x, y pairs. Do you see any pairs of variables that might be meaningfully correlated? Are all of the correlation coefficients close to zero?

EXERCISES:

Create a faceted scatterplot that shows the relationship between each of the 20 sets of pairs of random variables x and y. You will need the facet_wrap() function for this.

noise <- read.csv("./Data/noise.csv", sep = ",")
head(noise)
##   X          x          y
## 1 1 -1.2070657 -1.2053334
## 2 2  0.2774292  0.3014667
## 3 3  1.0844412 -1.5391452
## 4 4 -2.3456977  0.6353707
## 5 5  0.4291247  0.7029518
## 6 6  0.5060559 -1.9058829
# Create faceted scatterplot
noise %>%
ggplot(aes(x,y)) +
geom_point()+
facet_wrap(~X)

Compute the actual correlation between each of the 20 sets of pairs of x and y.

# Compute correlations for each dataset
noise_summary <- noise %>%
  group_by(X) %>%
  summarize(N = n(), spurious_cor = cor(x, y))
## `summarise()` ungrouping output (override with `.groups` argument)

Identify the datasets that show non-trivial correlation of greater than 0.2 in absolute value.

# Isolate sets with correlations above 0.2 in absolute strength
noise_summary %>%
  filter(abs(spurious_cor) > 0.2)
## # A tibble: 0 x 3
## # ... with 3 variables: X <int>, N <int>, spurious_cor <dbl>

Simple linear regression

Visualization of Linear Models

The “best fit” line

The simple linear regression model for a numeric response as a function of a numeric explanatory variable can be visualized on the corresponding scatterplot by a straight line. This is a “best fit” line that cuts through the data in a way that minimizes the distance between the line and the data points.

We might consider linear regression to be a specific example of a larger class of smooth models. The geom_smooth() function allows you to draw such models over a scatterplot of the data itself. This technique is known as visualizing the model in the data space. The method argument to geom_smooth() allows you to specify what class of smooth model you want to see. Since we are exploring linear models, we’ll set this argument to the value “lm”.

Note that geom_smooth() also takes an se argument that controls whether the standard error (confidence) band is displayed; we will ignore it for now.

EXERCISE:

Create a scatterplot of body weight as a function of height for all individuals in the bdims dataset with a simple linear model plotted over the data.

# Scatterplot with regression line
head(bdims)
##   bia_di bii_di bit_di che_de che_di elb_di wri_di kne_di ank_di sho_gi che_gi
## 1   42.9   26.0   31.5   17.7   28.0   13.1   10.4   18.8   14.1  106.2   89.5
## 2   43.7   28.5   33.5   16.9   30.8   14.0   11.8   20.6   15.1  110.5   97.0
## 3   40.1   28.2   33.3   20.9   31.7   13.9   10.9   19.7   14.1  115.1   97.5
## 4   44.3   29.9   34.0   18.4   28.2   13.9   11.2   20.9   15.0  104.5   97.0
## 5   42.5   29.9   34.0   21.5   29.4   15.2   11.6   20.7   14.9  107.5   97.5
## 6   43.3   27.0   31.5   19.6   31.3   14.0   11.5   18.8   13.9  119.8   99.9
##   wai_gi nav_gi hip_gi thi_gi bic_gi for_gi kne_gi cal_gi ank_gi wri_gi age
## 1   71.5   74.5   93.5   51.5   32.5   26.0   34.5   36.5   23.5   16.5  21
## 2   79.0   86.5   94.8   51.5   34.4   28.0   36.5   37.5   24.5   17.0  23
## 3   83.2   82.9   95.0   57.3   33.4   28.8   37.0   37.3   21.9   16.9  28
## 4   77.8   78.8   94.0   53.0   31.0   26.2   37.0   34.8   23.0   16.6  23
## 5   80.0   82.5   98.5   55.4   32.0   28.4   37.7   38.6   24.4   18.0  22
## 6   82.5   80.1   95.3   57.5   33.0   28.0   36.6   36.1   23.5   16.9  21
##    wgt   hgt sex
## 1 65.6 174.0   1
## 2 71.8 175.3   1
## 3 80.7 193.5   1
## 4 72.6 186.5   1
## 5 78.8 187.2   1
## 6 74.8 181.5   1
ggplot(data = bdims, aes(x = hgt, y = wgt)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'

Uniqueness of least squares regression line

The least squares criterion implies that the slope of the regression line is unique. In practice, the slope is computed by R. In this exercise, you will experiment with trying to find the optimal value for the regression slope for weight as a function of height in the bdims dataset via trial-and-error.

To help, we’ve built a custom function for you called add_line(), which takes a single argument: the proposed slope coefficient.

EXERCISES:

The bdims dataset is available in your workspace. Experiment with different values (to the nearest integer) of the my_slope parameter until you find one that you think fits best.

# Estimate optimal value of my_slope
#add_line(my_slope = 1)
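add_line() is a helper provided in the course environment and is not defined in this document. A minimal sketch of what such a helper might look like is below; it assumes the helper draws the proposed slope as a line through the point of means (mean(hgt), mean(wgt)), which is one reasonable way to anchor a candidate line to the data.

# Hypothetical re-implementation of the course's add_line() helper
add_line <- function(my_slope) {
  mean_hgt <- mean(bdims$hgt)
  mean_wgt <- mean(bdims$wgt)
  ggplot(bdims, aes(x = hgt, y = wgt)) +
    geom_point() +
    geom_abline(intercept = mean_wgt - my_slope * mean_hgt,
                slope = my_slope, color = "dodgerblue")
}

# Try a few candidate slopes
add_line(my_slope = 1)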

Understanding Linear Models

Fitting a linear model “by hand”

Recall the simple linear regression model:

y = b_0 + b_1 * x + e

Two facts enable you to compute the slope and intercept of a simple linear regression model from some basic summary statistics.

First, the slope can be defined as:

b_1 = r * (s_y / s_x)

where r represents the correlation (cor()) of x and y, and s_x and s_y represent the standard deviations (sd()) of x and y, respectively.

Second, the point (mean(x), mean(y)) is always on the least squares regression line, where mean(x) and mean(y) denote the averages of x and y, respectively.

The bdims_summary data frame contains all of the information you need to compute the slope and intercept of the least squares regression line for body weight (y) as a function of height (x). You might need to do some algebra to solve for b_0!

EXERCISES:

Print the bdims_summary data frame.

# Print bdims_summary
bdims_summary <- bdims %>% 
  summarize(
    N = n(), 
    r = cor(hgt, wgt), 
    mean_hgt = mean(hgt), 
    mean_wgt = mean(wgt), 
    sd_hgt = sd(hgt), 
    sd_wgt = sd(wgt)
    )

Use mutate() to add the slope and intercept to the bdims_summary data frame.

# Add slope and intercept
bdims_summary %>%
  mutate(
    slope = r * sd_wgt/sd_hgt, 
    intercept = mean_wgt - (slope * mean_hgt)
    )
##     N         r mean_hgt mean_wgt   sd_hgt   sd_wgt    slope intercept
## 1 507 0.7173011 171.1438 69.14753 9.407205 13.34576 1.017617 -105.0113

Interpretation of regression coefficients

Fitting simple linear models

While the geom_smooth(method = “lm”) function is useful for drawing linear models on a scatterplot, it doesn’t actually return the characteristics of the model. As suggested by that syntax, however, the function that creates linear models is lm(). This function generally takes two arguments:

A formula that specifies the model.
A data argument for the data frame that contains the data you want to use to fit the model.

The lm() function returns a model object having class “lm”. This object contains lots of information about your regression model, including the data used to fit the model, the specification of the model, the fitted values and residuals, etc.

EXERCISES:

Using the bdims dataset, create a linear model for the weight of people as a function of their height.

# Linear model for weight as a function of height
lm(wgt ~ hgt, data = bdims)
## 
## Call:
## lm(formula = wgt ~ hgt, data = bdims)
## 
## Coefficients:
## (Intercept)          hgt  
##    -105.011        1.018

Using the mlbBat10 dataset, create a linear model for SLG as a function of OBP.

# Linear model for SLG as a function of OBP
lm(SLG ~ OBP, data = mlbbat10)
## 
## Call:
## lm(formula = SLG ~ OBP, data = mlbbat10)
## 
## Coefficients:
## (Intercept)          OBP  
##    0.009407     1.110323

Using the mammals dataset, create a linear model for the body weight of mammals as a function of their brain weight, after taking the natural log of both variables.

# Log-linear model for body weight as a function of brain weight
lm(log(BodyWt) ~ log(brain_wt), data = mammals)
## 
## Call:
## lm(formula = log(BodyWt) ~ log(brain_wt), data = mammals)
## 
## Coefficients:
##   (Intercept)  log(brain_wt)  
##        -2.509          1.225

Your linear model object

An “lm” object contains a host of information about the regression model that you fit. There are various ways of extracting different pieces of information.

The coef() function displays only the values of the coefficients. Conversely, the summary() function displays not only that information, but a bunch of other information, including the associated standard error and p-value for each coefficient, the R^2, adjusted R^2, and the residual standard error. The summary of an “lm” object in R is very similar to the output you would see in other statistical computing environments (e.g. Stata, SPSS, etc.)

EXERCISES:

We have already created the mod object, a linear model for the weight of individuals as a function of their height, using the bdims dataset and the code

mod <- lm(wgt ~ hgt, data = bdims)

Now, you will:

Use coef() to display the coefficients of mod.

mod <- lm(wgt ~ hgt, data = bdims)

# Show the coefficients
coef(mod)
## (Intercept)         hgt 
## -105.011254    1.017617

Use summary() to display the full regression output of mod.

# Show the full output
summary(mod)
## 
## Call:
## lm(formula = wgt ~ hgt, data = bdims)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.743  -6.402  -1.231   5.059  41.103 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -105.01125    7.53941  -13.93   <2e-16 ***
## hgt            1.01762    0.04399   23.14   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.308 on 505 degrees of freedom
## Multiple R-squared:  0.5145, Adjusted R-squared:  0.5136 
## F-statistic: 535.2 on 1 and 505 DF,  p-value: < 2.2e-16

Fitted values and residuals

Once you have fit a regression model, you are often interested in the fitted values (y_hat_i) and the residuals (e_i), where i indexes the observations. Recall that:

y_hat_i = b_0 + b_1 * x_i
e_i = y_i - y_hat_i

The least squares fitting procedure guarantees that the mean of the residuals is zero (n.b., numerical instability may result in the computed values not being exactly zero). At the same time, the mean of the fitted values must equal the mean of the response variable.

In this exercise, we will confirm these two mathematical facts by accessing the fitted values and residuals with the fitted.values() and residuals() functions, respectively, for the following model:

mod <- lm(wgt ~ hgt, data = bdims)

EXERCISES:

mod (defined above) is available in your workspace.

Confirm that the mean of the body weights equals the mean of the fitted values of mod.

# Mean of weights equal to mean of fitted values?
mean(fitted.values(mod)) == mean(bdims$wgt)
## [1] TRUE

Compute the mean of the residuals of mod.

# Mean of the residuals
mean(residuals(mod))
## [1] -1.266971e-15

Tidying your linear model

As you fit a regression model, there are some quantities (e.g. R^2) that apply to the model as a whole, while others apply to each observation (e.g. the fitted values and residuals). If there are several of these per-observation quantities, it is sometimes convenient to attach them to the original data as new variables.

The augment() function from the broom package does exactly this. It takes a model object as an argument and returns a data frame that contains the data on which the model was fit, along with several quantities specific to the regression model, including the fitted values, residuals, leverage scores, and standardized residuals.

EXERCISES:

The same linear model from the last exercise, mod, is available in your workspace.

Load the broom package.

# Load broom
library(broom)

Create a new data frame called bdims_tidy that is the augmentation of the mod linear model.

# Create bdims_tidy
bdims_tidy <- augment(mod)

View the bdims_tidy data frame using glimpse().

# Glimpse the resulting data frame
glimpse(bdims_tidy)
## Rows: 507
## Columns: 8
## $ wgt        <dbl> 65.6, 71.8, 80.7, 72.6, 78.8, 74.8, 86.4, 78.4, 62.0, 81...
## $ hgt        <dbl> 174.0, 175.3, 193.5, 186.5, 187.2, 181.5, 184.0, 184.5, ...
## $ .fitted    <dbl> 72.05406, 73.37697, 91.89759, 84.77427, 85.48661, 79.686...
## $ .resid     <dbl> -6.4540648, -1.5769666, -11.1975919, -12.1742745, -6.686...
## $ .std.resid <dbl> -0.69413418, -0.16961994, -1.21098084, -1.31269063, -0.7...
## $ .hat       <dbl> 0.002154570, 0.002358152, 0.013133942, 0.007238576, 0.00...
## $ .sigma     <dbl> 9.312824, 9.317005, 9.303732, 9.301360, 9.312471, 9.3147...
## $ .cooksd    <dbl> 5.201807e-04, 3.400330e-05, 9.758463e-03, 6.282074e-03, ...

Using your linear model

Making predictions

The fitted.values() function or the augment()-ed data frame provides us with the fitted values for the observations that were in the original data. However, once we have fit the model, we may want to compute expected values for observations that were not present in the data on which the model was fit. These types of predictions are called out-of-sample.

The ben data frame contains a height and weight observation for one person. The mod object contains the fitted model for weight as a function of height for the observations in the bdims dataset. We can use the predict() function to generate expected values for the weight of new individuals. We must pass the data frame of new observations through the newdata argument.

EXERCISES.

The same linear model, mod, is defined in your workspace.

Print ben to the console.

# Print ben
ben <- data.frame(wgt = 74.8, hgt = 182.8)
ben
##    wgt   hgt
## 1 74.8 182.8

Use predict() with the newdata argument to compute the expected weight of the individual in the ben data frame.

mod <- lm(wgt ~ hgt, data = bdims)
 
# Predict the weight of ben
predict(mod, newdata = ben)
##        1 
## 81.00909

Adding a regression line to a plot manually

The geom_smooth() function makes it easy to add a simple linear regression line to a scatterplot of the corresponding variables. And in fact, there are more complicated regression models that can be visualized in the data space with geom_smooth(). However, there may still be times when we will want to add regression lines to our scatterplot manually. To do this, we will use the geom_abline() function, which takes slope and intercept arguments. Naturally, we have to compute those values ahead of time, but we already saw how to do this (e.g. using coef()).

The coefs object contains the model estimates retrieved from coef(). Passing these values to the intercept and slope arguments of geom_abline() will enable you to draw the corresponding straight line on your scatterplot.

EXERCISES:

Use geom_abline() to add a line defined in the coefs data frame to a scatterplot of weight vs. height for individuals in the bdims dataset.

coefs <- coef(mod)
# Look at the coefficients
coefs
## (Intercept)         hgt 
## -105.011254    1.017617
# Add the line to the scatterplot
ggplot(data = bdims, aes(x = hgt, y = wgt)) + 
  geom_point() + 
  geom_abline(intercept = coefs["(Intercept)"], slope = coefs["hgt"],
              color = "dodgerblue")

Model Fit

Assessing Model Fit

Standard error of residuals (RMSE)

One way to assess strength of fit is to consider how far off the model is for a typical case. That is, for some observations, the fitted value will be very close to the actual value, while for others it will not. The magnitude of a typical residual can give us a sense of generally how close our estimates are.

However, recall that some of the residuals are positive, while others are negative. In fact, it is guaranteed by the least squares fitting procedure that the mean of the residuals is zero. Thus, it makes more sense to compute the square root of the mean squared residual, or root mean squared error (RMSE). R calls this quantity the residual standard error.

To make this estimate unbiased, you have to divide the sum of the squared residuals by the degrees of freedom in the model. Thus,

RMSE = sqrt( sum(e_i^2) / d.f. ) = sqrt( SSE / d.f. )

You can recover the residuals from mod with residuals(), and the degrees of freedom with df.residual().

EXERCISES.

View a summary() of mod.

# View summary of model
summary(mod)
## 
## Call:
## lm(formula = wgt ~ hgt, data = bdims)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.743  -6.402  -1.231   5.059  41.103 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -105.01125    7.53941  -13.93   <2e-16 ***
## hgt            1.01762    0.04399   23.14   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.308 on 505 degrees of freedom
## Multiple R-squared:  0.5145, Adjusted R-squared:  0.5136 
## F-statistic: 535.2 on 1 and 505 DF,  p-value: < 2.2e-16

Compute the mean of the residuals() and verify that it is approximately zero.

# Compute the mean of the residuals
mean(residuals(mod))
## [1] -1.266971e-15

Use residuals() and df.residual() to compute the root mean squared error (RMSE), a.k.a. residual standard error.

# Compute RMSE
sqrt(sum(residuals(mod)^2) / df.residual(mod))
## [1] 9.30804
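As a quick cross-check (not part of the course exercise), base R can extract the same residual standard error directly from the model object:

# Residual standard error reported by summary(), extracted directly
sigma(mod)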

Comparing model fits

Assessing simple linear model fit

Recall that the coefficient of determination (R^2) can be computed as

R^2 = 1 - SSE / SST = 1 - var(e) / var(y)

where e is the vector of residuals and y is the response variable. This gives us the interpretation of R^2 as the proportion of the variability in the response that is explained by the model, since the residuals are the part of that variability that remains unexplained by the model.

EXERCISES.

The bdims_tidy data frame is the result of augment()-ing the bdims data frame with the model mod for wgt as a function of hgt.

# View model summary
str(bdims_tidy)
## tibble [507 x 8] (S3: tbl_df/tbl/data.frame)
##  $ wgt       : num [1:507] 65.6 71.8 80.7 72.6 78.8 74.8 86.4 78.4 62 81.6 ...
##  $ hgt       : num [1:507] 174 175 194 186 187 ...
##  $ .fitted   : num [1:507] 72.1 73.4 91.9 84.8 85.5 ...
##  $ .resid    : num [1:507] -6.45 -1.58 -11.2 -12.17 -6.69 ...
##  $ .std.resid: num [1:507] -0.694 -0.17 -1.211 -1.313 -0.721 ...
##  $ .hat      : num [1:507] 0.00215 0.00236 0.01313 0.00724 0.00773 ...
##  $ .sigma    : num [1:507] 9.31 9.32 9.3 9.3 9.31 ...
##  $ .cooksd   : num [1:507] 0.00052 0.000034 0.009758 0.006282 0.002026 ...
##  - attr(*, "terms")=Classes 'terms', 'formula'  language wgt ~ hgt
##   .. ..- attr(*, "variables")= language list(wgt, hgt)
##   .. ..- attr(*, "factors")= int [1:2, 1] 0 1
##   .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. ..$ : chr [1:2] "wgt" "hgt"
##   .. .. .. ..$ : chr "hgt"
##   .. ..- attr(*, "term.labels")= chr "hgt"
##   .. ..- attr(*, "order")= int 1
##   .. ..- attr(*, "intercept")= int 1
##   .. ..- attr(*, "response")= int 1
##   .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   .. ..- attr(*, "predvars")= language list(wgt, hgt)
##   .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
##   .. .. ..- attr(*, "names")= chr [1:2] "wgt" "hgt"

Use the summary() function to view the full results of mod.

summary(mod)
## 
## Call:
## lm(formula = wgt ~ hgt, data = bdims)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.743  -6.402  -1.231   5.059  41.103 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -105.01125    7.53941  -13.93   <2e-16 ***
## hgt            1.01762    0.04399   23.14   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.308 on 505 degrees of freedom
## Multiple R-squared:  0.5145, Adjusted R-squared:  0.5136 
## F-statistic: 535.2 on 1 and 505 DF,  p-value: < 2.2e-16

Use the bdims_tidy data frame to compute the R^2 of mod manually using the formula above: one minus the ratio of the variance of the residuals to the variance of the response variable.

# Compute R-squared
bdims_tidy %>%
  summarize(
    var_y = var(wgt), 
    var_e = var(.resid)
    ) %>%
  mutate(R_squared = 1 - (var_e/var_y))
## # A tibble: 1 x 3
##   var_y var_e R_squared
##   <dbl> <dbl>     <dbl>
## 1  178.  86.5     0.515

Linear vs. average

R^2 gives us a numerical measurement of the strength of fit relative to a null model based on the average of the response variable:

y_hat_null = mean(y)

This null model has an R^2 of zero because SSE = SST. That is, since the fitted values (y_hat_null) are all equal to the average (mean(y)), the residual for each observation is the distance between that observation and the mean of the response. Since we can always fit the null model, it serves as a baseline against which all other models will be compared.

In the graphic, we visualize the residuals for the null model (mod_null at left) vs. the simple linear regression model (mod_hgt at right) with height as a single explanatory variable. Try to convince yourself that, if you squared the lengths of the grey arrows on the left and summed them up, you would get a larger value than if you performed the same operation on the grey arrows on the right.

It may be useful to preview these augment()-ed data frames with glimpse():

glimpse(mod_null)
glimpse(mod_hgt)
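The graphic described above is not reproduced in this document. A rough sketch of how the right-hand panel might be drawn (the residuals of the height model shown as grey segments) is given below, assuming mod is still the wgt ~ hgt model:

# Sketch: residuals of the height model drawn as grey segments
bdims_aug <- augment(mod)
ggplot(bdims_aug, aes(x = hgt, y = wgt)) +
  geom_segment(aes(xend = hgt, yend = .fitted), color = "grey") +
  geom_point() +
  geom_line(aes(y = .fitted), color = "dodgerblue")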

EXERCISES.

Compute the sum of the squared residuals (SSE) for the null model mod_null. (Note that the code below uses var(.resid) rather than the sum of squares; this differs from the SSE only by a constant factor of n - 1, so it serves the same purpose when comparing the two models.)

# Compute SSE for null model
mod_null <- lm(wgt ~ 1, data = bdims) %>%
  augment()
head(mod_null)
## # A tibble: 6 x 7
##     wgt .fitted .resid .std.resid    .hat .sigma   .cooksd
##   <dbl>   <dbl>  <dbl>      <dbl>   <dbl>  <dbl>     <dbl>
## 1  65.6    69.1  -3.55     -0.266 0.00197   13.4 0.000140 
## 2  71.8    69.1   2.65      0.199 0.00197   13.4 0.0000782
## 3  80.7    69.1  11.6       0.866 0.00197   13.3 0.00148  
## 4  72.6    69.1   3.45      0.259 0.00197   13.4 0.000133 
## 5  78.8    69.1   9.65      0.724 0.00197   13.4 0.00104  
## 6  74.8    69.1   5.65      0.424 0.00197   13.4 0.000355

mod_null %>%
  summarize(SSE = var(.resid))
## # A tibble: 1 x 1
##     SSE
##   <dbl>
## 1  178.

Compute the sum of the squared residuals (SSE) for the regression model mod_hgt.
# Compute SSE for regression model
mod_hgt <- mod %>%
  augment()

head(mod_hgt)
## # A tibble: 6 x 8
##     wgt   hgt .fitted .resid .std.resid    .hat .sigma   .cooksd
##   <dbl> <dbl>   <dbl>  <dbl>      <dbl>   <dbl>  <dbl>     <dbl>
## 1  65.6  174     72.1  -6.45     -0.694 0.00215   9.31 0.000520 
## 2  71.8  175.    73.4  -1.58     -0.170 0.00236   9.32 0.0000340
## 3  80.7  194.    91.9 -11.2      -1.21  0.0131    9.30 0.00976  
## 4  72.6  186.    84.8 -12.2      -1.31  0.00724   9.30 0.00628  
## 5  78.8  187.    85.5  -6.69     -0.721 0.00773   9.31 0.00203  
## 6  74.8  182.    79.7  -4.89     -0.526 0.00437   9.31 0.000607
mod_hgt %>%
  summarize(SSE = var(.resid))
## # A tibble: 1 x 1
##     SSE
##   <dbl>
## 1  86.5

Unusual Points

Leverage

The leverage of an observation in a regression model is defined entirely in terms of the distance of that observation from the mean of the explanatory variable. That is, observations close to the mean of the explanatory variable have low leverage, while observations far from the mean of the explanatory variable have high leverage. Points of high leverage may or may not be influential.

The augment() function from the broom package will add the leverage scores (.hat) to a model data frame.

EXERCISES:

Use augment() to list the top 6 observations by their leverage scores, in descending order.

# Rank points of high leverage
mod <- lm(formula = SLG ~ OBP, data = filter(mlbbat10, AB >= 10))

mod %>%
  augment() %>%
  arrange(desc(.hat)) %>%
  head()
## # A tibble: 6 x 8
##     SLG   OBP .fitted  .resid .std.resid   .hat .sigma  .cooksd
##   <dbl> <dbl>   <dbl>   <dbl>      <dbl>  <dbl>  <dbl>    <dbl>
## 1 0     0     -0.0374  0.0374      0.529 0.0194 0.0715 0.00277 
## 2 0     0     -0.0374  0.0374      0.529 0.0194 0.0715 0.00277 
## 3 0     0     -0.0374  0.0374      0.529 0.0194 0.0715 0.00277 
## 4 0.308 0.55   0.690  -0.382      -5.39  0.0164 0.0701 0.243   
## 5 0     0.037  0.0115 -0.0115     -0.162 0.0150 0.0715 0.000202
## 6 0.038 0.038  0.0128  0.0252      0.354 0.0149 0.0715 0.000953
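As a cross-check of the definition above (not part of the course exercise), the .hat values for this simple regression can be reproduced from the closed-form leverage formula h_i = 1/n + (x_i - mean(x))^2 / sum((x_j - mean(x))^2):

# Recompute leverage by hand for the SLG ~ OBP model fit above
obp <- filter(mlbbat10, AB >= 10)$OBP
h <- 1 / length(obp) + (obp - mean(obp))^2 / sum((obp - mean(obp))^2)
# The largest hand-computed values should match the top .hat values shown above
head(sort(h, decreasing = TRUE))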

Influence

As noted previously, observations of high leverage may or may not be influential. The influence of an observation depends not only on its leverage, but also on the magnitude of its residual. Recall that while leverage only takes into account the explanatory variable (x), the residual depends on the response variable (y) and the fitted value (y_hat).

Influential points are likely to have high leverage and deviate from the general relationship between the two variables. We measure influence using Cook’s distance, which incorporates both the leverage and residual of each observation.

EXERCISES:

Use augment() to list the top 6 observations by their Cook’s distance (.cooksd), in descending order.

# Rank influential points
mod %>%
  augment() %>%
  arrange(desc(.cooksd)) %>%
  head()
## # A tibble: 6 x 8
##     SLG   OBP .fitted .resid .std.resid    .hat .sigma .cooksd
##   <dbl> <dbl>   <dbl>  <dbl>      <dbl>   <dbl>  <dbl>   <dbl>
## 1 0.308 0.55   0.690  -0.382      -5.39 0.0164  0.0701  0.243 
## 2 0.833 0.385  0.472   0.361       5.06 0.00344 0.0703  0.0441
## 3 0.8   0.455  0.565   0.235       3.30 0.00749 0.0710  0.0411
## 4 0.379 0.133  0.139   0.240       3.37 0.00656 0.0710  0.0376
## 5 0.786 0.438  0.542   0.244       3.42 0.00631 0.0710  0.0371
## 6 0.231 0.077  0.0645  0.167       2.34 0.0110  0.0713  0.0306
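Similarly, Cook's distance combines the residual and the leverage. For reference (this formula is not from the course text), for a model with p coefficients it can be computed as D_i = e_i^2 / (p * s^2) * h_i / (1 - h_i)^2, where s^2 is the residual variance estimate. A quick hand check against the .cooksd column above:

# Recompute Cook's distance by hand for the same model
aug <- augment(mod)
p <- length(coef(mod))
s2 <- sum(aug$.resid^2) / df.residual(mod)
d <- aug$.resid^2 / (p * s2) * aug$.hat / (1 - aug$.hat)^2
head(sort(d, decreasing = TRUE))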

Dealing with unusual points

Removing outliers

Observations can be outliers for a number of different reasons. Statisticians must always be careful—and more importantly, transparent—when dealing with outliers. Sometimes, a better model fit can be achieved by simply removing outliers and re-fitting the model. However, one must have strong justification for doing this. A desire to have a higher R^2 is not a good enough reason!

In the mlbBat10 data, the outlier with an OBP of 0.550 is Bobby Scales, an infielder who had four hits in 13 at-bats for the Chicago Cubs. Scales also walked seven times, resulting in his unusually high OBP. The justification for removing Scales here is weak. While his performance was unusual, there is nothing to suggest that it is not a valid data point, nor is there a good reason to think that somehow we will learn more about Major League Baseball players by excluding him.

Nevertheless, we can demonstrate how removing him will affect our model.

EXERCISES.

Use filter() to create a subset of mlbBat10 called nontrivial_players consisting of only those players with at least 10 at-bats and OBP of below 0.500.

# Create nontrivial_players
nontrivial_players <- mlbbat10 %>%
  filter(AB >= 10, OBP < 0.5)

Fit the linear model for SLG as a function of OBP for the nontrivial_players. Save the result as mod_cleaner.

# Fit model to new data
mod_cleaner <- lm(SLG ~ OBP, data = nontrivial_players)

View the summary() of the new model and compare the slope and R^2 to those of mod, the original model fit to the data on all players.

# View model summary
summary(mod_cleaner)
## 
## Call:
## lm(formula = SLG ~ OBP, data = nontrivial_players)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.31383 -0.04165 -0.00261  0.03992  0.35819 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.043326   0.009823  -4.411 1.18e-05 ***
## OBP          1.345816   0.033012  40.768  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.07011 on 734 degrees of freedom
## Multiple R-squared:  0.6937, Adjusted R-squared:  0.6932 
## F-statistic:  1662 on 1 and 734 DF,  p-value: < 2.2e-16

Visualize the new model with ggplot() and the appropriate geom_*() functions.

# Visualize new model
ggplot(data = nontrivial_players, aes(x = OBP, y = SLG)) +
  geom_point() + 
  geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'

High leverage points

Not all points of high leverage are influential. While the high leverage observation corresponding to Bobby Scales in the previous exercise is influential, the three observations for players with OBP and SLG values of 0 are not influential.

This is because they happen to lie right near the regression line anyway. Thus, while their extremely low OBP gives them the power to exert influence over the slope of the regression line, their low SLG prevents them from using it.

EXERCISES.

The linear model, mod, is available in your workspace. Use a combination of augment(), arrange() with two arguments, and head() to find the top 6 observations with the highest leverage but the lowest Cook’s distance.

# Rank high leverage points
mod %>%
  augment() %>%
  arrange(desc(.hat), .cooksd) %>%
  head()
## # A tibble: 6 x 8
##     SLG   OBP .fitted  .resid .std.resid   .hat .sigma  .cooksd
##   <dbl> <dbl>   <dbl>   <dbl>      <dbl>  <dbl>  <dbl>    <dbl>
## 1 0     0     -0.0374  0.0374      0.529 0.0194 0.0715 0.00277 
## 2 0     0     -0.0374  0.0374      0.529 0.0194 0.0715 0.00277 
## 3 0     0     -0.0374  0.0374      0.529 0.0194 0.0715 0.00277 
## 4 0.308 0.55   0.690  -0.382      -5.39  0.0164 0.0701 0.243   
## 5 0     0.037  0.0115 -0.0115     -0.162 0.0150 0.0715 0.000202
## 6 0.038 0.038  0.0128  0.0252      0.354 0.0149 0.0715 0.000953

Conclusion

Pretty good class. It definitely hits the basics of modeling (linear models) in R. There is a lot more to understand before I can be confident that I am taking the right steps to fit a correct model to a real-world dataset.

17.- Supervised Learning in R: Regression

Linear regression - the fundamental method

Code a simple one-variable regression

For the first coding exercise, you’ll create a formula to define a one-variable modeling task, and then fit a linear model to the data. You are given the rates of male and female unemployment in the United States over several years (Source).

The task is to predict the rate of female unemployment from the observed rate of male unemployment. The outcome is female_unemployment, and the input is male_unemployment.

The sign of the variable coefficient tells you whether the outcome increases (+) or decreases (-) as the variable increases.

Recall the calling interface for lm() is:

lm(formula, data = ___)

EXERCISE:

The data frame unemployment is in your workspace.

Define a formula that expresses female_unemployment as a function of male_unemployment. Assign the formula to the variable fmla and print it. Then use lm() and fmla to fit a linear model to predict female unemployment from male unemployment using the data set unemployment. Print the model. Is the coefficient for male unemployment consistent with what you would expect? Does female unemployment increase as male unemployment does?

unemployment <- readRDS("./Data/unemployment.rds")
# unemployment is loaded in the workspace
summary(unemployment)
##  male_unemployment female_unemployment
##  Min.   :2.900     Min.   :4.000      
##  1st Qu.:4.900     1st Qu.:4.400      
##  Median :6.000     Median :5.200      
##  Mean   :5.954     Mean   :5.569      
##  3rd Qu.:6.700     3rd Qu.:6.100      
##  Max.   :9.800     Max.   :7.900
# Define a formula to express female_unemployment as a function of male_unemployment
fmla <- female_unemployment ~ male_unemployment

# Print it
fmla
## female_unemployment ~ male_unemployment
# Use the formula to fit a model: unemployment_model
unemployment_model <-  lm(fmla, data = unemployment)

# Print it 
unemployment_model
## 
## Call:
## lm(formula = fmla, data = unemployment)
## 
## Coefficients:
##       (Intercept)  male_unemployment  
##            1.4341             0.6945

Examining a model

Let’s look at the model unemployment_model that you have just created. There are a variety of different ways to examine a model; each way provides different information. We will use summary(), broom::glance(), and sigr::wrapFTest().

EXERCISE:

The object unemployment_model is in your workspace.

Print unemployment_model again. What information does it report? Call summary() on unemployment_model. In addition to the coefficient values, you get standard errors on the coefficient estimates, and some goodness-of-fit metrics like R-squared. Call glance() on the model to see the performance metrics in an orderly data frame. Can you match the information from summary() to the columns of glance()? Now call wrapFTest() on the model to see the R-squared again.

# Print unemployment_model
unemployment_model
## 
## Call:
## lm(formula = fmla, data = unemployment)
## 
## Coefficients:
##       (Intercept)  male_unemployment  
##            1.4341             0.6945
# Call summary() on unemployment_model to get more details
summary(unemployment_model)
## 
## Call:
## lm(formula = fmla, data = unemployment)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.77621 -0.34050 -0.09004  0.27911  1.31254 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.43411    0.60340   2.377   0.0367 *  
## male_unemployment  0.69453    0.09767   7.111 1.97e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5803 on 11 degrees of freedom
## Multiple R-squared:  0.8213, Adjusted R-squared:  0.8051 
## F-statistic: 50.56 on 1 and 11 DF,  p-value: 1.966e-05
library(broom )
# Call glance() on unemployment_model to see the details in a tidier form
glance(unemployment_model)
## # A tibble: 1 x 12
##   r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.821         0.805 0.580      50.6 1.97e-5     1  -10.3  26.6  28.3
## # ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
# Call wrapFTest() on unemployment_model to see the most relevant details
# (not run here: wrapFTest() comes from the sigr package)
# library(sigr)
# wrapFTest(unemployment_model)

Predicting once you fit a model

Predicting from the unemployment model

In this exercise, you will use your unemployment model unemployment_model to make predictions from the unemployment data, and compare predicted female unemployment rates to the actual observed female unemployment rates on the training data, unemployment. You will also use your model to predict on the new data in newrates, which consists of only one observation, where male unemployment is 5%.

The predict() interface for lm models takes the form

predict(model, newdata) You will use the ggplot2 package to make the plots, so you will add the prediction column to the unemployment data frame. You will plot outcome versus prediction, and compare them to the line that represents perfect predictions (that is when the outcome is equal to the predicted value).

The ggplot2 command to plot a scatterplot of dframe$outcome versus dframe$pred (pred on the x axis, outcome on the y axis), along with a blue line where outcome == pred, is as follows:

ggplot(dframe, aes(x = pred, y = outcome)) + 
  geom_point() +
  geom_abline(color = "blue")

EXERCISE:

The objects unemployment, unemployment_model and newrates are in your workspace.

Use predict() to predict female unemployment rates from the unemployment data. Assign it to a new column: prediction. Use the library() command to load the ggplot2 package. Use ggplot() to compare the predictions to actual unemployment rates. Put the predictions on the x axis. How close are the results to the line of perfect prediction? Use the data frame newrates to predict expected female unemployment rate when male unemployment is 5%. Assign the answer to the variable pred and print it.

# unemployment is in your workspace
summary(unemployment)
##  male_unemployment female_unemployment
##  Min.   :2.900     Min.   :4.000      
##  1st Qu.:4.900     1st Qu.:4.400      
##  Median :6.000     Median :5.200      
##  Mean   :5.954     Mean   :5.569      
##  3rd Qu.:6.700     3rd Qu.:6.100      
##  Max.   :9.800     Max.   :7.900
# newrates is in your workspace
newrates <- read.csv("./Data/new_rates.csv", sep = ";")
colnames(newrates)[1] <- "male_unemployment"
# Predict female unemployment rate when male unemployment is 5%
pred <- predict(unemployment_model, newdata = newrates)
# Print it
pred
##        1 
## 4.906757
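
The first part of this exercise (adding a prediction column to unemployment and plotting it against the actual rates) was not run above. A minimal sketch, assuming unemployment_model and the ggplot2 package are available:

# Predict female unemployment on the training data and store it in a new column
unemployment$prediction <- predict(unemployment_model, newdata = unemployment)

# Compare predictions (x axis) to the actual female unemployment rates (y axis)
library(ggplot2)
ggplot(unemployment, aes(x = prediction, y = female_unemployment)) + 
  geom_point() + 
  geom_abline(color = "blue")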

Multivariate linear regression (Part 1)

In this exercise, you will work with the blood pressure dataset (Source), and model blood_pressure as a function of weight and age.

EXERCISE:

The data frame bloodpressure is in the workspace.

Define a formula that expresses blood_pressure explicitly as a function of age and weight. Assign the formula to the variable fmla and print it. Use fmla to fit a linear model to predict blood_pressure from age and weight in the data set bloodpressure. Call the model bloodpressure_model. Print the model and call summary() on it. Does blood pressure increase or decrease with age? With weight?

bloodpressure <- read.csv("./Data/bloodpressure.csv", sep = ";", col.names = c("bloodpressure", "age", "weight" ))
names(bloodpressure)
## [1] "bloodpressure" "age"           "weight"
# bloodpressure is in the workspace
summary(bloodpressure)
##  bloodpressure        age            weight   
##  Min.   :128.0   Min.   :46.00   Min.   :167  
##  1st Qu.:140.0   1st Qu.:56.50   1st Qu.:186  
##  Median :153.0   Median :64.00   Median :194  
##  Mean   :150.1   Mean   :62.45   Mean   :195  
##  3rd Qu.:160.5   3rd Qu.:69.50   3rd Qu.:209  
##  Max.   :168.0   Max.   :74.00   Max.   :220
# Create the formula and print it
fmla <- bloodpressure ~ age + weight
fmla
## bloodpressure ~ age + weight
# Fit the model: bloodpressure_model
bloodpressure_model <- lm(fmla, data = bloodpressure)

# Print bloodpressure_model and call summary()
bloodpressure_model
## 
## Call:
## lm(formula = fmla, data = bloodpressure)
## 
## Coefficients:
## (Intercept)          age       weight  
##     30.9941       0.8614       0.3349
summary(bloodpressure_model)
## 
## Call:
## lm(formula = fmla, data = bloodpressure)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4640 -1.1949 -0.4078  1.8511  2.6981 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  30.9941    11.9438   2.595  0.03186 * 
## age           0.8614     0.2482   3.470  0.00844 **
## weight        0.3349     0.1307   2.563  0.03351 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.318 on 8 degrees of freedom
## Multiple R-squared:  0.9768, Adjusted R-squared:  0.9711 
## F-statistic: 168.8 on 2 and 8 DF,  p-value: 2.874e-07

Multivariate linear regression (Part 2)

Now you will make predictions using the blood pressure model bloodpressure_model that you fit in the previous exercise.

You will also compare the predictions to outcomes graphically. ggplot2 is already loaded in your workspace. Recall the plot command takes the form:

ggplot(dframe, aes(x = pred, y = outcome)) + geom_point() + geom_abline(color = "blue")

EXERCISE:

The objects bloodpressure and bloodpressure_model are in the workspace.

Use predict() to predict blood pressure in the bloodpressure dataset. Assign the predictions to the column prediction. Graphically compare the predictions to actual blood pressures. Put predictions on the x axis. How close are the results to the line of perfect prediction?

# bloodpressure_model is in your workspace
bloodpressure_model
## 
## Call:
## lm(formula = fmla, data = bloodpressure)
## 
## Coefficients:
## (Intercept)          age       weight  
##     30.9941       0.8614       0.3349
# predict blood pressure using bloodpressure_model :prediction
bloodpressure$prediction <- predict(bloodpressure_model, newdata = bloodpressure)
head(bloodpressure)
##   bloodpressure age weight prediction
## 1           132  52    173   133.7183
## 2           143  59    184   143.4317
## 3           153  67    194   153.6716
## 4           162  73    211   164.5327
## 5           154  64    196   151.7570
## 6           168  74    220   168.4078
# plot the results
ggplot(bloodpressure, aes(x = prediction, y = bloodpressure)) + 
    geom_point() +
    geom_abline(color = "blue")

Training and Evaluating Regression Models

Graphically evaluate the unemployment model

In this exercise you will graphically evaluate the unemployment model, unemployment_model, that you fit to the unemployment data in the previous chapter. Recall that the model predicts female_unemployment from male_unemployment.

You will plot the model’s predictions against the actual female_unemployment; recall the command is of the form

ggplot(dframe, aes(x = pred, y = outcome)) + 
  geom_point() +
  geom_abline()

Then you will calculate the residuals:

residuals <- actual outcome - predicted outcome

and plot predictions against residuals. The residual graph takes a slightly different form: you compare the residuals to the horizontal line at zero (using geom_hline()) rather than to the line of perfect prediction (x = y). The command will be provided.

The data frame unemployment and model unemployment_model are available in the workspace.

EXERCISE:

Use predict() to get the model predictions and add them to unemployment as the column predictions. Plot predictions (on the x-axis) versus actual female unemployment rates. Are the predictions near the line?

# Make predictions from the model
unemployment$predictions <- predict(unemployment_model, newdata = unemployment)

# Fill in the blanks to plot predictions (on x-axis) versus the female_unemployment rates
ggplot(unemployment, aes(x = predictions, y = female_unemployment  )) + 
  geom_point() + 
  geom_abline()

Calculate the residuals between the predictions and actual unemployment rates. Add these residuals to unemployment as the column residuals. Fill in the blanks to plot predictions (on the x-axis) versus residuals (on the y-axis). This gives you a different view of the model’s predictions as compared to ground truth.

# From previous step
unemployment$predictions <- predict(unemployment_model)

# Calculate residuals
unemployment$residuals <- unemployment$female_unemployment - unemployment$predictions
# Fill in the blanks to plot predictions (on x-axis) versus the residuals
ggplot(unemployment, aes(x = predictions, y = residuals)) + 
  geom_pointrange(aes(ymin = 0, ymax = residuals)) + 
  geom_hline(yintercept = 0, linetype = 3) + 
  ggtitle("residuals vs. linear model prediction")

The gain curve to evaluate the unemployment model

In the previous exercise you made predictions about female_unemployment and visualized the predictions and the residuals. Now, you will also plot the gain curve of the unemployment_model’s predictions against actual female_unemployment using the WVPlots::GainCurvePlot() function.

For situations where order is more important than exact values, the gain curve helps you check if the model’s predictions sort in the same order as the true outcome.

Calls to the function GainCurvePlot() look like:

GainCurvePlot(frame, xvar, truthvar, title)

where

- frame is a data frame
- xvar and truthvar are strings naming the prediction and actual outcome columns of frame
- title is the title of the plot

When the predictions sort in exactly the same order, the relative Gini coefficient is 1. When the model sorts poorly, the relative Gini coefficient is close to zero, or even negative.

EXERCISE:

# unemployment is in the workspace (with predictions)
summary(unemployment)
##  male_unemployment female_unemployment  predictions      residuals       
##  Min.   :2.900     Min.   :4.000       Min.   :3.448   Min.   :-0.77621  
##  1st Qu.:4.900     1st Qu.:4.400       1st Qu.:4.837   1st Qu.:-0.34050  
##  Median :6.000     Median :5.200       Median :5.601   Median :-0.09004  
##  Mean   :5.954     Mean   :5.569       Mean   :5.569   Mean   : 0.00000  
##  3rd Qu.:6.700     3rd Qu.:6.100       3rd Qu.:6.087   3rd Qu.: 0.27911  
##  Max.   :9.800     Max.   :7.900       Max.   :8.240   Max.   : 1.31254
# unemployment_model is in the workspace
summary(unemployment_model)
## 
## Call:
## lm(formula = fmla, data = unemployment)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.77621 -0.34050 -0.09004  0.27911  1.31254 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.43411    0.60340   2.377   0.0367 *  
## male_unemployment  0.69453    0.09767   7.111 1.97e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5803 on 11 degrees of freedom
## Multiple R-squared:  0.8213, Adjusted R-squared:  0.8051 
## F-statistic: 50.56 on 1 and 11 DF,  p-value: 1.966e-05
library(dplyr)
library(tidyverse)
library(magrittr)
library(broom)
library(ggplot2)
library(readxl)
library(gdata)
library(jsonlite)
library(haven)
library(foreign)
library(sigr)
library(WVPlots)
library(vtreat)
library(Sleuth3)
library(stringr)
library(tidyr)
library(ranger)
library(ggthemes)
library(dslabs)
library(RColorBrewer)
# Plot the Gain Curve
GainCurvePlot(unemployment, "predictions", "female_unemployment", "Unemployment model")

Root Mean Squared Error (RMSE)

Calculate RMSE

In this exercise you will calculate the RMSE of your unemployment model. In the previous coding exercises, you added two columns to the unemployment dataset:

- the model’s predictions (the predictions column)
- the residuals between the predictions and the outcome (the residuals column)

You can calculate the RMSE from a vector of residuals, res, as:

rmse = sqrt(mean(res^2))

You want RMSE to be small. How small is “small”? One heuristic is to compare the RMSE to the standard deviation of the outcome. With a good model, the RMSE should be smaller.

EXERCISE:

The data frame unemployment is in your workspace.

Review the unemployment data from the previous exercise. For convenience, assign the residuals column from unemployment to the variable res. Calculate RMSE: square res, take its mean, and then square root it. Assign this to the variable rmse and print it. Tip: you can do this in one step by wrapping the assignment in parentheses: (rmse <- ___) Calculate the standard deviation of female_unemployment and assign it to the variable sd_unemployment. Print it. How does the rmse of the model compare to the standard deviation of the data?

# unemployment is in the workspace
summary(unemployment)
##  male_unemployment female_unemployment  predictions      residuals       
##  Min.   :2.900     Min.   :4.000       Min.   :3.448   Min.   :-0.77621  
##  1st Qu.:4.900     1st Qu.:4.400       1st Qu.:4.837   1st Qu.:-0.34050  
##  Median :6.000     Median :5.200       Median :5.601   Median :-0.09004  
##  Mean   :5.954     Mean   :5.569       Mean   :5.569   Mean   : 0.00000  
##  3rd Qu.:6.700     3rd Qu.:6.100       3rd Qu.:6.087   3rd Qu.: 0.27911  
##  Max.   :9.800     Max.   :7.900       Max.   :8.240   Max.   : 1.31254
# For convenience put the residuals in the variable res
res <- unemployment$residuals
# Calculate RMSE, assign it to the variable rmse and print it
(rmse <- sqrt(mean(res^2)))
## [1] 0.5337612
# Calculate the standard deviation of female_unemployment and print it
(sd_unemployment <- sd(unemployment$female_unemployment))
## [1] 1.314271

R-Squared: R^2

Calculate R-Squared

Now that you’ve calculated the RMSE of your model’s predictions, you will examine how well the model fits the data: that is, how much variance it explains. You can do this using R-squared (R^2).

Suppose y is the true outcome, pred is the prediction from the model, and res = y - pred are the residuals of the predictions. Then, with tss = sum((y - mean(y))^2) (the total sum of squares) and rss = sum(res^2) (the residual sum of squares),

R^2 = 1 - rss/tss

After you calculate R^2, you will compare what you computed with the R^2 reported by glance(). glance() returns a one-row data frame; for a linear regression model, one of the columns returned is the R^2 of the model on the training data.

The data frame unemployment is in your workspace, with the columns predictions and residuals that you calculated in a previous exercise.

EXERCISE:

The data frame unemployment and the model unemployment_model are in the workspace.

Calculate the mean female_unemployment and assign it to the variable fe_mean.

# unemployment is in your workspace
summary(unemployment)
##  male_unemployment female_unemployment  predictions      residuals       
##  Min.   :2.900     Min.   :4.000       Min.   :3.448   Min.   :-0.77621  
##  1st Qu.:4.900     1st Qu.:4.400       1st Qu.:4.837   1st Qu.:-0.34050  
##  Median :6.000     Median :5.200       Median :5.601   Median :-0.09004  
##  Mean   :5.954     Mean   :5.569       Mean   :5.569   Mean   : 0.00000  
##  3rd Qu.:6.700     3rd Qu.:6.100       3rd Qu.:6.087   3rd Qu.: 0.27911  
##  Max.   :9.800     Max.   :7.900       Max.   :8.240   Max.   : 1.31254
# unemployment_model is in the workspace
summary(unemployment_model)
## 
## Call:
## lm(formula = fmla, data = unemployment)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.77621 -0.34050 -0.09004  0.27911  1.31254 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.43411    0.60340   2.377   0.0367 *  
## male_unemployment  0.69453    0.09767   7.111 1.97e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5803 on 11 degrees of freedom
## Multiple R-squared:  0.8213, Adjusted R-squared:  0.8051 
## F-statistic: 50.56 on 1 and 11 DF,  p-value: 1.966e-05
# Calculate mean female_unemployment: fe_mean. Print it
(fe_mean <- mean(unemployment$female_unemployment))
## [1] 5.569231

Calculate the total sum of squares and assign it to the variable tss.

# Calculate total sum of squares: tss. Print it
(tss <- sum( (unemployment$female_unemployment - fe_mean)^2 ))
## [1] 20.72769

Calculate the residual sum of squares and assign it to the variable rss.

# Calculate residual sum of squares: rss. Print it
(rss <- sum(unemployment$residuals^2))
## [1] 3.703714
# Calculate R-squared: rsq. Print it. Is it a good fit?
(rsq <- 1 - (rss/tss))
## [1] 0.8213157

Calculate R-squared (rsq). Is it a good fit (R^2 near 1)?

At about 0.82, R-squared is reasonably close to 1, so the model fits the data well.

Use glance() to get R-squared from the model. Is it the same as what you calculated?

# Get R-squared from glance. Print it
(rsq_glance <- glance(unemployment_model)$r.squared)
## [1] 0.8213157

Correlation and R-squared

The linear correlation of two variables, x and y, measures the strength of the linear relationship between them. When x and y are respectively:

- the outcomes of a regression model that minimizes squared-error (like linear regression), and
- the true outcomes of the training data,

then the square of the correlation is the same as R^2. You will verify that in this exercise.

EXERCISE:

Use cor() to get the correlation between the predictions and female unemployment. Assign it to the variable rho and print it. Make sure you use Pearson correlation (the default).

# Get the correlation between the prediction and true outcome: rho and print it
(rho <- cor(unemployment$predictions, unemployment$female_unemployment))
## [1] 0.9062647

Square rho and assign it to rho2. Print it.

# Square rho: rho2 and print it
(rho2 <- rho ^ 2)
## [1] 0.8213157

Compare rho2 to the R-squared from the model (using glance()). Is it the same?

# Get R-squared from glance and print it
(rsq_glance <- glance(unemployment_model)$r.squared)
## [1] 0.8213157

Properly Training a Model

Generating a random test/train split

For the next several exercises you will use the mpg data from the package ggplot2. The data describes the characteristics of several makes and models of cars from different years. The goal is to predict city fuel efficiency from highway fuel efficiency.

In this exercise, you will split mpg into a training set mpg_train (75% of the data) and a test set mpg_test (25% of the data). One way to do this is to generate a column of uniform random numbers between 0 and 1, using the function runif().

If you have a data set dframe of size N, and you want a random subset of approximately X*100% of N (where X is between 0 and 1), then:

- Generate a vector of uniform random numbers: gp = runif(N).
- dframe[gp < X, ] will be about the right size.
- dframe[gp >= X, ] will be the complement.

EXERCISE:

The data frame mpg is in the workspace.

Use the function nrow() to get the number of rows in the data frame mpg. Assign this count to the variable N and print it.

# mpg is in the workspace
summary(mpg)
##  manufacturer          model               displ            year     
##  Length:234         Length:234         Min.   :1.600   Min.   :1999  
##  Class :character   Class :character   1st Qu.:2.400   1st Qu.:1999  
##  Mode  :character   Mode  :character   Median :3.300   Median :2004  
##                                        Mean   :3.472   Mean   :2004  
##                                        3rd Qu.:4.600   3rd Qu.:2008  
##                                        Max.   :7.000   Max.   :2008  
##       cyl           trans               drv                 cty       
##  Min.   :4.000   Length:234         Length:234         Min.   : 9.00  
##  1st Qu.:4.000   Class :character   Class :character   1st Qu.:14.00  
##  Median :6.000   Mode  :character   Mode  :character   Median :17.00  
##  Mean   :5.889                                         Mean   :16.86  
##  3rd Qu.:8.000                                         3rd Qu.:19.00  
##  Max.   :8.000                                         Max.   :35.00  
##       hwy             fl               class          
##  Min.   :12.00   Length:234         Length:234        
##  1st Qu.:18.00   Class :character   Class :character  
##  Median :24.00   Mode  :character   Mode  :character  
##  Mean   :23.44                                        
##  3rd Qu.:27.00                                        
##  Max.   :44.00
dim(mpg)
## [1] 234  11
# Use nrow to get the number of rows in mpg (N) and print it
(N <- nrow(mpg))
## [1] 234

Calculate about how many rows 75% of N should be. Assign it to the variable target and print it.

# Calculate how many rows 75% of N should be and print it
# Hint: use round() to get an integer
(target <- round(N * 0.75))
## [1] 176

Use runif() to generate a vector of N uniform random numbers, called gp.

# Create the vector of N uniform random variables: gp
gp <- runif(N)

Use gp to split mpg into mpg_train and mpg_test (with mpg_train containing approximately 75% of the data).

# Use gp to create the training set: mpg_train (75% of data) and mpg_test (25% of data)
mpg_train <- mpg[gp < 0.75, ]
mpg_test <- mpg[gp >= 0.75, ]

Use nrow() to check the size of mpg_train and mpg_test. Are they about the right size?

# Use nrow() to examine mpg_train and mpg_test
nrow(mpg_train)
## [1] 183
nrow(mpg_test)
## [1] 51

Train a model using test/train split

Now that you have split the mpg dataset into mpg_train and mpg_test, you will use mpg_train to train a model to predict city fuel efficiency (cty) from highway fuel efficiency (hwy).

EXERCISES:

The data frame mpg_train is in the workspace.

# mpg_train is in the workspace
summary(mpg_train)
##  manufacturer          model               displ            year     
##  Length:183         Length:183         Min.   :1.600   Min.   :1999  
##  Class :character   Class :character   1st Qu.:2.400   1st Qu.:1999  
##  Mode  :character   Mode  :character   Median :3.300   Median :1999  
##                                        Mean   :3.481   Mean   :2003  
##                                        3rd Qu.:4.600   3rd Qu.:2008  
##                                        Max.   :7.000   Max.   :2008  
##       cyl          trans               drv                 cty       
##  Min.   :4.00   Length:183         Length:183         Min.   : 9.00  
##  1st Qu.:4.00   Class :character   Class :character   1st Qu.:14.00  
##  Median :6.00   Mode  :character   Mode  :character   Median :17.00  
##  Mean   :5.88                                         Mean   :16.83  
##  3rd Qu.:8.00                                         3rd Qu.:19.00  
##  Max.   :8.00                                         Max.   :35.00  
##       hwy             fl               class          
##  Min.   :12.00   Length:183         Length:183        
##  1st Qu.:17.50   Class :character   Class :character  
##  Median :24.00   Mode  :character   Mode  :character  
##  Mean   :23.34                                        
##  3rd Qu.:27.00                                        
##  Max.   :44.00

Create a formula fmla that expresses the relationship cty as a function of hwy. Print it.

# create a formula to express cty as a function of hwy: fmla and print it.
(fmla <- cty ~ hwy)
## cty ~ hwy

Train a model mpg_model on mpg_train to predict cty from hwy using fmla and lm().

# Now use lm() to build a model mpg_model from mpg_train that predicts cty from hwy 
mpg_model <- lm(fmla, data = mpg_train)

Use summary() to examine the model.

# Use summary() to examine the model
summary(mpg_model)
## 
## Call:
## lm(formula = fmla, data = mpg_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9605 -0.7545  0.0395  0.6274  4.5802 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.90012    0.36992   2.433   0.0159 *  
## hwy          0.68242    0.01534  44.474   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.25 on 181 degrees of freedom
## Multiple R-squared:  0.9162, Adjusted R-squared:  0.9157 
## F-statistic:  1978 on 1 and 181 DF,  p-value: < 2.2e-16

Evaluate a model using test/train split

Now you will test the model mpg_model on the test data, mpg_test. Functions rmse() and r_squared() to calculate RMSE and R-squared have been provided for convenience:

- rmse(predcol, ycol)
- r_squared(predcol, ycol)

where:

- predcol: the predicted values
- ycol: the actual outcome

You will also plot the predictions vs. the outcome.

Generally, model performance is better on the training data than the test data (though sometimes the test set “gets lucky”). A slight difference in performance is okay; if the performance on training is significantly better, there is a problem.

EXERCISE:

The data frames mpg_train and mpg_test, and the model mpg_model are in the workspace, along with the functions rmse() and r_squared().

Predict city fuel efficiency from hwy on the mpg_train data. Assign the predictions to the column pred.

rmse <- function(predcol, ycol) {
  res <- predcol - ycol
  sqrt(mean(res^2))
}

r_squared <- function(predcol, ycol) {
  tss <- sum((ycol - mean(ycol))^2)
  rss <- sum((predcol - ycol)^2)
  1 - rss/tss
}

Predict city fuel efficiency from hwy on the mpg_test data. Assign the predictions to the column pred.

# predict cty from hwy for the training set
mpg_train$pred <- predict(mpg_model)
# predict cty from hwy for the test set
mpg_test$pred <- predict(mpg_model, newdata = mpg_test)

Use rmse() to evaluate rmse for both the test and training sets. Compare. Are the performances similar?

# Evaluate the rmse on both training and test data and print them

(rmse_train <- rmse(mpg_train$pred, mpg_train$cty))
## [1] 1.242857
(rmse_test <- rmse(mpg_test$pred, mpg_test$cty))
## [1] 1.264495

Do the same with r_squared(). Are the performances similar?

# Evaluate the r-squared on both training and test data.and print them
(rsq_train <- r_squared(mpg_train$pred, mpg_train$cty))
## [1] 0.9161625
(rsq_test <- r_squared(mpg_test$pred, mpg_test$cty))
## [1] 0.903828

Use ggplot2 to plot the predictions against cty on the test data.

# Plot the predictions (on the x-axis) against the outcome (cty) on the test data
ggplot(mpg_test, aes(x = pred, y = cty)) + 
  geom_point() + 
  geom_abline()

Create a cross validation plan

There are several ways to implement an n-fold cross validation plan. In this exercise you will create such a plan using vtreat::kWayCrossValidation(), and examine it.

kWayCrossValidation() creates a cross validation plan with the following call:

splitPlan <- kWayCrossValidation(nRows, nSplits, dframe, y)

where nRows is the number of rows of data to be split, and nSplits is the desired number of cross-validation folds.

Strictly speaking, dframe and y aren’t used by kWayCrossValidation; they are there for compatibility with other vtreat data partitioning functions. You can set them both to NULL.

The resulting splitPlan is a list of nSplits elements; each element contains two vectors:

- train: the indices of dframe that will form the training set
- app: the indices of dframe that will form the test (or application) set

In this exercise you will create a 3-fold cross-validation plan for the data set mpg.

EXERCISES:

Load the package vtreat.

library(vtreat)

Get the number of rows in mpg and assign it to the variable nRows.

# mpg is in the workspace
summary(mpg)
##  manufacturer          model               displ            year     
##  Length:234         Length:234         Min.   :1.600   Min.   :1999  
##  Class :character   Class :character   1st Qu.:2.400   1st Qu.:1999  
##  Mode  :character   Mode  :character   Median :3.300   Median :2004  
##                                        Mean   :3.472   Mean   :2004  
##                                        3rd Qu.:4.600   3rd Qu.:2008  
##                                        Max.   :7.000   Max.   :2008  
##       cyl           trans               drv                 cty       
##  Min.   :4.000   Length:234         Length:234         Min.   : 9.00  
##  1st Qu.:4.000   Class :character   Class :character   1st Qu.:14.00  
##  Median :6.000   Mode  :character   Mode  :character   Median :17.00  
##  Mean   :5.889                                         Mean   :16.86  
##  3rd Qu.:8.000                                         3rd Qu.:19.00  
##  Max.   :8.000                                         Max.   :35.00  
##       hwy             fl               class          
##  Min.   :12.00   Length:234         Length:234        
##  1st Qu.:18.00   Class :character   Class :character  
##  Median :24.00   Mode  :character   Mode  :character  
##  Mean   :23.44                                        
##  3rd Qu.:27.00                                        
##  Max.   :44.00

Call kWayCrossValidation to create a 3-fold cross validation plan and assign it to the variable splitPlan.

# Get the number of rows in mpg
nRows <- nrow(mpg)

You can set the last two arguments of the function to NULL.

# Implement the 3-fold cross-fold plan with vtreat
splitPlan <- kWayCrossValidation(nRows, 3,NULL,NULL)

Call str() to examine the structure of splitPlan.

# Examine the split plan
str(splitPlan)
## List of 3
##  $ :List of 2
##   ..$ train: int [1:156] 1 2 3 5 6 7 9 10 11 13 ...
##   ..$ app  : int [1:78] 130 45 212 157 51 172 177 123 179 12 ...
##  $ :List of 2
##   ..$ train: int [1:156] 2 4 5 6 8 10 12 15 16 17 ...
##   ..$ app  : int [1:78] 3 165 27 182 100 112 39 219 233 60 ...
##  $ :List of 2
##   ..$ train: int [1:156] 1 3 4 7 8 9 11 12 13 14 ...
##   ..$ app  : int [1:78] 134 28 5 222 171 122 228 231 21 207 ...
##  - attr(*, "splitmethod")= chr "kwaycross"

Evaluate a modeling procedure using n-fold cross-validation

In this exercise you will use splitPlan, the 3-fold cross validation plan from the previous exercise, to make predictions from a model that predicts mpg$cty from mpg$hwy.

If dframe is the training data, then one way to add a column of cross-validation predictions to the frame is as follows:

# Initialize a column of the appropriate length
dframe$pred.cv <- 0

# k is the number of folds
# splitPlan is the cross validation plan

for(i in 1:k) {
  # Get the ith split
  split <- splitPlan[[i]]

  # Build a model on the training data from this split (lm, in this case)
  model <- lm(fmla, data = dframe[split$train, ])

  # Make predictions on the application data from this split
  dframe$pred.cv[split$app] <- predict(model, newdata = dframe[split$app, ])
}

Cross-validation predicts how well a model built from all the data will perform on new data. As with the test/train split, for a good modeling procedure, cross-validation performance and training performance should be close.

EXERCISE:

The data frame mpg, the cross validation plan splitPlan, and the function to calculate RMSE (rmse()) from one of the previous exercises are available in your workspace.

Run the 3-fold cross validation plan from splitPlan and put the predictions in the column mpg$pred.cv. Use lm() and the formula cty ~ hwy.

# mpg is in the workspace
summary(mpg)
##  manufacturer          model               displ            year     
##  Length:234         Length:234         Min.   :1.600   Min.   :1999  
##  Class :character   Class :character   1st Qu.:2.400   1st Qu.:1999  
##  Mode  :character   Mode  :character   Median :3.300   Median :2004  
##                                        Mean   :3.472   Mean   :2004  
##                                        3rd Qu.:4.600   3rd Qu.:2008  
##                                        Max.   :7.000   Max.   :2008  
##       cyl           trans               drv                 cty       
##  Min.   :4.000   Length:234         Length:234         Min.   : 9.00  
##  1st Qu.:4.000   Class :character   Class :character   1st Qu.:14.00  
##  Median :6.000   Mode  :character   Mode  :character   Median :17.00  
##  Mean   :5.889                                         Mean   :16.86  
##  3rd Qu.:8.000                                         3rd Qu.:19.00  
##  Max.   :8.000                                         Max.   :35.00  
##       hwy             fl               class          
##  Min.   :12.00   Length:234         Length:234        
##  1st Qu.:18.00   Class :character   Class :character  
##  Median :24.00   Mode  :character   Mode  :character  
##  Mean   :23.44                                        
##  3rd Qu.:27.00                                        
##  Max.   :44.00
# splitPlan is in the workspace
str(splitPlan)
## List of 3
##  $ :List of 2
##   ..$ train: int [1:156] 1 2 3 5 6 7 9 10 11 13 ...
##   ..$ app  : int [1:78] 130 45 212 157 51 172 177 123 179 12 ...
##  $ :List of 2
##   ..$ train: int [1:156] 2 4 5 6 8 10 12 15 16 17 ...
##   ..$ app  : int [1:78] 3 165 27 182 100 112 39 219 233 60 ...
##  $ :List of 2
##   ..$ train: int [1:156] 1 3 4 7 8 9 11 12 13 14 ...
##   ..$ app  : int [1:78] 134 28 5 222 171 122 228 231 21 207 ...
##  - attr(*, "splitmethod")= chr "kwaycross"

Create a linear regression model on all the mpg data (formula cty ~ hwy) and assign the predictions to mpg$pred.

# Run the 3-fold cross validation plan from splitPlan
k <- 3 # Number of folds
mpg$pred.cv <- 0 
for(i in 1:k) {
  split <- splitPlan[[i]]
  model <- lm(cty ~ hwy, data = mpg[split$train, ])
  mpg$pred.cv[split$app] <- predict(model, newdata = mpg[split$app, ])
}

Use rmse() to get the root mean squared error of the predictions from the full model (mpg$pred). Recall that rmse() takes two arguments, the predicted values, and the actual outcome.

# Predict from a full model
mpg$pred <- predict(lm(cty ~ hwy, data = mpg))
# Get the rmse of the full model's predictions
rmse(mpg$pred, mpg$cty)
## [1] 1.247045
# Get the rmse of the cross-validation predictions
rmse(mpg$pred.cv, mpg$cty)
## [1] 1.267149

Get the root mean squared error of the cross-validation predictions. Are the two values about the same?

Issues to Consider

Examining the structure of categorical inputs

For this exercise you will call model.matrix() to examine how R represents data with both categorical and numerical inputs for modeling. The dataset flowers (derived from the Sleuth3 package) is loaded into your workspace. It has the following columns:

- Flowers: the average number of flowers on a meadowfoam plant
- Intensity: the intensity of a light treatment applied to the plant
- Time: a categorical variable - when (Late or Early) in the lifecycle the light treatment occurred

The ultimate goal is to predict Flowers as a function of Time and Intensity.

Exercises:

The data frame flowers is in your workspace.

Call the str() function on flowers to see the types of each column.

# Call str on flowers to see the types of each column
flowers <- read.csv("./Data/flowers.csv", sep = ";")
colnames(flowers)[1]<- "Flowers"
str(flowers)
## 'data.frame':    20 obs. of  3 variables:
##  $ Flowers  : num  62.3 77.4 55.3 54.2 49.6 61.9 39.4 45.7 31.3 44.9 ...
##  $ Time     : chr  "Late" "Late" "Late" "Late" ...
##  $ Intensity: int  150 150 300 300 450 450 600 600 750 750 ...

Use the unique() function on the column flowers$Time to see the possible values that Time takes. How many unique values are there?

# Use unique() to see how many possible values Time takes
unique(flowers$Time)
## [1] "Late"  "Early"

Create a formula to express Flowers as a function of Intensity and Time.

# Build a formula to express Flowers as a function of Intensity and Time: fmla. Print it
(fmla <- as.formula("Flowers ~ Intensity + Time"))
## Flowers ~ Intensity + Time

Assign it to the variable fmla and print it.

Use fmla and model.matrix() to create the model matrix for the data frame flowers. Assign it to the variable mmat.

# Use fmla and model.matrix to see how the data is represented for modeling
mmat <- model.matrix(fmla, flowers)

Use head() to examine the first 20 lines of flowers.

# Examine the first 20 lines of flowers
head(flowers,20)
##    Flowers  Time Intensity
## 1     62.3  Late       150
## 2     77.4  Late       150
## 3     55.3  Late       300
## 4     54.2  Late       300
## 5     49.6  Late       450
## 6     61.9  Late       450
## 7     39.4  Late       600
## 8     45.7  Late       600
## 9     31.3  Late       750
## 10    44.9  Late       750
## 11    36.8  Late       900
## 12    41.9  Late       900
## 13    77.8 Early       150
## 14    75.6 Early       150
## 15    69.1 Early       300
## 16    78.0 Early       300
## 17    57.0 Early       450
## 18    71.1 Early       450
## 19    62.9 Early       600
## 20    52.5 Early       600

Now examine the first 20 lines of mmat.

# Examine the first 20 lines of mmat
head(mmat,20)
##    (Intercept) Intensity TimeLate
## 1            1       150        1
## 2            1       150        1
## 3            1       300        1
## 4            1       300        1
## 5            1       450        1
## 6            1       450        1
## 7            1       600        1
## 8            1       600        1
## 9            1       750        1
## 10           1       750        1
## 11           1       900        1
## 12           1       900        1
## 13           1       150        0
## 14           1       150        0
## 15           1       300        0
## 16           1       300        0
## 17           1       450        0
## 18           1       450        0
## 19           1       600        0
## 20           1       600        0

Is the numeric column Intensity different?

What happened to the categorical column Time from flowers?

How is Time == "Early" represented? And Time == "Late"?

Modeling with categorical inputs

For this exercise you will fit a linear model to the flowers data, to predict Flowers as a function of Time and Intensity.

The model formula fmla that you created in the previous exercise is still in your workspace, as is the model matrix mmat.

EXERCISE:

Use fmla and lm to train a linear model that predicts Flowers from Intensity and Time. Assign the model to the variable flower_model.

Use summary() to remind yourself of the structure of mmat.

# flowers is in the workspace
str(flowers)
## 'data.frame':    20 obs. of  3 variables:
##  $ Flowers  : num  62.3 77.4 55.3 54.2 49.6 61.9 39.4 45.7 31.3 44.9 ...
##  $ Time     : chr  "Late" "Late" "Late" "Late" ...
##  $ Intensity: int  150 150 300 300 450 450 600 600 750 750 ...

Use summary() to examine the flower_model. Do the variables match what you saw in mmat?

# fmla is in the workspace
fmla
## Flowers ~ Intensity + Time

Use flower_model to predict the number of flowers. Add the predictions to flowers as the column predictions.

# Fit a model to predict Flowers from Intensity and Time : flower_model
flower_model <- lm(fmla, flowers)

# Use summary on mmat to remind yourself of its structure
summary(mmat)
##   (Intercept)   Intensity      TimeLate  
##  Min.   :1    Min.   :150   Min.   :0.0  
##  1st Qu.:1    1st Qu.:300   1st Qu.:0.0  
##  Median :1    Median :450   Median :1.0  
##  Mean   :1    Mean   :465   Mean   :0.6  
##  3rd Qu.:1    3rd Qu.:600   3rd Qu.:1.0  
##  Max.   :1    Max.   :900   Max.   :1.0

Fill in the blanks to plot predictions vs. actual flowers (predictions on the x-axis).

# Predict the number of flowers on each plant
flowers$predictions <- predict(flower_model, flowers)

# Plot predictions vs actual flowers (predictions on x-axis)
ggplot(flowers, aes(x = predictions, y = Flowers)) + 
  geom_point() +
  geom_abline(color = "blue") 

INTERACTIONS

Modeling an interaction

In this exercise you will use interactions to model the effect of gender and gastric activity on alcohol metabolism.

The data frame alcohol has columns:

  • Metabol: the alcohol metabolism rate

  • Gastric: the rate of gastric alcohol dehydrogenase activity

  • Sex: the sex of the drinker (Male or Female)

In the video, we fit three models to the alcohol data:

  • one with only additive (main effect) terms : Metabol ~ Gastric + Sex

  • two models, each with interactions between gastric activity and sex

We saw that one of the models with interaction terms had a better R-squared than the additive model, suggesting that using interaction terms gives a better fit. In this exercise we will compare the R-squared of one of the interaction models to the main-effects-only model.

Recall that the operator : designates the interaction between two variables. The operator * designates the interaction between the two variables, plus the main effects.

x*y = x + y + x:y

EXERCISES:

The data frame alcohol is in your workspace.

Write a formula that expresses Metabol as a function of Gastric and Sex with no interactions. Assign the formula to the variable fmla_add and print it.

alcohol <- read.csv("./Data/alcohol.csv", sep = ";")
summary(alcohol)
##    ï..Subject      Metabol         Gastric         Sex           
##  Min.   : 1.0   Min.   : 0.10   Min.   :0.80   Length:31         
##  1st Qu.: 8.5   1st Qu.: 0.80   1st Qu.:1.20   Class :character  
##  Median :16.0   Median : 1.80   Median :1.60   Mode  :character  
##  Mean   :16.0   Mean   : 2.49   Mean   :1.79                     
##  3rd Qu.:23.5   3rd Qu.: 2.95   3rd Qu.:2.20                     
##  Max.   :31.0   Max.   :12.30   Max.   :5.20                     
##    Alcohol         
##  Length:31         
##  Class :character  
##  Mode  :character  
##                    
##                    
## 

Write a formula that expresses Metabol as a function of the interaction between Gastric and Sex.

# Create the formula with main effects only
(fmla_add <- as.formula('Metabol~Gastric+Sex')  )
## Metabol ~ Gastric + Sex
# Create the formula with interactions
(fmla_interaction <- as.formula('Metabol~Gastric+Gastric:Sex')  )
## Metabol ~ Gastric + Gastric:Sex

Add Gastric as a main effect, but not Sex.

# Fit the main effects only model
model_add <- lm(fmla_add,alcohol)

Assign the formula to the variable fmla_interaction and print it.

# Fit the interaction model
model_interaction <- lm(fmla_interaction, alcohol)

Call summary() on both models. Which has a better R-squared?

# Call summary on both models and compare
summary(model_add)
## 
## Call:
## lm(formula = fmla_add, data = alcohol)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.0728 -0.4774 -0.1106  0.5922  3.9604 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.3579     0.6768  -2.006  0.05456 .  
## Gastric       1.6678     0.3649   4.571 8.95e-05 ***
## SexMale       2.2276     0.6542   3.405  0.00202 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.652 on 28 degrees of freedom
## Multiple R-squared:  0.6438, Adjusted R-squared:  0.6183 
## F-statistic:  25.3 on 2 and 28 DF,  p-value: 5.296e-07
summary(model_interaction)
## 
## Call:
## lm(formula = fmla_interaction, data = alcohol)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.2760 -0.3208  0.0830  0.7217  3.7179 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      0.01202    0.72356   0.017 0.986860    
## Gastric          0.77500    0.46767   1.657 0.108657    
## Gastric:SexMale  1.28574    0.32312   3.979 0.000444 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.57 on 28 degrees of freedom
## Multiple R-squared:  0.6782, Adjusted R-squared:  0.6553 
## F-statistic: 29.51 on 2 and 28 DF,  p-value: 1.275e-07
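
For reference, the other interaction model mentioned above (the interaction plus both main effects, written with the * operator) can be fit the same way. This is a sketch for comparison only, not part of the graded exercise; the name model_star is illustrative:

# Metabol ~ Gastric * Sex expands to Metabol ~ Gastric + Sex + Gastric:Sex
model_star <- lm(Metabol ~ Gastric * Sex, data = alcohol)
summary(model_star)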

Modeling an interaction (2)

In this exercise, you will compare the performance of the interaction model you fit in the previous exercise to the performance of a main-effects only model. Because this data set is small, we will use cross-validation to simulate making predictions on out-of-sample data.

You will begin to use the dplyr package to do calculations.

- mutate() adds new columns to a tbl (a type of data frame)
- group_by() specifies how rows are grouped in a tbl
- summarize() computes summary statistics of a column

You will also use tidyr’s gather(), which takes multiple columns and collapses them into key-value pairs.

EXERCISE:

The data frame alcohol and the formulas fmla_add and fmla_interaction are in the workspace.

summary(alcohol)
##    ï..Subject      Metabol         Gastric         Sex           
##  Min.   : 1.0   Min.   : 0.10   Min.   :0.80   Length:31         
##  1st Qu.: 8.5   1st Qu.: 0.80   1st Qu.:1.20   Class :character  
##  Median :16.0   Median : 1.80   Median :1.60   Mode  :character  
##  Mean   :16.0   Mean   : 2.49   Mean   :1.79                     
##  3rd Qu.:23.5   3rd Qu.: 2.95   3rd Qu.:2.20                     
##  Max.   :31.0   Max.   :12.30   Max.   :5.20                     
##    Alcohol         
##  Length:31         
##  Class :character  
##  Mode  :character  
##                    
##                    
## 

Use kWayCrossValidation() to create a splitting plan for a 3-fold cross validation.

- The first argument is the number of rows to be split.
- The second argument is the number of folds for the cross-validation.
- You can set the 3rd and 4th arguments of the function to NULL.

# Both the formulae are in the workspace
fmla_add
## Metabol ~ Gastric + Sex
fmla_interaction
## Metabol ~ Gastric + Gastric:Sex
# Create the splitting plan for 3-fold cross validation
set.seed(34245)  # set the seed for reproducibility

#splitPlan <- kWayCrossValidation(nRows, nSplits, dframe, y)

splitPlan <- kWayCrossValidation(nrow(alcohol), 3, NULL, NULL)

Examine and run the sample code to get the 3-fold cross-validation predictions of a model with no interactions and assign them to the column pred_add.

head(alcohol)
##   ï..Subject Metabol Gastric    Sex       Alcohol
## 1          1     0.6     1.0 Female     Alcoholic
## 2          2     0.6     1.6 Female     Alcoholic
## 3          3     1.5     1.5 Female     Alcoholic
## 4          4     0.4     2.2 Female Non-alcoholic
## 5          5     0.1     1.1 Female Non-alcoholic
## 6          6     0.2     1.2 Female Non-alcoholic
# Sample code: Get cross-val predictions for main-effects only model
alcohol$pred_add <- 0  # initialize the prediction vector
head(alcohol)
##   ï..Subject Metabol Gastric    Sex       Alcohol pred_add
## 1          1     0.6     1.0 Female     Alcoholic        0
## 2          2     0.6     1.6 Female     Alcoholic        0
## 3          3     1.5     1.5 Female     Alcoholic        0
## 4          4     0.4     2.2 Female Non-alcoholic        0
## 5          5     0.1     1.1 Female Non-alcoholic        0
## 6          6     0.2     1.2 Female Non-alcoholic        0
for(i in 1:3) {
  split <- splitPlan[[i]]
  model_add <- lm(fmla_add, data = alcohol[split$train, ])
  alcohol$pred_add[split$app] <- predict(model_add, newdata = alcohol[split$app, ])
}
head(alcohol)
##   ï..Subject Metabol Gastric    Sex       Alcohol  pred_add
## 1          1     0.6     1.0 Female     Alcoholic 0.5041792
## 2          2     0.6     1.6 Female     Alcoholic 1.6345202
## 3          3     1.5     1.5 Female     Alcoholic 0.9135050
## 4          4     0.4     2.2 Female Non-alcoholic 2.7648612
## 5          5     0.1     1.1 Female Non-alcoholic 0.7438025
## 6          6     0.2     1.2 Female Non-alcoholic 0.8809596

Get the 3-fold cross-validation predictions of the model with interactions. Assign the predictions to the column pred_interaction.

- The sample code shows you the procedure.
- Use the same splitPlan that you already created.

# Get the cross-val predictions for the model with interactions
alcohol$pred_interaction <- 0 # initialize the prediction vector
for(i in 1:3) {
  split <- splitPlan[[i]]
  model_interaction <- lm(fmla_interaction, data = alcohol[split$train, ])
  alcohol$pred_interaction[split$app] <- predict(model_interaction, newdata = alcohol[split$app, ])
}

Fill in the blanks to:

- gather the predictions into a single column pred.
- add a column of residuals (actual outcome - predicted outcome).
- get the RMSE of the cross-validation predictions for each model type.

# Get RMSE
alcohol %>% 
  gather(key = modeltype, value = pred, pred_add, pred_interaction) %>%
  mutate(residuals = Metabol - pred) %>%
  group_by(modeltype) %>%
  summarize(rmse = sqrt(mean(residuals^2)))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 2
##   modeltype         rmse
##   <chr>            <dbl>
## 1 pred_add          1.86
## 2 pred_interaction  1.61

Compare the RMSEs. Based on these results, which model should you use?

Transforming the response before modeling

In this exercise, you will compare relative error to absolute error. For the purposes of modeling, we will define relative error as

rel = (y - pred) / y

that is, the error is relative to the true outcome. You will measure the overall relative error of a model using root mean squared relative error:

rmse.rel = sqrt(mean(rel^2))

where mean(rel^2) is the mean of the squared relative error.

The example (toy) dataset fdata is loaded in your workspace. It includes the columns:

- y: the true output to be predicted by some model; imagine it is the amount of money a customer will spend on a visit to your store.
- pred: the predictions of a model that predicts y.
- label: categorical; whether y comes from a population that makes small purchases, or large ones.

You want to know which model does “better”: the one predicting the small purchases, or the one predicting large ones.

EXERCISES:

The data frame fdata is in the workspace.

Fill in the blanks to examine the data. Notice that large purchases tend to be about 100 times larger than small ones.

Fill in the blanks to create error columns: Define residual as y - pred.

Define relative error as residual / y.

Fill in the blanks to calculate and compare RMSE and relative RMSE. How do the absolute errors compare? The relative errors?

Examine the plot of predictions versus outcome. In your opinion, which model does “better”?
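
No code was run for this fdata exercise in these notes. The sketch below shows one way to fill in the blanks, assuming the toy data frame fdata (with columns y, pred, label) is available in the workspace as described:

# Examine the data: large purchases tend to be about 100 times larger than small ones
fdata %>% 
  group_by(label) %>% 
  summarize(min = min(y), mean = mean(y), max = max(y))

# Create the error columns: residual and relative error
fdata2 <- fdata %>% 
  mutate(residual = y - pred,        # residual
         relerr   = residual / y)    # relative error

# Compare RMSE and relative RMSE for each population
fdata2 %>% 
  group_by(label) %>% 
  summarize(rmse     = sqrt(mean(residual^2)),   # RMSE
            rmse.rel = sqrt(mean(relerr^2)))     # root mean squared relative error

# Plot predictions versus outcome
ggplot(fdata2, aes(x = pred, y = y, color = label)) + 
  geom_point() + 
  geom_abline()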

Modeling log-transformed monetary output

In this exercise, you will practice modeling on log-transformed monetary output, and then transforming the “log-money” predictions back into monetary units. The data loaded into your workspace records subjects’ incomes in 2005 (Income2005), as well as the results of several aptitude tests taken by the subjects in 1981:

Arith Word Parag Math AFQT (Percentile on the Armed Forces Qualifying Test)

The data have already been split into training and test sets (income_train and income_test respectively) and are in the workspace. You will build a model of log(income) from the inputs, and then convert log(income) back into income.

EXERCISE:

Call summary() on income_train$Income2005 to see the summary statistics of income in the training set.

income_test <- read.csv("./Data/income_test.csv")
income_train <- read.csv("./Data/income_train.csv")
summary(income_train$Income2005)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      63   23000   39000   49894   61500  703637

Write a formula to express log(Income2005) as a function of the five tests as the variable fmla.log. Print it.

# Write the formula for log income as a function of the tests and print it
(fmla.log <- log(Income2005)~Arith+Word+Parag+Math+AFQT)
## log(Income2005) ~ Arith + Word + Parag + Math + AFQT

Fit a linear model of log(Income2005) to the income_train data: model.log.

# Fit the linear model
model.log <-  lm(fmla.log,data=income_train)

Use model.log to predict income on the income_test dataset. Put it in the column logpred. Check summary() of logpred to see that the magnitudes are much different from those of Income2005.

# Make predictions on income_test
income_test$logpred <- predict(model.log,newdata=income_test)
summary(income_test$logpred)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   9.766  10.133  10.423  10.419  10.705  11.006

Reverse the log transformation to put the predictions into “monetary units”: exp(income_test$logpred). Check summary() of pred.income and see that the magnitudes are now similar to Income2005 magnitudes.

# Convert the predictions to monetary units
income_test$pred.income <- exp(income_test$logpred)
summary(income_test$pred.income)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   17432   25167   33615   35363   44566   60217

Fill in the blanks to plot a scatter plot of predicted income vs income on the test set.

ggplot(income_test, aes(x = pred.income, y = Income2005)) + 
  geom_point() + 
  geom_abline(color = "blue")

Comparing RMSE and root-mean-squared Relative Error

In this exercise, you will show that log-transforming a monetary output before modeling improves mean relative error (but increases RMSE) compared to modeling the monetary output directly. You will compare the results of model.log from the previous exercise to a model (model.abs) that directly fits income.

The income_train and income_test datasets are loaded in your workspace, along with your model, model.log.

Also in the workspace:

model.abs: a model that directly fits income to the inputs using the formula

Income2005 ~ Arith + Word + Parag + Math + AFQT

# fmla.abs is in the workspace
fmla.abs<-as.formula(Income2005 ~ Arith + Word + Parag + Math + AFQT)
model.abs<-lm(formula = fmla.abs, data = income_train)
# model.abs is in the workspace
summary(model.abs)
## 
## Call:
## lm(formula = fmla.abs, data = income_train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -78728 -24137  -6979  11964 648573 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  17516.7     6420.1   2.728  0.00642 ** 
## Arith         1552.3      303.4   5.116 3.41e-07 ***
## Word          -132.3      265.0  -0.499  0.61754    
## Parag        -1155.1      618.3  -1.868  0.06189 .  
## Math           725.5      372.0   1.950  0.05127 .  
## AFQT           177.8      144.1   1.234  0.21734    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 45500 on 2063 degrees of freedom
## Multiple R-squared:  0.1165, Adjusted R-squared:  0.1144 
## F-statistic:  54.4 on 5 and 2063 DF,  p-value: < 2.2e-16

Fill in the blanks to add predictions from the models to income_test. Don’t forget to take the exponent of the predictions from model.log to undo the log transform!

# Add predictions to the test set
income_test <- income_test %>%
  mutate(pred.absmodel = predict(model.abs, income_test),        # predictions from model.abs
         pred.logmodel = exp(predict(model.log, income_test)))   # predictions from model.log

Fill in the blanks to gather() the predictions and calculate the residuals and relative error.

# Gather the predictions and calculate residuals and relative error
income_long <- income_test %>% 
  gather(key = modeltype, value = pred, pred.absmodel, pred.logmodel) %>%
  mutate(residual = pred-Income2005,   # residuals
         relerr   = residual/Income2005)   # relative error

Fill in the blanks to calculate the RMSE and relative RMSE for predictions. Which model has larger absolute error? Larger relative error?

# Calculate RMSE and relative RMSE and compare
income_long %>% 
  group_by(modeltype) %>%      # group by modeltype
  summarize(rmse     = sqrt(mean(residual^2)),    # RMSE
            rmse.rel = sqrt(mean(relerr^2)))   # Root mean squared relative error
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 3
##   modeltype       rmse rmse.rel
##   <chr>          <dbl>    <dbl>
## 1 pred.absmodel 37448.     3.18
## 2 pred.logmodel 39235.     2.22

Transforming inputs before modeling

Input transforms: the “hockey stick”

In this exercise, we will build a model to predict price from a measure of the house’s size (surface area). The data set houseprice has the columns:

- price: house price in units of $1000
- size: surface area

A scatterplot of the data shows that the data is quite non-linear: a sort of “hockey stick”, where price is fairly flat for smaller houses but rises steeply as the house gets larger. Quadratics and cubics are often good functional forms to express hockey-stick-like relationships. Note that there may not be a “physical” reason that price is related to the square of the size; a quadratic is simply a closed form approximation of the observed relationship.

You will fit a model to predict price as a function of the squared size, and look at its fit on the training data.

Because ^ is also a symbol to express interactions, use the function I() to treat the expression x^2 “as is”: that is, as the square of x rather than the interaction of x with itself.

exampleFormula = y ~ I(x^2)
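A quick aside, not part of the exercise: the difference is easy to see on a toy data frame (df_toy below is made up for illustration). Without I(), the ^ is expanded as a formula operator, so x^2 collapses to just x.

# Toy illustration (hypothetical data, not from the exercise)
df_toy <- data.frame(x = 1:10, y = (1:10)^2 + rnorm(10))
coef(lm(y ~ x^2, data = df_toy))     # ^ read as a formula operator: only an x term
coef(lm(y ~ I(x^2), data = df_toy))  # I() gives a true squared term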

EXERCISES:

The data set houseprice is in the workspace.

# houseprice is in the workspace
houseprice <- read.csv("./Data/houseprice.csv")
summary(houseprice)
##        X              size           price          pred_lin     
##  Min.   : 1.00   Min.   : 44.0   Min.   : 42.0   Min.   : 41.13  
##  1st Qu.:10.75   1st Qu.: 73.5   1st Qu.:164.5   1st Qu.:166.58  
##  Median :20.50   Median : 91.0   Median :203.5   Median :236.31  
##  Mean   :20.50   Mean   : 94.3   Mean   :249.2   Mean   :243.75  
##  3rd Qu.:30.25   3rd Qu.:118.5   3rd Qu.:287.8   3rd Qu.:337.04  
##  Max.   :40.00   Max.   :150.0   Max.   :573.0   Max.   :469.62  
##     pred_sqr     
##  Min.   : 80.52  
##  1st Qu.:156.78  
##  Median :216.96  
##  Mean   :245.14  
##  3rd Qu.:337.13  
##  Max.   :517.12

Write a formula, fmla_sqr, to express price as a function of squared size. Print it.

# Create the formula for price as a function of squared size
(fmla_sqr <- price~I(size^2))
## price ~ I(size^2)

Fit a model model_sqr to the data using fmla_sqr

# Fit a model of price as a function of squared size (use fmla_sqr)
model_sqr <- lm(fmla_sqr, data = houseprice)

For comparison, fit a linear model model_lin to the data using the formula price ~ size.

# Fit a model of price as a linear function of size
model_lin <- lm(price ~ size, data = houseprice)

Fill in the blanks to:
- make predictions on the training data from the two models
- gather the predictions into a single column, pred
- graphically compare the predictions of the two models to the data

Which fits better?

# Make predictions and compare
houseprice %>% 
    mutate(pred_lin = predict(model_lin),       # predictions from linear model
           pred_sqr = predict(model_sqr)) %>%   # predictions from quadratic model 
    gather(key = modeltype, value = pred, pred_lin, pred_sqr) %>% # gather the predictions
    ggplot(aes(x = size)) + 
       geom_point(aes(y = price)) +                   # actual prices
       geom_line(aes(y = pred, color = modeltype)) + # the predictions
       scale_color_brewer(palette = "Dark2")

Input transforms: the “hockey stick” (2)

In the last exercise you saw that a quadratic model seems to fit the houseprice data better than a linear model. In this exercise you will confirm whether the quadratic model would perform better on out-of-sample data. Since this data set is small, you will use cross-validation. The quadratic formula fmla_sqr that you created in the last exercise is in your workspace.

For comparison, the sample code will calculate cross-validation predictions from a linear model price ~ size.

EXERCISE:

# houseprice is in the workspace
summary(houseprice)
##        X              size           price          pred_lin     
##  Min.   : 1.00   Min.   : 44.0   Min.   : 42.0   Min.   : 41.13  
##  1st Qu.:10.75   1st Qu.: 73.5   1st Qu.:164.5   1st Qu.:166.58  
##  Median :20.50   Median : 91.0   Median :203.5   Median :236.31  
##  Mean   :20.50   Mean   : 94.3   Mean   :249.2   Mean   :243.75  
##  3rd Qu.:30.25   3rd Qu.:118.5   3rd Qu.:287.8   3rd Qu.:337.04  
##  Max.   :40.00   Max.   :150.0   Max.   :573.0   Max.   :469.62  
##     pred_sqr     
##  Min.   : 80.52  
##  1st Qu.:156.78  
##  Median :216.96  
##  Mean   :245.14  
##  3rd Qu.:337.13  
##  Max.   :517.12

The data frame houseprice and the formula fmla_sqr from the last exercise are in the workspace.

Use kWayCrossValidation() to create a splitting plan for a 3-fold cross validation. You can set the 3rd and 4th arguments of the function to NULL.

# fmla_sqr is in the workspace
fmla_sqr
## price ~ I(size^2)
# Create a splitting plan for 3-fold cross validation
set.seed(34245)  # set the seed for reproducibility
splitPlan <- kWayCrossValidation(nrow(houseprice), 3, NULL, NULL)
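Optionally, you can peek at the structure of one fold; each element of splitPlan is a list of row indices (this check is not required by the exercise).

# Each fold has $train (rows to fit on) and $app (rows to predict on)
str(splitPlan[[1]], max.level = 1)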

Examine and run the sample code to get the 3-fold cross-validation predictions of the model price ~ size and add them to the column pred_lin.

# Sample code: get cross-val predictions for price ~ size
houseprice$pred_lin <- 0  # initialize the prediction vector
for(i in 1:3) {
  split <- splitPlan[[i]]
  model_lin <- lm(price ~ size, data = houseprice[split$train,])
  houseprice$pred_lin[split$app] <- predict(model_lin, newdata = houseprice[split$app,])
}

Get the cross-validation predictions for price as a function of squared size. Assign them to the column pred_sqr. The sample code gives you the procedure. You can use the splitting plan you already created.

# Get cross-val predictions for price as a function of size^2 (use fmla_sqr)
houseprice$pred_sqr <- 0 # initialize the prediction vector
for(i in 1:3) {
  split <- splitPlan[[i]]
  model_sqr <- lm(fmla_sqr, data = houseprice[split$train, ])
  houseprice$pred_sqr[split$app] <- predict(model_sqr, newdata = houseprice[split$app, ])
}

Fill in the blanks to gather the predictions and calculate the residuals.

# Gather the predictions and calculate the residuals
houseprice_long <- houseprice %>%
  gather(key = modeltype, value = pred, pred_lin, pred_sqr) %>%
  mutate(residuals = pred-price)

Fill in the blanks to compare the RMSE for the two models. Which one fits better?

# Compare the cross-validated RMSE for the two models
houseprice_long %>% 
  group_by(modeltype) %>% # group by modeltype
  summarize(rmse =  sqrt(mean(residuals^2)))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 2
##   modeltype  rmse
##   <chr>     <dbl>
## 1 pred_lin   73.1
## 2 pred_sqr   60.3

Dealing with Non-Linear Responses

Fit a model of sparrow survival probability

In this exercise, you will estimate the probability that a sparrow survives a severe winter storm, based on physical characteristics of the sparrow. The dataset sparrow is loaded into your workspace. The outcome to be predicted is status (“Survived”, “Perished”). The variables we will consider are:

- total_length: length of the bird from tip of beak to tip of tail (mm)
- weight: in grams
- humerus: length of the humerus (the “upper arm bone” that connects the wing to the body) (inches)

Remember that when using glm() to create a logistic regression model, you must explicitly specify that family = binomial:

glm(formula, data = data, family = binomial)

You will call summary() and broom::glance() to see different ways of examining a logistic regression model. One of the diagnostics that you will look at is the analog to R-squared, called pseudo-R-squared.

You can think of deviance as analogous to variance: it is a measure of the variation in categorical data. The pseudo-R-squared is analogous to R-squared for standard regression: R-squared is a measure of the “variance explained” of a regression model, and the pseudo-R-squared is a measure of the “deviance explained”, computed as 1 - deviance/null.deviance.

EXERCISES:

The data frame sparrow and the package broom are loaded in the workspace.

# sparrow is in the workspace
sparrow <- read.csv("./Data/sparrow.csv")
summary(sparrow)
##        X           status              age             total_length  
##  Min.   : 1.0   Length:87          Length:87          Min.   :153.0  
##  1st Qu.:22.5   Class :character   Class :character   1st Qu.:158.0  
##  Median :44.0   Mode  :character   Mode  :character   Median :160.0  
##  Mean   :44.0                                         Mean   :160.4  
##  3rd Qu.:65.5                                         3rd Qu.:162.5  
##  Max.   :87.0                                         Max.   :167.0  
##     wingspan         weight       beak_head        humerus      
##  Min.   :236.0   Min.   :23.2   Min.   :29.80   Min.   :0.6600  
##  1st Qu.:245.0   1st Qu.:24.7   1st Qu.:31.40   1st Qu.:0.7250  
##  Median :247.0   Median :25.8   Median :31.70   Median :0.7400  
##  Mean   :247.5   Mean   :25.8   Mean   :31.64   Mean   :0.7353  
##  3rd Qu.:251.0   3rd Qu.:26.7   3rd Qu.:32.10   3rd Qu.:0.7500  
##  Max.   :256.0   Max.   :31.0   Max.   :33.00   Max.   :0.7800  
##      femur           legbone          skull           sternum      
##  Min.   :0.6500   Min.   :1.010   Min.   :0.5600   Min.   :0.7700  
##  1st Qu.:0.7000   1st Qu.:1.110   1st Qu.:0.5900   1st Qu.:0.8300  
##  Median :0.7100   Median :1.130   Median :0.6000   Median :0.8500  
##  Mean   :0.7134   Mean   :1.131   Mean   :0.6032   Mean   :0.8511  
##  3rd Qu.:0.7300   3rd Qu.:1.160   3rd Qu.:0.6100   3rd Qu.:0.8800  
##  Max.   :0.7600   Max.   :1.230   Max.   :0.6400   Max.   :0.9300  
##   survived            pred         
##  Mode :logical   Min.   :0.006315  
##  FALSE:36        1st Qu.:0.278124  
##  TRUE :51        Median :0.703563  
##                  Mean   :0.586207  
##                  3rd Qu.:0.875717  
##                  Max.   :0.994546

As suggested in the video, you will predict on the outcomes TRUE and FALSE. Create a new column survived in the sparrow data frame that is TRUE when status == “Survived”.

# Create the survived column
sparrow$survived <- sparrow$status=="Survived"

Create the formula fmla that expresses survived as a function of the variables of interest. Print it.

# Create the formula
(fmla <- survived~total_length+weight+humerus)
## survived ~ total_length + weight + humerus

Fit a logistic regression model to predict the probability of sparrow survival. Assign the model to the variable sparrow_model.

# Fit the logistic regression model
sparrow_model <- glm(fmla, data = sparrow, family = "binomial")

Call summary() to see the coefficients of the model, the deviance and the null deviance.

# Call summary
summary(sparrow_model)
## 
## Call:
## glm(formula = fmla, family = "binomial", data = sparrow)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1117  -0.6026   0.2871   0.6577   1.7082  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   46.8813    16.9631   2.764 0.005715 ** 
## total_length  -0.5435     0.1409  -3.858 0.000115 ***
## weight        -0.5689     0.2771  -2.053 0.040060 *  
## humerus       75.4610    19.1586   3.939 8.19e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 118.008  on 86  degrees of freedom
## Residual deviance:  75.094  on 83  degrees of freedom
## AIC: 83.094
## 
## Number of Fisher Scoring iterations: 5

Call glance() on the model to see the deviances and other diagnostics in a data frame. Assign the output from glance() to the variable perf.

# Call glance
(perf <- glance(sparrow_model))
## # A tibble: 1 x 8
##   null.deviance df.null logLik   AIC   BIC deviance df.residual  nobs
##           <dbl>   <int>  <dbl> <dbl> <dbl>    <dbl>       <int> <int>
## 1          118.      86  -37.5  83.1  93.0     75.1          83    87

Calculate the pseudo-R-squared.

# Calculate pseudo-R-squared
(pseudoR2 <- 1-perf$deviance/perf$null.deviance)
## [1] 0.3636526

Predict sparrow survival

In this exercise you will predict the probability of survival using the sparrow survival model from the previous exercise.

Recall that when calling predict() to get the predicted probabilities from a glm() model, you must specify that you want the response:

predict(model, type = “response”)

Otherwise, predict() on a logistic regression model returns the predicted log-odds of the event, not the probability.
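As an optional sanity check (assuming sparrow_model from the previous exercise is still loaded): the default link-scale prediction can be mapped back to a probability with plogis(), the inverse logit, and matches type = "response".

# Optional check: inverse logit of the log-odds equals the response-scale prediction
log_odds <- predict(sparrow_model)           # default: link scale (log-odds)
probs    <- plogis(log_odds)                 # p = 1 / (1 + exp(-log_odds))
all.equal(as.numeric(probs), as.numeric(predict(sparrow_model, type = "response")))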

You will also use the GainCurvePlot() function to plot the gain curve from the model predictions. If the model’s gain curve is close to the ideal (“wizard”) gain curve, then the model sorted the sparrows well: that is, the model predicted that sparrows that actually survived would have a higher probability of survival. The inputs to the GainCurvePlot() function are:

- frame: data frame with the prediction column and the ground truth column
- xvar: the name of the column of predictions (as a string)
- truthVar: the name of the column with the actual outcome (as a string)
- title: a title for the plot (as a string)

GainCurvePlot(frame, xvar, truthVar, title)

EXERCISE:

The dataframe sparrow and the model sparrow_model are in the workspace.

# sparrow is in the workspace
summary(sparrow)
##        X           status              age             total_length  
##  Min.   : 1.0   Length:87          Length:87          Min.   :153.0  
##  1st Qu.:22.5   Class :character   Class :character   1st Qu.:158.0  
##  Median :44.0   Mode  :character   Mode  :character   Median :160.0  
##  Mean   :44.0                                         Mean   :160.4  
##  3rd Qu.:65.5                                         3rd Qu.:162.5  
##  Max.   :87.0                                         Max.   :167.0  
##     wingspan         weight       beak_head        humerus      
##  Min.   :236.0   Min.   :23.2   Min.   :29.80   Min.   :0.6600  
##  1st Qu.:245.0   1st Qu.:24.7   1st Qu.:31.40   1st Qu.:0.7250  
##  Median :247.0   Median :25.8   Median :31.70   Median :0.7400  
##  Mean   :247.5   Mean   :25.8   Mean   :31.64   Mean   :0.7353  
##  3rd Qu.:251.0   3rd Qu.:26.7   3rd Qu.:32.10   3rd Qu.:0.7500  
##  Max.   :256.0   Max.   :31.0   Max.   :33.00   Max.   :0.7800  
##      femur           legbone          skull           sternum      
##  Min.   :0.6500   Min.   :1.010   Min.   :0.5600   Min.   :0.7700  
##  1st Qu.:0.7000   1st Qu.:1.110   1st Qu.:0.5900   1st Qu.:0.8300  
##  Median :0.7100   Median :1.130   Median :0.6000   Median :0.8500  
##  Mean   :0.7134   Mean   :1.131   Mean   :0.6032   Mean   :0.8511  
##  3rd Qu.:0.7300   3rd Qu.:1.160   3rd Qu.:0.6100   3rd Qu.:0.8800  
##  Max.   :0.7600   Max.   :1.230   Max.   :0.6400   Max.   :0.9300  
##   survived            pred         
##  Mode :logical   Min.   :0.006315  
##  FALSE:36        1st Qu.:0.278124  
##  TRUE :51        Median :0.703563  
##                  Mean   :0.586207  
##                  3rd Qu.:0.875717  
##                  Max.   :0.994546
# sparrow_model is in the workspace
summary(sparrow_model)
## 
## Call:
## glm(formula = fmla, family = "binomial", data = sparrow)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1117  -0.6026   0.2871   0.6577   1.7082  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   46.8813    16.9631   2.764 0.005715 ** 
## total_length  -0.5435     0.1409  -3.858 0.000115 ***
## weight        -0.5689     0.2771  -2.053 0.040060 *  
## humerus       75.4610    19.1586   3.939 8.19e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 118.008  on 86  degrees of freedom
## Residual deviance:  75.094  on 83  degrees of freedom
## AIC: 83.094
## 
## Number of Fisher Scoring iterations: 5

Create a new column in sparrow called pred that contains the predictions on the training data.

# Make predictions
sparrow$pred <- predict(sparrow_model, type = "response")

Call GainCurvePlot() to create the gain curve of predictions. Does the model do a good job of sorting the sparrows by whether or not they actually survived?

# Look at gain curve
GainCurvePlot(sparrow, "pred", "survived", "sparrow survival model")

Poisson and quasipoisson regression to predict counts

Fit a model to predict bike rental counts

In this exercise you will build a model to predict the number of bikes rented in an hour as a function of the weather, the type of day (holiday, working day, or weekend), and the time of day. You will train the model on data from the month of July.

The data frame has the columns:

- cnt: the number of bikes rented in that hour (the outcome)
- hr: the hour of the day (0-23)
- holiday: TRUE/FALSE
- workingday: TRUE if neither a holiday nor a weekend, else FALSE
- weathersit: categorical, “Clear to partly cloudy”/“Light Precipitation”/“Misty”
- temp: normalized temperature in Celsius
- atemp: normalized “feeling” temperature in Celsius
- hum: normalized humidity
- windspeed: normalized windspeed
- instant: the time index, the number of hours since the beginning of the data set (not an input variable)
- mnth and yr: month and year indices (not input variables)

Remember that you must specify family = poisson or family = quasipoisson when using glm() to fit a count model.

Since there are a lot of input variables, for convenience we will specify the outcome and the inputs in variables, and use paste() to assemble a string representing the model formula.

EXERCISE:

The data frame bikesJuly is in the workspace. The names of the outcome variable and the input variables are also in the workspace as the variables outcome and vars respectively.

# bikesJuly is in the workspace
bikesJuly <- read.csv("./Data/bikesJuly.csv")
bikesAugust <- read.csv("./Data/bikesAugust.csv")
str(bikesJuly)
## 'data.frame':    744 obs. of  13 variables:
##  $ X         : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ hr        : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ holiday   : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ workingday: logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ weathersit: chr  "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" ...
##  $ temp      : num  0.76 0.74 0.72 0.72 0.7 0.68 0.7 0.74 0.78 0.82 ...
##  $ atemp     : num  0.727 0.697 0.697 0.712 0.667 ...
##  $ hum       : num  0.66 0.7 0.74 0.84 0.79 0.79 0.79 0.7 0.62 0.56 ...
##  $ windspeed : num  0 0.1343 0.0896 0.1343 0.194 ...
##  $ cnt       : int  149 93 90 33 4 10 27 50 142 219 ...
##  $ instant   : int  13004 13005 13006 13007 13008 13009 13010 13011 13012 13013 ...
##  $ mnth      : int  7 7 7 7 7 7 7 7 7 7 ...
##  $ yr        : int  1 1 1 1 1 1 1 1 1 1 ...
# The outcome column
outcome<-c("cnt")
# The inputs to use
vars<-c("hr","holiday", "workingday", "weathersit", "temp", "atemp", "hum", "windspeed")

Fill in the blanks to create the formula fmla expressing cnt as a function of the inputs. Print it.

# Create the formula string for bikes rented as a function of the inputs
(fmla <- paste(outcome, "~", paste(vars, collapse = " + ")))
## [1] "cnt ~ hr + holiday + workingday + weathersit + temp + atemp + hum + windspeed"

Calculate the mean (mean()) and variance (var()) of bikesJuly$cnt. Should you use poisson or quasipoisson regression?

# Calculate the mean and variance of the outcome
(mean_bikes <- mean(bikesJuly$cnt))
## [1] 273.6653
(var_bikes <- var(bikesJuly$cnt))
## [1] 45863.84

Use glm() to fit a model to the bikesJuly data: bike_model.

# Fit the model
bike_model <- glm(fmla,data=bikesJuly,family=quasipoisson)

Use glance() to look at the model’s fit statistics. Assign the output of glance() to the variable perf.

# Call glance
(perf <- glance(bike_model))
## # A tibble: 1 x 8
##   null.deviance df.null logLik   AIC   BIC deviance df.residual  nobs
##           <dbl>   <int>  <dbl> <dbl> <dbl>    <dbl>       <int> <int>
## 1       133365.     743     NA    NA    NA   87059.         734   744

Calculate the pseudo-R-squared of the model.

# Calculate pseudo-R-squared
(pseudoR2 <- 1-perf$deviance/perf$null.deviance)
## [1] 0.3472145

Predict bike rentals on new data

In this exercise you will use the model you built in the previous exercise to make predictions for the month of August. The data set bikesAugust has the same columns as bikesJuly.

Recall that you must specify type = “response” with predict() when predicting counts from a glm poisson or quasipoisson model.
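For intuition, and as an optional check (this assumes bike_model and bikesAugust from above are loaded): a quasipoisson glm uses a log link, so type = "response" is simply the exponential of the link-scale prediction.

# Optional check: response-scale predictions are exp() of the link-scale ones
link_pred <- predict(bike_model, newdata = bikesAugust)                    # log scale
resp_pred <- predict(bike_model, newdata = bikesAugust, type = "response") # counts
all.equal(as.numeric(exp(link_pred)), as.numeric(resp_pred))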

EXERCISE:

The model bike_model and the data frame bikesAugust are in the workspace.

# bikesAugust is in the workspace
str(bikesAugust)
## 'data.frame':    744 obs. of  14 variables:
##  $ X         : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ hr        : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ holiday   : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ workingday: logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
##  $ weathersit: chr  "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" ...
##  $ temp      : num  0.68 0.66 0.64 0.64 0.64 0.64 0.64 0.64 0.66 0.68 ...
##  $ atemp     : num  0.636 0.606 0.576 0.576 0.591 ...
##  $ hum       : num  0.79 0.83 0.83 0.83 0.78 0.78 0.78 0.83 0.78 0.74 ...
##  $ windspeed : num  0.1642 0.0896 0.1045 0.1045 0.1343 ...
##  $ cnt       : int  47 33 13 7 4 49 185 487 681 350 ...
##  $ instant   : int  13748 13749 13750 13751 13752 13753 13754 13755 13756 13757 ...
##  $ mnth      : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ yr        : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ pred      : num  94.96 51.74 37.98 17.58 9.36 ...
# bike_model is in the workspace
summary(bike_model)
## 
## Call:
## glm(formula = fmla, family = quasipoisson, data = bikesJuly)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -20.613   -9.712   -3.379    4.536   34.847  
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    5.783187   0.716088   8.076 2.74e-15 ***
## hr                             0.052373   0.004225  12.395  < 2e-16 ***
## holidayTRUE                    0.050045   0.142381   0.351   0.7253    
## workingdayTRUE                 0.041671   0.060260   0.692   0.4895    
## weathersitLight Precipitation -0.049635   0.126787  -0.391   0.6956    
## weathersitMisty                0.114739   0.065897   1.741   0.0821 .  
## temp                          -2.231584   1.859082  -1.200   0.2304    
## atemp                          2.266270   1.573259   1.440   0.1502    
## hum                           -1.563113   0.390401  -4.004 6.87e-05 ***
## windspeed                      0.590418   0.263472   2.241   0.0253 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 129.1499)
## 
##     Null deviance: 133365  on 743  degrees of freedom
## Residual deviance:  87059  on 734  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 5

Use predict to predict the number of bikes per hour on the bikesAugust data. Assign the predictions to the column bikesAugust$pred.

# Make predictions on August data
bikesAugust$pred  <- predict(bike_model, type = "response", newdata = bikesAugust)
head(bikesAugust)
##   X hr holiday workingday             weathersit temp  atemp  hum windspeed cnt
## 1 1  0   FALSE       TRUE Clear to partly cloudy 0.68 0.6364 0.79    0.1642  47
## 2 2  1   FALSE       TRUE Clear to partly cloudy 0.66 0.6061 0.83    0.0896  33
## 3 3  2   FALSE       TRUE Clear to partly cloudy 0.64 0.5758 0.83    0.1045  13
## 4 4  3   FALSE       TRUE Clear to partly cloudy 0.64 0.5758 0.83    0.1045   7
## 5 5  4   FALSE       TRUE                  Misty 0.64 0.5909 0.78    0.1343   4
## 6 6  5   FALSE       TRUE                  Misty 0.64 0.5909 0.78    0.1343  49
##   instant mnth yr      pred
## 1   13748    8  1 100.65729
## 2   13749    8  1  93.08276
## 3   13750    8  1  96.60430
## 4   13751    8  1 101.79863
## 5   13752    8  1 137.01351
## 6   13753    8  1 144.38060

Fill in the blanks to get the RMSE of the predictions on the August data.

# Calculate the RMSE
bikesAugust %>% 
  mutate(residual = pred - cnt) %>%
  summarize(rmse  = sqrt(mean(residual^2)))
##       rmse
## 1 188.1238

Fill in the blanks to generate the plot of predictions to actual counts. Do any of the predictions appear negative?

# Plot predictions vs cnt (pred on x-axis)
ggplot(bikesAugust, aes(x = pred, y = cnt)) +
  geom_point() + 
  geom_abline(color = "darkblue")

Visualize the bike rental predictions

In the previous exercise, you visualized the bike model’s predictions using the standard “outcome vs. prediction” scatter plot. Since the bike rental data is time series data, you might be interested in how the model performs as a function of time. In this exercise, you will compare the predictions and actual rentals on an hourly basis, for the first 14 days of August.

To create the plot you will use the function tidyr::gather() to consolidate the predicted and actual values from bikesAugust in a single column. gather() takes as arguments:

- the “wide” data frame to be gathered (implicit in a pipe)
- the name of the key column to be created, which holds the names of the gathered columns
- the name of the value column to be created, which holds the values of the gathered columns
- the names of the columns to be gathered into a single column

You’ll use the gathered data frame to compare the actual and predicted rental counts as a function of time. The time index, instant, counts the number of observations since the beginning of data collection. The sample code converts the instants to daily units, starting from 0. A minimal toy example of gather() is sketched just below.
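As promised, a toy sketch of gather(); wide_toy is a made-up two-row data frame, used only to show the key/value mechanics.

# Hypothetical mini data frame: two "wide" columns, cnt and pred
wide_toy <- data.frame(id = 1:2, cnt = c(10, 20), pred = c(12, 18))
gather(wide_toy, key = valuetype, value = value, cnt, pred)
# Result: 4 rows, one per (id, gathered column) pair; the former column name
# goes into valuetype and its value into value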

EXERCISE:

The data frame bikesAugust, with the predictions (bikesAugust$pred) is in the workspace.

Fill in the blanks to plot the predictions and actual counts by hour for the first 14 days of August:
- convert instant to be in day units, rather than hours
- gather() the cnt and pred columns into a column called value, with a key called valuetype
- filter() for the first two weeks of August
- plot value as a function of instant (day)

# Plot predictions and cnt by date/time
head(bikesAugust)
##   X hr holiday workingday             weathersit temp  atemp  hum windspeed cnt
## 1 1  0   FALSE       TRUE Clear to partly cloudy 0.68 0.6364 0.79    0.1642  47
## 2 2  1   FALSE       TRUE Clear to partly cloudy 0.66 0.6061 0.83    0.0896  33
## 3 3  2   FALSE       TRUE Clear to partly cloudy 0.64 0.5758 0.83    0.1045  13
## 4 4  3   FALSE       TRUE Clear to partly cloudy 0.64 0.5758 0.83    0.1045   7
## 5 5  4   FALSE       TRUE                  Misty 0.64 0.5909 0.78    0.1343   4
## 6 6  5   FALSE       TRUE                  Misty 0.64 0.5909 0.78    0.1343  49
##   instant mnth yr      pred
## 1   13748    8  1 100.65729
## 2   13749    8  1  93.08276
## 3   13750    8  1  96.60430
## 4   13751    8  1 101.79863
## 5   13752    8  1 137.01351
## 6   13753    8  1 144.38060
bikesAugust %>% 
  # set start to 0, convert unit to days
  mutate(instant = (instant - min(instant))/24) %>%  
  # gather cnt and pred into a value column
  gather(key = valuetype, value = value, cnt, pred) %>%
  filter(instant < 14) %>% # restrict to the first 14 days
  # plot value by instant
  ggplot(aes(x = instant, y = value, color = valuetype, linetype = valuetype)) + 
  geom_point() + 
  geom_line() + 
  scale_x_continuous("Day", breaks = 0:14, labels = 0:14) + 
  scale_color_brewer(palette = "Dark2") + 
  ggtitle("Predicted August bike rentals, Quasipoisson model")

Does the model see the general time patterns in bike rentals?

GAM to learn non-linear transforms

Model soybean growth with GAM

In this exercise you will model the average leaf weight on a soybean plant as a function of time (after planting). As you will see, the soybean plant doesn’t grow at a steady rate, but rather has a “growth spurt” that eventually tapers off. Hence, leaf weight is not well described by a linear model.

Recall that you can designate which variable you want to model non-linearly in a formula with the s() function:

y ~ s(x)

Also remember that gam() from the package mgcv has the calling interface:

gam(formula, family, data)

For standard regression, use family = gaussian (the default).

The soybean training data, soybean_train is loaded into your workspace. It has two columns: the outcome weight and the variable Time. For comparison, the linear model model.lin, which was fit using the formula weight ~ Time has already been loaded into the workspace as well.

EXERCISE:

Fill in the blanks to plot weight versus Time (Time on x-axis). Does the relationship look linear?

library(foreign)
Soybean <- load("./Data/Soybean.RData")
# soybean_train is in the workspace
summary(soybean_train)
##       Plot     Variety   Year          Time           weight       
##  1988F6 : 10   F:161   1988:124   Min.   :14.00   Min.   : 0.0290  
##  1988F7 :  9   P:169   1989:102   1st Qu.:27.00   1st Qu.: 0.6663  
##  1988P1 :  9           1990:104   Median :42.00   Median : 3.5233  
##  1988P8 :  9                      Mean   :43.56   Mean   : 6.1645  
##  1988P2 :  9                      3rd Qu.:56.00   3rd Qu.:10.3808  
##  1988F3 :  8                      Max.   :84.00   Max.   :27.3700  
##  (Other):276
# Plot weight vs Time (Time on x axis)
ggplot(soybean_train, aes(x = Time, y = weight)) + 
  geom_point() 

Load the package mgcv.

# Load the package mgcv
library(mgcv)

Create the formula fmla.gam to express weight as a non-linear function of Time. Print it.

# Create the formula 
(fmla.gam <- weight ~ s(Time))
## weight ~ s(Time)

Fit a generalized additive model on soybean_train using fmla.gam.

# Fit the GAM Model
model.gam <- gam(fmla.gam, family = gaussian, data = soybean_train)

Call summary() on the linear model model.lin (already in your workspace). What is the R-squared?

# From previous step
library(mgcv)
fmla.gam <- weight ~ s(Time)
model.gam <- gam(fmla.gam, data = soybean_train, family = gaussian)

Call summary() on model.gam. The “deviance explained” reports the model’s unadjusted R-squared. What is the R-squared? Which model appears to be a better fit to the training data?

# Call summary() on model.gam and look for R-squared
summary(model.gam)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## weight ~ s(Time)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   6.1645     0.1143   53.93   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##           edf Ref.df     F p-value    
## s(Time) 8.495   8.93 338.2  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.902   Deviance explained = 90.4%
## GCV = 4.4395  Scale est. = 4.3117    n = 330

Call plot() on model.gam to see the derived relationship between Time and weight

# Call plot() on model.gam
plot(model.gam)

Tree-Based Methods

Random forests

Build a random forest model for bike rentals

In this exercise you will again build a model to predict the number of bikes rented in an hour as a function of the weather, the type of day (holiday, working day, or weekend), and the time of day. You will train the model on data from the month of July.

You will use the ranger package to fit the random forest model. For this exercise, the key arguments to the ranger() call are:

- formula
- data
- num.trees: the number of trees in the forest
- respect.unordered.factors: specifies how to treat unordered factor variables. We recommend setting this to “order” for regression.
- seed: because this is a random algorithm, you will set the seed to get reproducible results

Since there are a lot of input variables, for convenience we will specify the outcome and the inputs in the variables outcome and vars, and use paste() to assemble a string representing the model formula.

EXERCISE:

The data frame bikesJuly is in the workspace. The sample code specifies the names of the outcome and input variables.

bikesJuly <- read.csv("./Data/bikesJuly.csv")
# bikesJuly is in the workspace
str(bikesJuly)
## 'data.frame':    744 obs. of  13 variables:
##  $ X         : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ hr        : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ holiday   : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ workingday: logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ weathersit: chr  "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" ...
##  $ temp      : num  0.76 0.74 0.72 0.72 0.7 0.68 0.7 0.74 0.78 0.82 ...
##  $ atemp     : num  0.727 0.697 0.697 0.712 0.667 ...
##  $ hum       : num  0.66 0.7 0.74 0.84 0.79 0.79 0.79 0.7 0.62 0.56 ...
##  $ windspeed : num  0 0.1343 0.0896 0.1343 0.194 ...
##  $ cnt       : int  149 93 90 33 4 10 27 50 142 219 ...
##  $ instant   : int  13004 13005 13006 13007 13008 13009 13010 13011 13012 13013 ...
##  $ mnth      : int  7 7 7 7 7 7 7 7 7 7 ...
##  $ yr        : int  1 1 1 1 1 1 1 1 1 1 ...
# Random seed to reproduce results
seed <- 423563
# The outcome column
(outcome <- "cnt")
## [1] "cnt"
# The input variables
(vars <- c("hr", "holiday", "workingday", "weathersit", "temp", "atemp", "hum", "windspeed"))
## [1] "hr"         "holiday"    "workingday" "weathersit" "temp"      
## [6] "atemp"      "hum"        "windspeed"

Fill in the blanks to create the formula fmla expressing cnt as a function of the inputs. Print it.

Load the package ranger.

# Create the formula string for bikes rented as a function of the inputs
(fmla <- paste(outcome, "~", paste(vars, collapse = " + ")))
## [1] "cnt ~ hr + holiday + workingday + weathersit + temp + atemp + hum + windspeed"
# Load the package ranger
library(ranger)

Use ranger to fit a model to the bikesJuly data: bike_model_rf. The first argument to ranger() is the formula, fmla. Use 500 trees and respect.unordered.factors = “order”. Set the seed to seed for reproducible results. Print the model. What is the R-squared?

# Fit and print the random forest model
(bike_model_rf <- ranger(fmla, # formula 
                         bikesJuly, # data
                         num.trees = 500, 
                         respect.unordered.factors = "order", 
                         seed = seed))
## Ranger result
## 
## Call:
##  ranger(fmla, bikesJuly, num.trees = 500, respect.unordered.factors = "order",      seed = seed) 
## 
## Type:                             Regression 
## Number of trees:                  500 
## Sample size:                      744 
## Number of independent variables:  8 
## Mtry:                             2 
## Target node size:                 5 
## Variable importance mode:         none 
## Splitrule:                        variance 
## OOB prediction error (MSE):       10491.9 
## R squared (OOB):                  0.771238

Predict bike rentals with the random forest model

In this exercise you will use the model that you fit in the previous exercise to predict bike rentals for the month of August.

The predict() function for a ranger model produces a list. One of the elements of this list is predictions, a vector of predicted values. You can access predictions with the $ notation for accessing named elements of a list:

predict(model, data)$predictions

EXERCISE:

The model bike_model_rf and the dataset bikesAugust (for evaluation) are loaded into your workspace.

# bikesAugust is in the workspace
str(bikesAugust)
## 'data.frame':    744 obs. of  14 variables:
##  $ X         : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ hr        : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ holiday   : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ workingday: logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
##  $ weathersit: chr  "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" ...
##  $ temp      : num  0.68 0.66 0.64 0.64 0.64 0.64 0.64 0.64 0.66 0.68 ...
##  $ atemp     : num  0.636 0.606 0.576 0.576 0.591 ...
##  $ hum       : num  0.79 0.83 0.83 0.83 0.78 0.78 0.78 0.83 0.78 0.74 ...
##  $ windspeed : num  0.1642 0.0896 0.1045 0.1045 0.1343 ...
##  $ cnt       : int  47 33 13 7 4 49 185 487 681 350 ...
##  $ instant   : int  13748 13749 13750 13751 13752 13753 13754 13755 13756 13757 ...
##  $ mnth      : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ yr        : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ pred      : num  100.7 93.1 96.6 101.8 137 ...
# bike_model_rf is in the workspace
bike_model_rf
## Ranger result
## 
## Call:
##  ranger(fmla, bikesJuly, num.trees = 500, respect.unordered.factors = "order",      seed = seed) 
## 
## Type:                             Regression 
## Number of trees:                  500 
## Sample size:                      744 
## Number of independent variables:  8 
## Mtry:                             2 
## Target node size:                 5 
## Variable importance mode:         none 
## Splitrule:                        variance 
## OOB prediction error (MSE):       10491.9 
## R squared (OOB):                  0.771238

Call predict() on bikesAugust to predict the number of bikes rented in August (cnt). Add the predictions to bikesAugust as the column pred.

# Make predictions on the August data
bikesAugust$pred <- predict(bike_model_rf, bikesAugust)$predictions

Fill in the blanks to calculate the root mean squared error of the predictions. The quasipoisson model you built for this data gave an RMSE of about 188.1. How does this model compare?

# Calculate the RMSE of the predictions
bikesAugust %>% 
  mutate(residual = cnt - pred)  %>%        # calculate the residual
  summarize(rmse  = sqrt(mean(residual^2))) # calculate rmse
##       rmse
## 1 116.1133

Fill in the blanks to plot actual bike rental counts (cnt) versus the predictions (pred on x-axis).

# Plot actual outcome vs predictions (predictions on x-axis)
ggplot(bikesAugust, aes(x = pred, y = cnt)) + 
  geom_point() + 
  geom_abline()

Visualize random forest bike model predictions

In the previous exercise, you saw that the random forest bike model did better on the August data than the quasipoisson model, in terms of RMSE.

In this exercise you will visualize the random forest model’s August predictions as a function of time. The corresponding plot from the quasipoisson model that you built in a previous exercise is in the workspace for you to compare.

Recall that the quasipoisson model mostly identified the pattern of slow and busy hours in the day, but it somewhat underestimated peak demands. You would like to see how the random forest model compares.

The data frame bikesAugust (with predictions) is in the workspace. The plot quasipoisson_plot of quasipoisson model predictions as a function of time is shown.

EXERCISE:

Fill in the blanks to plot the predictions and actual counts by hour for the first 14 days of August: gather() the cnt and pred columns into a column called value, with a key called valuetype.

first_two_weeks <- bikesAugust %>% 
  # Set start to 0, convert unit to days
  mutate(instant = (instant - min(instant)) / 24) %>% 
  # Gather cnt and pred into a column named value with key valuetype
  gather(key = valuetype, value = value, cnt, pred) %>%
  # Filter for rows in the first two weeks
  filter(instant < 14) 

Plot value as a function of instant (day).

# Plot predictions and cnt by date/time 
ggplot(first_two_weeks, aes(x = instant, y = value, color = valuetype, linetype = valuetype)) + 
  geom_point() + 
  geom_line() + 
  scale_x_continuous("Day", breaks = 0:14, labels = 0:14) + 
  scale_color_brewer(palette = "Dark2") + 
  ggtitle("Predicted August bike rentals, Random Forest plot")

How does the random forest model compare?

One-Hot-Encoding Categorical Variables

vtreat on a small example

In this exercise you will use vtreat to one-hot-encode a categorical variable on a small example. vtreat creates a treatment plan to transform categorical variables into indicator variables (coded “lev”), and to clean bad values out of numerical variables (coded “clean”).

To design a treatment plan use the function designTreatmentsZ()

#treatplan <- designTreatmentsZ(data, varlist)

- data: the original training data frame
- varlist: a vector of input variables to be treated (as strings)

designTreatmentsZ() returns a list with an element scoreFrame, a data frame that includes the names and types of the new variables:

#scoreFrame <- treatplan %>% 
            #magrittr::use_series(scoreFrame) %>% 
            #select(varName, origName, code)

- varName: the name of the new treated variable
- origName: the name of the original variable that the treated variable comes from
- code: the type of the new variable. “clean”: a numerical variable with no NAs or NaNs; “lev”: an indicator variable for a specific level of the original categorical variable.

(magrittr::use_series() is an alias for $ that you can use in pipes.)

For these exercises, we want varName where code is either “clean” or “lev”:

#newvarlist <- scoreFrame %>% 
             #filter(code %in% c("clean", "lev")) %>%
             #magrittr::use_series(varName)

To transform the data set into all numerical and one-hot-encoded variables, use prepare():

data.treat <- prepare(treatplan, data, varRestriction = newvarlist)

- treatplan: the treatment plan
- data: the data frame to be treated
- varRestriction: the variables desired in the treated data

EXERCISES:

The data frame dframe and the package magrittr are loaded in the workspace.

# dframe is in the workspace
dframe <- read.csv("./Data/dframe.csv", sep = ";")
colnames(dframe)[1]<- "color"

Print dframe. We will assume that color and size are input variables, and popularity is the outcome to be predicted.

dframe
##    color size popularity
## 1      b   13       1.07
## 2      r   11       1.39
## 3      r   15       0.92
## 4      r   14       1.20
## 5      r   13       1.08
## 6      b   11       0.80
## 7      r    9       1.10
## 8      g   12       0.87
## 9      b    7       0.69
## 10     b   12       0.88

Create a vector called vars with the names of the input variables (as strings).

# Create a vector of variable names
(vars <- c("color", "size"))
## [1] "color" "size"

Use designTreatmentsZ() to create a treatment plan for the variables in vars. Assign it to the variable treatplan.

# Create the treatment plan
treatplan <- designTreatmentsZ(dframe, vars)
## [1] "vtreat 1.6.2 inspecting inputs Sun Jan 31 11:51:14 2021"
## [1] "designing treatments Sun Jan 31 11:51:14 2021"
## [1] " have initial level statistics Sun Jan 31 11:51:14 2021"
## [1] " scoring treatments Sun Jan 31 11:51:14 2021"
## [1] "have treatment plan Sun Jan 31 11:51:14 2021"

Get and examine the scoreFrame from the treatment plan to see the mapping from old variables to new variables.

# Examine the scoreFrame
(scoreFrame <- treatplan %>%
    use_series(scoreFrame))
##         varName varMoves rsq sig needsSplit extraModelDegrees origName  code
## 1    color_catP     TRUE   0   1       TRUE                 2    color  catP
## 2          size     TRUE   0   1      FALSE                 0     size clean
## 3 color_lev_x_b     TRUE   0   1      FALSE                 0    color   lev
## 4 color_lev_x_g     TRUE   0   1      FALSE                 0    color   lev
## 5 color_lev_x_r     TRUE   0   1      FALSE                 0    color   lev

You only need the columns varName, origName and code.

What are the names of the new indicator variables? Of the continuous variable?

Create a vector newvars that contains the variable varName where code is either clean or lev. Print it.

# We only want the rows with codes "clean" or "lev"
(newvars <- scoreFrame %>%
    filter(code %in% c("clean", "lev")) %>%
    use_series(varName))
## [1] "size"          "color_lev_x_b" "color_lev_x_g" "color_lev_x_r"

Use prepare() to create a new data frame dframe.treat that is a one-hot-encoded version of dframe (without the outcome column).

# Create the treated training data
(dframe.treat <- prepare(treatplan, dframe, varRestriction = newvars))
##    size color_lev_x_b color_lev_x_g color_lev_x_r
## 1    13             1             0             0
## 2    11             0             0             1
## 3    15             0             0             1
## 4    14             0             0             1
## 5    13             0             0             1
## 6    11             1             0             0
## 7     9             0             0             1
## 8    12             0             1             0
## 9     7             1             0             0
## 10   12             1             0             0

Print it and compare to dframe.

dframe
##    color size popularity
## 1      b   13       1.07
## 2      r   11       1.39
## 3      r   15       0.92
## 4      r   14       1.20
## 5      r   13       1.08
## 6      b   11       0.80
## 7      r    9       1.10
## 8      g   12       0.87
## 9      b    7       0.69
## 10     b   12       0.88

18.- Unsupervised learning in R

rm(list = ls())

LOADING LIBRARIES

library(dplyr)
library(tidyverse)
library(magrittr)
library(broom)
library(ggplot2)
library(readxl)
library(gdata)
library(jsonlite)
library(haven)
library(foreign)
library(sigr)
library(WVPlots)
library(vtreat)
library(Sleuth3)
library(stringr)
library(tidyr)
library(ranger)
library(ggthemes)
library(dslabs)
library(RColorBrewer)
library(dummies)
## Warning: package 'dummies' was built under R version 4.0.3
## dummies-1.5.6 provided by Decision Patterns
library(dendextend)
## Warning: package 'dendextend' was built under R version 4.0.3
## 
## ---------------------
## Welcome to dendextend version 1.14.0
## Type citation('dendextend') for how to cite the package.
## 
## Type browseVignettes(package = 'dendextend') for the package vignette.
## The github page is: https://github.com/talgalili/dendextend/
## 
## Suggestions and bug-reports can be submitted at: https://github.com/talgalili/dendextend/issues
## Or contact: <tal.galili@gmail.com>
## 
##  To suppress this message use:  suppressPackageStartupMessages(library(dendextend))
## ---------------------
## 
## Attaching package: 'dendextend'
## The following object is masked from 'package:data.table':
## 
##     set
## The following object is masked from 'package:stats':
## 
##     cutree
library(purrr)
library(cluster)
## Warning: package 'cluster' was built under R version 4.0.3
library(tibble)
library(base)
library(COUNT)
library(assertive)
library(zeallot)
library(mgcv)
library(utils)
library(openxlsx)
library(lubridate)
library(visdat)
library(stringdist)
library(fuzzyjoin)
library(reclin)
library(anytime)
library(ggridges)

library(microbenchmark)

library(fasttime)

Introduction to k-means clustering

k-means clustering

We have created some two-dimensional data and stored it in a variable called x in your workspace. The scatter plot on the right is a visual representation of the data.

In this exercise, your task is to create a k-means model of the x data using 3 clusters, then to look at the structure of the resulting model using the summary() function.

EXERCISE

Fit a k-means model to x using 3 centers and run the k-means algorithm 20 times. Store the result in km.out. Inspect the result with the summary() function.

To extract the data from DataCamp, dput(x[,c(2)]) was used.

x <- read.csv("./Data/x4.csv")
x <- x[,c(-1)]
colnames(x)[1] <- "1"
colnames(x)[2] <- "2"
# Create the k-means model: km.out
km.out <- kmeans(x, 3, nstart = 20)
# Inspect the result
summary(km.out)
##              Length Class  Mode   
## cluster      300    -none- numeric
## centers        6    -none- numeric
## totss          1    -none- numeric
## withinss       3    -none- numeric
## tot.withinss   1    -none- numeric
## betweenss      1    -none- numeric
## size           3    -none- numeric
## iter           1    -none- numeric
## ifault         1    -none- numeric

Results of kmeans()

The kmeans() function produces several outputs. In the video, we discussed one output of modeling, the cluster membership.

In this exercise, you will access the cluster component directly. This is useful anytime you need the cluster membership for each observation of the data used to build the clustering model. A future exercise will show an example of how this cluster membership might be used to help communicate the results of k-means modeling.

k-means models also have a print method to give a human friendly output of basic modeling results. This is available by using print() or simply typing the name of the model.

EXERCISES:

The k-means model you built in the last exercise, km.out, is still available in your workspace.

Print a list of the cluster membership to the console. Use a print method to print out the km.out model.

# Print the cluster membership component of the model
km.out$cluster
##   [1] 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1
##  [38] 1 1 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1
##  [75] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3
## [112] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [149] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [186] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [223] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 1 1 2 1 1 1 1
## [260] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 2 1 1 1 1 1 1 2 1 1 1 1 1 1 2 1 1
## [297] 1 2 1 1
# Print the km.out object
km.out
## K-means clustering with 3 clusters of sizes 52, 98, 150
## 
## Cluster means:
##            1           2
## 1  0.6642455 -0.09132968
## 2  2.2171113  2.05110690
## 3 -5.0556758  1.96991743
## 
## Clustering vector:
##   [1] 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1
##  [38] 1 1 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1
##  [75] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3
## [112] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [149] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [186] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [223] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 1 1 2 1 1 1 1
## [260] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 2 1 1 1 1 1 1 2 1 1 1 1 1 1 2 1 1
## [297] 1 2 1 1
## 
## Within cluster sum of squares by cluster:
## [1]  95.50625 148.64781 295.16925
##  (between_SS / total_SS =  87.2 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

Visualizing and interpreting results of kmeans()

One of the more intuitive ways to interpret the results of k-means models is by plotting the data as a scatter plot and using color to label the samples’ cluster membership. In this exercise, you will use the standard plot() function to accomplish this.

To create a scatter plot, you can pass data with two features (i.e. columns) to plot() with an extra argument col = km.out$cluster, which sets the color of each point in the scatter plot according to its cluster membership.

EXERCISES:

x and km.out are available in your workspace. Using the plot() function to create a scatter plot of data x:

Color the dots on the scatterplot by setting the col argument to the cluster component in km.out. Title the plot “k-means with 3 clusters” using the main argument to plot(). Ensure there are no axis labels by specifying "" for both the xlab and ylab arguments to plot().

# Scatter plot of x
plot(x, 
  col = km.out$cluster,
  main = "k-means with 3 clusters",
  xlab = "",
  ylab = "")

How k-means works and practical matters

Handling random algorithms

In the video, you saw how kmeans() randomly initializes the centers of clusters. This random initialization can result in assigning observations to different cluster labels. Also, the random initialization can result in finding different local minima for the k-means algorithm. This exercise will demonstrate both results.

At the top of each plot, the measure of model quality (the total within-cluster sum of squares error) will be plotted. Look for the model(s) with the lowest error to find the models with the best results.

Because kmeans() initializes observations to random clusters, it is important to set the random number generator seed for reproducibility.

EXERCISE:

The data, x, is still available in your workspace. Your task is to generate six kmeans() models on the data, plotting the results of each, in order to see the impact of random initializations on model results.

Set the random number seed to 1 with set.seed().

# Set up 2 x 3 plotting grid
par(mfrow = c(2, 3))

# Set seed
set.seed(1)

For each iteration of the for loop, run kmeans() on x. Assume the number of clusters is 3 and number of starts (nstart) is 1. Visualize the cluster memberships using the col argument to plot(). Observe how the measure of quality and cluster assignments vary among the six model runs.

for(i in 1:6) {
  # Run kmeans() on x with three clusters and one start
  km.out <- kmeans(x, 3, nstart = 1)
  
  # Plot clusters
  plot(x, col = km.out$cluster, 
       main = km.out$tot.withinss, 
       xlab = "", ylab = "")
}

Selecting number of clusters

The k-means algorithm assumes the number of clusters as part of the input. If you know the number of clusters in advance (e.g. due to certain business constraints) this makes setting the number of clusters easy. However, as you saw in the video, if you do not know the number of clusters and need to determine it, you will need to run the algorithm multiple times, each time with a different number of clusters. From this, you can observe how a measure of model quality changes with the number of clusters.

In this exercise, you will run kmeans() multiple times to see how model quality changes as the number of clusters changes. Plots displaying this information help to determine the number of clusters and are often referred to as scree plots.

The ideal plot will have an elbow where the quality measure improves more slowly as the number of clusters increases. This indicates that the quality of the model is no longer improving substantially as the model complexity (i.e. number of clusters) increases. In other words, the elbow indicates the number of clusters inherent in the data.

EXERCISE:

The data, x, is still available in your workspace.

Build 15 kmeans() models on x, each with a different number of clusters (ranging from 1 to 15). Set nstart = 20 for all model runs and save the total within cluster sum of squares for each model to the ith element of wss.

# Initialize total within sum of squares error: wss
wss <- 0

# For 1 to 15 cluster centers
for (i in 1:15) {
  km.out <- kmeans(x, centers = i, nstart  = 20)
  # Save total within sum of squares to wss variable
  wss[i] <- km.out$tot.withinss
}

Run the code provided to create a scree plot of the wss for all 15 models.

# Plot total within sum of squares vs. number of clusters
plot(1:15, wss, type = "b", 
     xlab = "Number of Clusters", 
     ylab = "Within groups sum of squares")

Take a look at your scree plot. How many clusters are inherent in the data? Set k equal to the number of clusters at the location of the elbow.

# Set k equal to the number of clusters corresponding to the elbow location
k <- 2

Introduction to the Pokemon data

Practical matters: working with real data

Dealing with real data is often more challenging than dealing with synthetic data. Synthetic data helps with learning new concepts and techniques, but the next few exercises will deal with data that is closer to the type of real data you might find in your professional or academic pursuits.

The first challenge with the Pokemon data is that there is no pre-determined number of clusters. You will determine the appropriate number of clusters, keeping in mind that in real data the elbow in the scree plot might be less of a sharp elbow than in synthetic data. Use your judgement on making the determination of the number of clusters.

The second part of this exercise includes plotting the outcomes of the clustering on two dimensions, or features, of the data. These features were chosen somewhat arbitrarily for this exercise. Think about how you would use plotting and clustering to communicate interesting groups of Pokemon to other people.

An additional note: this exercise utilizes the iter.max argument to kmeans(). As you’ve seen, kmeans() is an iterative algorithm, repeating over and over until some stopping criterion is reached. The default number of iterations for kmeans() is 10, which is not enough for the algorithm to converge and reach its stopping criterion, so we’ll set the number of iterations to 50 to overcome this issue. To see what happens when kmeans() does not converge, try running the example with a lower number of iterations (e.g. 3). This is another example of what might happen when you encounter real data and use real cases.

pokemon <- read.csv("./Data/pokemon.csv")

pokemon <- pokemon[,c(6:11)]
names(pokemon)
## [1] "HitPoints"      "Attack"         "Defense"        "SpecialAttack" 
## [5] "SpecialDefense" "Speed"

EXERCISE:

The pokemon dataset, which contains observations of 800 Pokemon characters on 6 dimensions (i.e. features), is available in your workspace.

Using kmeans() with nstart = 20, determine the total within sum of square errors for different numbers of clusters (between 1 and 15).

# Initialize total within sum of squares error: wss
wss <- 0

# Look over 1 to 15 possible clusters
for (i in 1:15) {
  # Fit the model: km.out
  km.out <- kmeans(pokemon, centers = i, nstart = 20, iter.max = 50)
  # Save the within cluster sum of squares
  wss[i] <- km.out$tot.withinss
}

Pick an appropriate number of clusters based on these results from the first instruction and assign that number to k.

# Produce a scree plot
plot(1:15, wss, type = "b", 
     xlab = "Number of Clusters", 
     ylab = "Within groups sum of squares")

# Select number of clusters
k <- 4

Create a k-means model using k clusters and assign it to the km.pokemon variable.

# Build model with k clusters: km.out
km.pokemon <- kmeans(pokemon, centers = k, nstart = 20, iter.max = 50)

Create a scatter plot of Defense vs. Speed, showing cluster membership for each observation.

# Plot of Defense vs. Speed by cluster membership
plot(pokemon[, c("Defense", "Speed")],
     col = km.pokemon$cluster,
     main = paste("k-means clustering of Pokemon with", k, "clusters"),
     xlab = "Defense", ylab = "Speed")

Hierarchical clustering

Hierarchical clustering with results

In this exercise, you will create your first hierarchical clustering model using the hclust() function.

We have created some data that has two dimensions and placed it in a variable called x. Your task is to create a hierarchical clustering model of x. Remember from the video that the first step to hierarchical clustering is determining the similarity between observations, which you will do with the dist() function.

You will look at the structure of the resulting model using the summary() function.

EXERCISES:

Fit a hierarchical clustering model to x using the hclust() function. Store the result in hclust.out.

# Load the two-dimensional data (stored here as x2)
x2 <- read.csv("./Data/x2.csv")
x2 <- x2[, c(2, 3)]  # keep only the two coordinate columns
# Create hierarchical clustering model: hclust.out
hclust.out <- hclust(dist(x2))

Inspect the result with the summary() function.

# Inspect the result
summary(hclust.out)
##             Length Class  Mode     
## merge       98     -none- numeric  
## height      49     -none- numeric  
## order       50     -none- numeric  
## labels       0     -none- NULL     
## method       1     -none- character
## call         2     -none- call     
## dist.method  1     -none- character
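
summary() only lists the components stored in the hclust object. To see the tree itself, which is what you will cut in the next exercise, you can plot the dendrogram; a minimal sketch:

# Plot the dendrogram of the hierarchical model
plot(hclust.out)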

Selecting number of clusters

Cutting the tree

Remember from the video that cutree() is the R function that cuts a hierarchical model. The h and k arguments to cutree() allow you to cut the tree based on a certain height h or a certain number of clusters k.

In this exercise, you will use cutree() to cut the hierarchical model you created earlier based on each of these two criteria.

EXERCISE:

The hclust.out model you created earlier is available in your workspace.

Cut the hclust.out model at height 7.

# Cut by height
cutree(hclust.out, h = 7)
##  [1] 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 3 3 3 3 3 3 3 2 2 2
## [39] 2 2 2 2 2 2 2 2 2 2 2 2

Cut the hclust.out model to create 3 clusters.

# Cut by number of clusters
cutree(hclust.out, k = 3)
##  [1] 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 3 3 3 3 3 3 3 2 2 2
## [39] 2 2 2 2 2 2 2 2 2 2 2 2
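
Notice that the two cuts above produce identical assignments for this model, because cutting at height 7 happens to leave exactly three branches. A quick cross-tabulation (a sketch) confirms it:

# Cross-tabulate the height-based and k-based cuts
table(h7 = cutree(hclust.out, h = 7), k3 = cutree(hclust.out, k = 3))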

Clustering linkage and practical matters

Linkage methods

In this exercise, you will produce hierarchical clustering models using different linkages and plot the dendrogram for each, observing the overall structure of the trees.

You’ll be asked to interpret the results in the next exercise.

EXERCISE:

Produce three hierarchical clustering models on x using the “complete”, “average”, and “single” linkage methods, respectively.

# Cluster using complete linkage: hclust.complete
hclust.complete <- hclust(dist(x), method = "complete")

# Cluster using average linkage: hclust.average
hclust.average <- hclust(dist(x), method = "average")

# Cluster using single linkage: hclust.single
hclust.single <- hclust(dist(x), method = "single")

Plot a dendrogram for each model, using titles of “Complete”, “Average”, and “Single”, respectively.

# Plot dendrogram of hclust.complete
plot(hclust.complete, main = "Complete")

# Plot dendrogram of hclust.average
plot(hclust.average, main = "Average")

# Plot dendrogram of hclust.single
plot(hclust.single, main = "Single")

Practical matters: scaling

Recall from the video that clustering real data may require scaling the features if they have different distributions. So far in this chapter, you have been working with synthetic data that did not need scaling.

In this exercise, you will go back to working with “real” data, the pokemon dataset introduced in the first chapter. You will observe the distribution (mean and standard deviation) of each feature, scale the data accordingly, then produce a hierarchical clustering model using the complete linkage method.

EXERCISE:

The data is stored in the pokemon object in your workspace.

Observe the mean of each variable in pokemon using the colMeans() function.

# View column means
colMeans(pokemon)
##      HitPoints         Attack        Defense  SpecialAttack SpecialDefense 
##       69.25875       79.00125       73.84250       72.82000       71.90250 
##          Speed 
##       68.27750

Observe the standard deviation of each variable using the apply() and sd() functions. Since the variables are the columns of your matrix, make sure to specify 2 as the MARGIN argument to apply().

# View column standard deviations
apply(pokemon, 2, sd)
##      HitPoints         Attack        Defense  SpecialAttack SpecialDefense 
##       25.53467       32.45737       31.18350       32.72229       27.82892 
##          Speed 
##       29.06047

Scale the pokemon data using the scale() function and store the result in pokemon.scaled.

# Scale the data
pokemon.scaled <- scale(pokemon)
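
As a quick sanity check (a sketch, not part of the original exercise), the scaled columns should now have means of roughly 0 and standard deviations of exactly 1:

# Verify the scaling: means ~ 0, standard deviations = 1
round(colMeans(pokemon.scaled), 10)
apply(pokemon.scaled, 2, sd)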

Create a hierarchical clustering model of the pokemon.scaled data using the complete linkage method. Manually specify the method argument and store the result in hclust.pokemon.

# Create hierarchical clustering model: hclust.pokemon
hclust.pokemon <- hclust(dist(pokemon.scaled), method = "complete")

Comparing kmeans() and hclust()

Comparing k-means and hierarchical clustering, you’ll see the two methods produce different cluster memberships. This is because the two algorithms make different assumptions about how the data is generated. In a more advanced course, we could choose to use one model over another based on the quality of the models’ assumptions, but for now, it’s enough to observe that they are different.

This exercise will have you compare results from the two models on the pokemon dataset to see how they differ.

EXERCISE:

The results from running k-means clustering on the pokemon data (with the k = 4 clusters you selected above) are stored as km.pokemon. The hierarchical clustering model you created in the previous exercise is still available as hclust.pokemon.

Using cutree() on hclust.pokemon, assign cluster membership to each observation. Assume three clusters and assign the result to a vector called cut.pokemon.

# Apply cutree() to hclust.pokemon: cut.pokemon
cut.pokemon <- cutree(hclust.pokemon, k = 3)

Using table(), compare cluster membership between the two clustering methods. Recall that the different components of k-means model objects can be accessed with the $ operator.

# Compare methods
table(km.pokemon$cluster, cut.pokemon)
##    cut.pokemon
##       1   2   3
##   1 114   0   1
##   2 114   0   0
##   3 283   0   0
##   4 277  11   0

Dimensionality reduction with PCA

PCA using prcomp()

In this exercise, you will create your first PCA model and observe the diagnostic results.

We have loaded the Pokemon data from earlier, which has six dimensions, and placed it in a variable called pokemon. Your task is to create a PCA model of the data, then to inspect the resulting model using the summary() function.

EXERCISES:

Create a PCA model of the data in pokemon, setting scale to TRUE. Store the result in pr.out.

# Perform scaled PCA: pr.out
pr.out <- prcomp(x = pokemon, scale = TRUE)

Inspect the result with the summary() function.

# Inspect model output
summary(pr.out)
## Importance of components:
##                           PC1    PC2    PC3    PC4     PC5     PC6
## Standard deviation     1.6466 1.0457 0.8825 0.8489 0.65463 0.51681
## Proportion of Variance 0.4519 0.1822 0.1298 0.1201 0.07142 0.04451
## Cumulative Proportion  0.4519 0.6342 0.7640 0.8841 0.95549 1.00000
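
Beyond the variance summary, it is often useful to see which original variables drive each component. The loadings are stored in the rotation element of the prcomp object; a small sketch:

# Inspect the loadings (variable contributions) of the first two components
pr.out$rotation[, 1:2]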

Variance explained

The second common plot type for understanding PCA models is a scree plot. A scree plot shows the variance explained as the number of principal components increases. Sometimes the cumulative variance explained is plotted as well.

In this and the next exercise, you will prepare data from the pr.out model you created at the beginning of the chapter for use in a scree plot. Preparing the data for plotting is required because there is not a built-in function in R to create this type of plot.

EXERCISE:

pr.out and the pokemon data are still available in your workspace.

Assign to the variable pr.var the square of the standard deviations of the principal components (i.e. the variance). The standard deviation of the principal components is available in the sdev component of the PCA model object.

# Variability of each principal component: pr.var
pr.var <- pr.out$sdev^2

Assign to the variable pve the proportion of the variance explained, calculated by dividing pr.var by the total variance explained by all principal components.

# Variance explained by each principal component: pve
pve <- pr.var / sum(pr.var)

Visualize variance explained

Now you will create a scree plot showing the proportion of variance explained by each principal component, as well as the cumulative proportion of variance explained.

Recall from the video that these plots can help to determine the number of principal components to retain. One way to determine the number of principal components to retain is by looking for an elbow in the scree plot showing that as the number of principal components increases, the rate at which variance is explained decreases substantially. In the absence of a clear elbow, you can use the scree plot as a guide for setting a threshold.

EXERCISES:

The proportion of variance explained is still available in the pve object you created in the last exercise.

head(pve)
## [1] 0.45190665 0.18225358 0.12979086 0.12011089 0.07142337 0.04451466
summary(pve)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.04451 0.08360 0.12495 0.16667 0.16914 0.45191

Use plot() to plot the proportion of variance explained by each principal component.

# Plot variance explained for each principal component
plot(pve, xlab = "Principal Component",
     ylab = "Proportion of Variance Explained",
     ylim = c(0, 1), type = "b")

Use plot() and cumsum() (cumulative sum) to plot the cumulative proportion of variance explained as a function of the number of principal components.

# Plot cumulative proportion of variance explained
plot(cumsum(pve), xlab = "Principal Component",
     ylab = "Cumulative Proportion of Variance Explained",
     ylim = c(0, 1), type = "b")
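
If you prefer the threshold approach mentioned above to reading off an elbow, you can compute the smallest number of components that reaches a chosen cumulative proportion directly. A sketch, assuming a 90% threshold:

# Smallest number of principal components explaining at least 90% of the variance
which(cumsum(pve) >= 0.90)[1]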

Practical issues with PCA

Practical issues: scaling

You saw in the video that scaling your data before doing PCA changes the results of the PCA modeling. Here, you will perform PCA with and without scaling, then visualize the results using biplots.

Sometimes scaling is appropriate when the variances of the variables are substantially different. This is commonly the case when variables have different units of measurement, for example, degrees Fahrenheit (temperature) and miles (distance). Making the decision to use scaling is an important step in performing a principal component analysis.

EXERCISES:

The same Pokemon dataset is available in your workspace as pokemon. (The original DataCamp exercise adds a new Total variable at this point; the data used here contains only the six base stats, as the output below shows.)

There is some code at the top of the editor to calculate the mean and standard deviation of each variable in the model. Run this code to see how the scale of the variables differs in the original data.

# Mean of each variable
colMeans(pokemon)
##      HitPoints         Attack        Defense  SpecialAttack SpecialDefense 
##       69.25875       79.00125       73.84250       72.82000       71.90250 
##          Speed 
##       68.27750

# Standard deviation of each variable
apply(pokemon, 2, sd)
##      HitPoints         Attack        Defense  SpecialAttack SpecialDefense 
##       25.53467       32.45737       31.18350       32.72229       27.82892 
##          Speed 
##       29.06047

Create a PCA model of pokemon with scaling, assigning the result to pr.with.scaling.

# PCA model with scaling: pr.with.scaling
pr.with.scaling <- prcomp(x = pokemon, scale = TRUE)

Create a PCA model of pokemon without scaling, assigning the result to pr.without.scaling.

# PCA model without scaling: pr.without.scaling
pr.without.scaling <- prcomp(x = pokemon, scale = FALSE)

Use biplot() to plot both models (one at a time) and compare their outputs.

biplot(pr.with.scaling)

biplot(pr.without.scaling)
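
To see why the two biplots differ, it helps to look at the raw variances: in an unscaled PCA the variable with the largest variance dominates the first component. A quick check (a sketch):

# Raw variance of each variable; the largest dominates the unscaled PCA
apply(pokemon, 2, var)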

Putting it all together with a case study

Preparing the data

Unlike prior chapters, where we prepared the data for you for unsupervised learning, the goal of this chapter is to step you through a more realistic and complete workflow.

Recall from the video that the first step is to download and prepare the data.

EXERCISES:

Use the read.csv() function to read the CSV (comma-separated values) file containing the data from the URL provided. Assign the result to wisc.df.

url <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1903/datasets/WisconsinCancer.csv"

# Download the data: wisc.df
wisc.df <- read.csv(url)

Use as.matrix() to convert the features of the data (in columns 3 through 32) to a matrix. Store this in a variable called wisc.data.

# Convert the features of the data: wisc.data
wisc.data <- as.matrix(wisc.df[,c(3:32)])

Assign the row names of wisc.data the values currently contained in the id column of wisc.df. While not strictly required, this will help you keep track of the different observations throughout the modeling process.

# Set the row names of wisc.data
row.names(wisc.data) <- wisc.df$id

Finally, set a vector called diagnosis to be 1 if a diagnosis is malignant (“M”) and 0 otherwise. Note that R coerces TRUE to 1 and FALSE to 0.

# Create diagnosis vector
diagnosis <- as.numeric(wisc.df$diagnosis == "M")

Exploratory data analysis

The first step of any data analysis, unsupervised or supervised, is to familiarize yourself with the data.

The variables you created before, wisc.data and diagnosis, are still available in your workspace. Explore the data to answer the following questions:

How many observations are in this dataset? 569

str(wisc.data)
##  num [1:569, 1:30] 18 20.6 19.7 11.4 20.3 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:569] "842302" "842517" "84300903" "84348301" ...
##   ..$ : chr [1:30] "radius_mean" "texture_mean" "perimeter_mean" "area_mean" ...

How many variables/features in the data are suffixed with _mean? 10

colnames(wisc.data)
##  [1] "radius_mean"             "texture_mean"           
##  [3] "perimeter_mean"          "area_mean"              
##  [5] "smoothness_mean"         "compactness_mean"       
##  [7] "concavity_mean"          "concave.points_mean"    
##  [9] "symmetry_mean"           "fractal_dimension_mean" 
## [11] "radius_se"               "texture_se"             
## [13] "perimeter_se"            "area_se"                
## [15] "smoothness_se"           "compactness_se"         
## [17] "concavity_se"            "concave.points_se"      
## [19] "symmetry_se"             "fractal_dimension_se"   
## [21] "radius_worst"            "texture_worst"          
## [23] "perimeter_worst"         "area_worst"             
## [25] "smoothness_worst"        "compactness_worst"      
## [27] "concavity_worst"         "concave.points_worst"   
## [29] "symmetry_worst"          "fractal_dimension_worst"
str_match(colnames(wisc.data), "_mean")
##       [,1]   
##  [1,] "_mean"
##  [2,] "_mean"
##  [3,] "_mean"
##  [4,] "_mean"
##  [5,] "_mean"
##  [6,] "_mean"
##  [7,] "_mean"
##  [8,] "_mean"
##  [9,] "_mean"
## [10,] "_mean"
## [11,] NA     
## [12,] NA     
## [13,] NA     
## [14,] NA     
## [15,] NA     
## [16,] NA     
## [17,] NA     
## [18,] NA     
## [19,] NA     
## [20,] NA     
## [21,] NA     
## [22,] NA     
## [23,] NA     
## [24,] NA     
## [25,] NA     
## [26,] NA     
## [27,] NA     
## [28,] NA     
## [29,] NA     
## [30,] NA
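
Note that str_match() comes from the stringr package. If you only need the count, a shorter base-R alternative (a sketch) is:

# Count the features whose names contain "_mean"
sum(grepl("_mean", colnames(wisc.data)))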

How many of the observations have a malignant diagnosis? 212

table(diagnosis)
## diagnosis
##   0   1 
## 357 212

Performing PCA

The next step in your analysis is to perform PCA on wisc.data.

You saw in the last chapter that it’s important to check if the data need to be scaled before performing PCA. Recall two common reasons for scaling data:

The input variables use different units of measurement.
The input variables have significantly different variances.

EXERCISES:

The variables you created before, wisc.data and diagnosis, are still available in your workspace.

Check the mean and standard deviation of the features of the data to determine if the data should be scaled. Use the colMeans() and apply() functions like you’ve done before.

# Check column means and standard deviations
colMeans(wisc.data)
##             radius_mean            texture_mean          perimeter_mean 
##            1.412729e+01            1.928965e+01            9.196903e+01 
##               area_mean         smoothness_mean        compactness_mean 
##            6.548891e+02            9.636028e-02            1.043410e-01 
##          concavity_mean     concave.points_mean           symmetry_mean 
##            8.879932e-02            4.891915e-02            1.811619e-01 
##  fractal_dimension_mean               radius_se              texture_se 
##            6.279761e-02            4.051721e-01            1.216853e+00 
##            perimeter_se                 area_se           smoothness_se 
##            2.866059e+00            4.033708e+01            7.040979e-03 
##          compactness_se            concavity_se       concave.points_se 
##            2.547814e-02            3.189372e-02            1.179614e-02 
##             symmetry_se    fractal_dimension_se            radius_worst 
##            2.054230e-02            3.794904e-03            1.626919e+01 
##           texture_worst         perimeter_worst              area_worst 
##            2.567722e+01            1.072612e+02            8.805831e+02 
##        smoothness_worst       compactness_worst         concavity_worst 
##            1.323686e-01            2.542650e-01            2.721885e-01 
##    concave.points_worst          symmetry_worst fractal_dimension_worst 
##            1.146062e-01            2.900756e-01            8.394582e-02
apply(wisc.data, 2, sd)
##             radius_mean            texture_mean          perimeter_mean 
##            3.524049e+00            4.301036e+00            2.429898e+01 
##               area_mean         smoothness_mean        compactness_mean 
##            3.519141e+02            1.406413e-02            5.281276e-02 
##          concavity_mean     concave.points_mean           symmetry_mean 
##            7.971981e-02            3.880284e-02            2.741428e-02 
##  fractal_dimension_mean               radius_se              texture_se 
##            7.060363e-03            2.773127e-01            5.516484e-01 
##            perimeter_se                 area_se           smoothness_se 
##            2.021855e+00            4.549101e+01            3.002518e-03 
##          compactness_se            concavity_se       concave.points_se 
##            1.790818e-02            3.018606e-02            6.170285e-03 
##             symmetry_se    fractal_dimension_se            radius_worst 
##            8.266372e-03            2.646071e-03            4.833242e+00 
##           texture_worst         perimeter_worst              area_worst 
##            6.146258e+00            3.360254e+01            5.693570e+02 
##        smoothness_worst       compactness_worst         concavity_worst 
##            2.283243e-02            1.573365e-01            2.086243e-01 
##    concave.points_worst          symmetry_worst fractal_dimension_worst 
##            6.573234e-02            6.186747e-02            1.806127e-02

Execute PCA on the wisc.data, scaling if appropriate, and assign the model to wisc.pr.

# Execute PCA, scaling if appropriate: wisc.pr
wisc.pr <- prcomp(wisc.data, scale = TRUE)

Inspect a summary of the results with the summary() function.

# Look at summary of results
summary(wisc.pr)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     3.6444 2.3857 1.67867 1.40735 1.28403 1.09880 0.82172
## Proportion of Variance 0.4427 0.1897 0.09393 0.06602 0.05496 0.04025 0.02251
## Cumulative Proportion  0.4427 0.6324 0.72636 0.79239 0.84734 0.88759 0.91010
##                            PC8    PC9    PC10   PC11    PC12    PC13    PC14
## Standard deviation     0.69037 0.6457 0.59219 0.5421 0.51104 0.49128 0.39624
## Proportion of Variance 0.01589 0.0139 0.01169 0.0098 0.00871 0.00805 0.00523
## Cumulative Proportion  0.92598 0.9399 0.95157 0.9614 0.97007 0.97812 0.98335
##                           PC15    PC16    PC17    PC18    PC19    PC20   PC21
## Standard deviation     0.30681 0.28260 0.24372 0.22939 0.22244 0.17652 0.1731
## Proportion of Variance 0.00314 0.00266 0.00198 0.00175 0.00165 0.00104 0.0010
## Cumulative Proportion  0.98649 0.98915 0.99113 0.99288 0.99453 0.99557 0.9966
##                           PC22    PC23   PC24    PC25    PC26    PC27    PC28
## Standard deviation     0.16565 0.15602 0.1344 0.12442 0.09043 0.08307 0.03987
## Proportion of Variance 0.00091 0.00081 0.0006 0.00052 0.00027 0.00023 0.00005
## Cumulative Proportion  0.99749 0.99830 0.9989 0.99942 0.99969 0.99992 0.99997
##                           PC29    PC30
## Standard deviation     0.02736 0.01153
## Proportion of Variance 0.00002 0.00000
## Cumulative Proportion  1.00000 1.00000

Interpreting PCA results

Now you’ll use some visualizations to better understand your PCA model. You were introduced to one of these visualizations, the biplot, in an earlier chapter.

You’ll run into some common challenges with using biplots on real-world data containing a non-trivial number of observations and variables, then you’ll look at some alternative visualizations. You are encouraged to experiment with additional visualizations before moving on to the next exercise.

EXERCISES:

The variables you created before, wisc.data, diagnosis, and wisc.pr, are still available.

Create a biplot of the wisc.pr data. What stands out to you about this plot? Is it easy or difficult to understand? Why?

# Create a biplot of wisc.pr
biplot(wisc.pr)

Execute the code to scatter plot each observation by principal components 1 and 2, coloring the points by the diagnosis.

# Scatter plot observations by components 1 and 2
plot(wisc.pr$x[, c(1, 2)], col = (diagnosis + 1), 
     xlab = "PC1", ylab = "PC2")

Repeat the same for principal components 1 and 3. What do you notice about these plots?

# Repeat for components 1 and 3
plot(wisc.pr$x[, c(1, 3)], col = (diagnosis + 1), 
     xlab = "PC1", ylab = "PC3")

# Do additional data exploration of your choosing below (optional)

plot(wisc.pr$x[, c(2, 3)], col = (diagnosis + 1), 
     xlab = "PC2", ylab = "PC3")

From these charts, the two diagnosis groups overlap less in the PC1 vs. PC2 plot than in the PC1 vs. PC3 plot.

This is expected: all principal components are orthogonal, but PC1 and PC2 capture the largest share of the variance, so they separate the groups most cleanly.

In the PC2 vs. PC3 plot the groups overlap more than in either plot involving PC1.

Variance explained

In this exercise, you will produce scree plots showing the proportion of variance explained as the number of principal components increases. The data from PCA must be prepared for these plots, as there is not a built-in function in R to create them directly from the PCA model.

As you look at these plots, ask yourself if there’s an elbow in the amount of variance explained that might lead you to pick a natural number of principal components. If an obvious elbow does not exist, as is typical in real-world datasets, consider how else you might determine the number of principal components to retain based on the scree plot.

EXERCISES:

The variables you created before, wisc.data, diagnosis, and wisc.pr, are still available.

Calculate the variance of each principal component by squaring the sdev component of wisc.pr. Save the result as an object called pr.var.

# Set up 1 x 2 plotting grid
par(mfrow = c(1, 2))
# Calculate variability of each component
pr.var <- wisc.pr$sdev^2

Calculate the variance explained by each principal component by dividing by the total variance explained of all principal components. Assign this to a variable called pve.

# Variance explained by each principal component: pve
pve <- pr.var / sum(pr.var)

Create a plot of variance explained for each principal component. Using the cumsum() function, create a plot of cumulative proportion of variance explained.

# Plot variance explained for each principal component
plot(pve, xlab = "Principal Component", 
     ylab = "Proportion of Variance Explained", 
     ylim = c(0, 1), type = "b")

# Plot cumulative proportion of variance explained
plot(cumsum(pve), xlab = "Principal Component", 
     ylab = "Cumulative Proportion of Variance Explained", 
     ylim = c(0, 1), type = "b")

### PCA review and next steps

Hierarchical clustering of case data

The goal of this exercise is to do hierarchical clustering of the observations. Recall from Chapter 2 that this type of clustering does not assume in advance the number of natural groups that exist in the data.

As part of the preparation for hierarchical clustering, the distance between all pairs of observations is computed. Furthermore, there are different ways to link clusters together, with single, complete, and average being the most common linkage methods.

EXERCISE:

The variables you created before, wisc.data, diagnosis, wisc.pr, and pve, are available in your workspace.

Scale the wisc.data data and assign the result to data.scaled.

# Scale the wisc.data data: data.scaled

data.scaled <- scale(wisc.data)

Calculate the (Euclidean) distances between all pairs of observations in the new scaled dataset and assign the result to data.dist.

# Calculate the (Euclidean) distances: data.dist
data.dist <- dist(data.scaled)

Create a hierarchical clustering model using complete linkage. Manually specify the method argument to hclust() and assign the results to wisc.hclust.

# Create a hierarchical clustering model: wisc.hclust
wisc.hclust <- hclust(data.dist, method  = "complete")

Results of hierarchical clustering

Let’s use the hierarchical clustering model you just created to determine a height (or distance between clusters) where a certain number of clusters exists. The variables you created before—wisc.data, diagnosis, wisc.pr, pve, and wisc.hclust—are all available in your workspace.

Using the plot() function, what is the height at which the clustering model has 4 clusters?

plot(wisc.hclust)

Looking at the dendrogram, you can see why a cut giving four clusters makes sense: there are two main clusters plus a couple of small ones on the left. It would also help to color the branches by diagnosis to see where a split best separates the two groups. One way to locate the cut is sketched below.
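
One way to eyeball the cut is to redraw the dendrogram and add a horizontal line at a candidate height, moving the line until exactly four branches cross it. A sketch (the height used here is just a starting guess to adjust against your own plot):

# Re-plot the dendrogram and mark a candidate cut height
plot(wisc.hclust)
abline(h = 20, col = "red")  # adjust this height until four branches remain below the line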

Selecting number of clusters

In this exercise, you will compare the outputs from your hierarchical clustering model to the actual diagnoses. Normally when performing unsupervised learning like this, a target variable isn’t available. We do have it with this dataset, however, so it can be used to check the performance of the clustering model.

When performing supervised learning—that is, when you’re trying to predict some target variable of interest and that target variable is available in the original data—using clustering to create new features may or may not improve the performance of the final model. This exercise will help you determine if, in this case, hierarchical clustering provides a promising new feature.

EXERCISES:

wisc.data, diagnosis, wisc.pr, pve, and wisc.hclust are available in your workspace.

Use cutree() to cut the tree so that it has 4 clusters. Assign the output to the variable wisc.hclust.clusters.

# Cut tree so that it has 4 clusters: wisc.hclust.clusters
wisc.hclust.clusters <- cutree(wisc.hclust, k = 4)

Use the table() function to compare the cluster membership to the actual diagnoses.

# Compare cluster membership to actual diagnoses
table(wisc.hclust.clusters, diagnosis)
##                     diagnosis
## wisc.hclust.clusters   0   1
##                    1  12 165
##                    2   2   5
##                    3 343  40
##                    4   0   2
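
For a single rough number, comparable to the counts used with the k-means results later, you can total the observations that fall outside their cluster's majority diagnosis. A sketch:

# Observations outside the majority diagnosis of their cluster
# (from the table above: 12 + 2 + 40 + 0 = 54)
sum(apply(table(wisc.hclust.clusters, diagnosis), 1, min))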

k-means clustering and comparing results

As you now know, there are two main types of clustering: hierarchical and k-means.

In this exercise, you will create a k-means clustering model on the Wisconsin breast cancer data and compare the results to the actual diagnoses and the results of your hierarchical clustering model. Take some time to see how each clustering model performs in terms of separating the two diagnoses and how the clustering models compare to each other.

Exercises:

wisc.data, diagnosis, and wisc.hclust.clusters are still available.

Create a k-means model on wisc.data, assigning the result to wisc.km. Be sure to create 2 clusters, corresponding to the actual number of diagnoses. Also, remember to scale the data and repeat the algorithm 20 times to find a well-performing model.

# Create a k-means model on wisc.data: wisc.km
wisc.km <- kmeans(scale(wisc.data), 2, nstart = 20)

Use the table() function to compare the cluster membership of the k-means model to the actual diagnoses contained in the diagnosis vector. How well does k-means separate the two diagnoses?

# Compare k-means to actual diagnoses
table(wisc.km$cluster, diagnosis)
##    diagnosis
##       0   1
##   1 343  37
##   2  14 175
sum(apply(table(wisc.km$cluster, diagnosis), 1, min))
## [1] 51

Use the table() function to compare the cluster membership of the k-means model to the hierarchical clustering model. Recall the cluster membership of the hierarchical clustering model is contained in wisc.hclust.clusters.

# Compare k-means to hierarchical clustering
table(wisc.hclust.clusters, wisc.km$cluster)
##                     
## wisc.hclust.clusters   1   2
##                    1  17 160
##                    2   0   7
##                    3 363  20
##                    4   0   2
sum(apply(table(wisc.hclust.clusters, wisc.km$cluster), 1, min))
## [1] 37

Clustering on PCA results

In this final exercise, you will put together several steps you used earlier and, in doing so, you will experience some of the creativity that is typical in unsupervised learning.

Recall from earlier exercises that the PCA model required significantly fewer features to describe 80% and 95% of the variability of the data. In addition to normalizing data and potentially avoiding overfitting, PCA also uncorrelates the variables, sometimes improving the performance of other modeling techniques.

Let’s see if PCA improves or degrades the performance of hierarchical clustering.

EXERCISES:

wisc.pr, diagnosis, wisc.hclust.clusters, and wisc.km are still available in your workspace.

Using the minimum number of principal components required to describe at least 90% of the variability in the data, create a hierarchical clustering model with complete linkage. Assign the results to wisc.pr.hclust.

# Create a hierarchical clustering model: wisc.pr.hclust
wisc.pr.hclust <- hclust(dist(wisc.pr$x[, 1:7]), method = "complete")
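
The choice of the first seven components can be checked against the cumulative variance you computed earlier; a quick sketch using the pve vector from the scree-plot exercise (the summary above shows PC7 is the first to pass 90%):

# Smallest number of components reaching 90% cumulative variance (7 for this data)
which(cumsum(pve) >= 0.90)[1]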

Cut this hierarchical clustering model into 4 clusters and assign the results to wisc.pr.hclust.clusters.

# Cut model into 4 clusters: wisc.pr.hclust.clusters

wisc.pr.hclust.clusters <- cutree(wisc.pr.hclust , k = 4)

Using table(), compare the results from your new hierarchical clustering model with the actual diagnoses. How well does the newly created model with four clusters separate out the two diagnoses?

# Compare to actual diagnoses
t <- table(wisc.pr.hclust.clusters, diagnosis)
t
##                        diagnosis
## wisc.pr.hclust.clusters   0   1
##                       1   5 113
##                       2 350  97
##                       3   2   0
##                       4   0   2
sum(apply(t, 1, min))
## [1] 102

How well do the k-means and hierarchical clustering models you created in previous exercises do in terms of separating the diagnoses? Again, use the table() function to compare the output of each model with the vector containing the actual diagnoses.

# Compare to k-means and hierarchical
e <- table(wisc.km$cluster, diagnosis)
e
##    diagnosis
##       0   1
##   1 343  37
##   2  14 175
sum(apply(e, 1, min))
## [1] 51

CONCLUSION:

It looks like the 2-cluster k-means model does the best job of separating the two diagnoses. The whole purpose of this comparison is to see whether clustering results could be useful in a supervised learning process. It might be worth adding the k-means cluster assignments as a feature to a predictive model, then comparing its performance with and without them now that we know how to do that.

19.- Cluster Analysis in R

Calculating distance between observations

Distance between two observations

Calculate & plot the distance between two players

You’ve obtained the coordinates relative to the center of the field for two players in a soccer match and would like to calculate the distance between them.

In this exercise you will plot the positions of the 2 players and manually calculate the distance between them by using the Euclidean distance formula.

EXERCISES:

Plot their positions from the two_players data frame using ggplot.

lineup <- read.csv("./Data/lineup.csv")
two_players <- read.csv("./Data/two_players.csv", sep = ";")
colnames(two_players)[1] <- "x"
colnames(two_players)[2] <- "y"
head(two_players)
##    x  y
## 1  5  4
## 2 15 10
ggplot(two_players, aes(x = x, y = y)) + 
  geom_point() +
  # Assuming a 40x60 field
  lims(x = c(-30,30), y = c(-20, 20))

Extract the positions of the players into two data frames player1 and player2.

# Split the players data frame into two observations
player1 <- two_players[1, ]
player1
##   x y
## 1 5 4
player2 <- two_players[2, ]

Calculate the distance between player1 and player2 by using the Euclidean distance formula

# Calculate and print their distance using the Euclidean Distance formula
player_distance <- sqrt( (player1$x - player2$x)^2 + (player1$y - player2$y)^2 )
player_distance
## [1] 11.6619

Using the dist() function

Using the Euclidean formula manually may be practical for 2 observations but can get more complicated rather quickly when measuring the distance between many observations.

The dist() function simplifies this process by calculating distances between our observations (rows) using their features (columns). In this case the observations are the player positions and the dimensions are their x and y coordinates.

Note: The default distance calculation for the dist() function is Euclidean distance

EXERCISE:

Calculate the distance between two players using the dist() function for the data frame two_players

# Calculate the Distance Between two_players
dist_two_players <- dist(two_players)
dist_two_players
##         1
## 2 11.6619

Calculate the distance between three players for the data frame three_players

three_players <- read.csv("./Data/three_players.csv")
# Calculate the Distance Between three_players

dist_three_players <- dist(three_players)
dist_three_players
##          1        2
## 2 11.70470         
## 3 16.88194 18.05547

The importance of scale

Effects of scale

You have learned that when a variable is on a larger scale than the other variables in your data, it may disproportionately influence the resulting distance calculated between your observations. Let’s see this in action by observing a sample of data from the trees data set.

You will leverage the scale() function which by default centers & scales our column features.

Our variables are the following:

Girth - tree diameter in inches
Height - tree height in inches

EXERCISE:

Calculate the distance matrix for the data frame three_trees and store it as dist_trees

three_trees <- read.csv("./Data/three_trees.csv")
head(three_trees)
##   X    x   y
## 1 1  8.3 840
## 2 2  8.6 780
## 3 3 10.5 864
# Calculate distance for three_trees 
dist_trees <- dist(three_trees)

Create a new variable scaled_three_trees where the three_trees data is centered & scaled

# Scale three_trees
scaled_three_trees <- scale(three_trees)

Calculate and print the distance matrix for scaled_three_trees and store this as dist_scaled_trees

# Calculate the distance on the scaled data
dist_scaled_trees <- dist(scaled_three_trees)

Output both dist_trees and dist_scaled_trees matrices and observe the change of which observations have the smallest distance between the two matrices (hint: they have changed)

# Output the results of both Matrices
print('Without Scaling')
## [1] "Without Scaling"
dist_trees
##          1        2
## 2 60.00908         
## 3 24.18347 84.02744
print('With Scaling')
## [1] "With Scaling"
dist_scaled_trees
##          1        2
## 2 1.728094         
## 3 2.776357 2.702874
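
The reason the closest pair changes is visible in the raw spread of the two features: y varies by tens of units while x varies by only a couple, so y dominates the unscaled distances. (Note that three_trees also carries the row-index column X from read.csv, which strictly should be dropped before calling dist(), although it barely changes the result here.) A quick check, as a sketch:

# Compare the spread of the two features
apply(three_trees[, c("x", "y")], 2, sd)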

Measuring distance for categorical data

Calculating distance between categorical variables

In this exercise you will explore how to calculate binary (Jaccard) distances. In order to calculate these distances, we will first have to dummify our categories using the dummy.data.frame() function from the dummies library.

You will use a small collection of survey observations stored in the data frame job_survey with the following columns:

job_satisfaction - Possible options: “Hi”, “Mid”, “Low”
is_happy - Possible options: “Yes”, “No”

job_survey <- read.csv("./Data/job_survey.csv", sep = ";", col.names = c("job_satisfaction", "is_happy"))
head(job_survey)
##   job_satisfaction is_happy
## 1              Low      No 
## 2              Low      No 
## 3               Hi      Yes
## 4              Low      No 
## 5              Mid      No

EXERCISES:

Create a dummified data frame dummy_survey

# Dummify the Survey Data
dummy_survey <- dummy.data.frame(job_survey)
## Warning in model.matrix.default(~x - 1, model.frame(~x - 1), contrasts = FALSE):
## non-list contrasts argument ignored

## Warning in model.matrix.default(~x - 1, model.frame(~x - 1), contrasts = FALSE):
## non-list contrasts argument ignored

Generate a Jaccard distance matrix for the dummified survey data dist_survey using the dist() function using the parameter method = ‘binary’

# Calculate the Distance
dist_survey <- dist(dummy_survey, method = "binary")

Print the original data and the distance matrix. Note the observations with a distance of 0 in the original data (rows 1, 2, and 4 are identical).

# Print the Original Data
job_survey
##   job_satisfaction is_happy
## 1              Low      No 
## 2              Low      No 
## 3               Hi      Yes
## 4              Low      No 
## 5              Mid      No
# Print the Distance Matrix
dist_survey
##           1         2         3         4
## 2 0.0000000                              
## 3 1.0000000 1.0000000                    
## 4 0.0000000 0.0000000 1.0000000          
## 5 0.6666667 0.6666667 1.0000000 0.6666667
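
If dummy.data.frame() is unavailable on your system (the dummies package is no longer actively maintained), a base-R sketch that builds the same one-hot indicator columns by hand and reproduces the Jaccard distances is:

# Build one 0/1 indicator column per category level without the dummies package
dummy_survey_base <- cbind(
  sapply(unique(job_survey$job_satisfaction), function(l) as.integer(job_survey$job_satisfaction == l)),
  sapply(unique(job_survey$is_happy),         function(l) as.integer(job_survey$is_happy == l))
)
dist(dummy_survey_base, method = "binary")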

Hierarchical clustering

Comparing more than two observations

Calculating linkage

Let us revisit the example with three players on a field. The distance matrix between these three players is shown below and is available as the variable dist_players.

From this we can tell that the first group that forms is between players 1 & 2, since they are the closest to one another with a Euclidean distance value of 11.

Now you want to apply the three linkage methods you have learned to determine what the distance of this group is to player 3.

dist_players <- read.csv("./Data/dist_players.csv", sep = ";")
colnames(dist_players)[1] <- "1"
colnames(dist_players)[2] <- "2"
head(dist_players)
##    1  2
## 1 11  0
## 2 16 18

EXERCISES:

Calculate the distance from player 3 to the group of players 1 & 2 using the following three linkage methods.

# Extract the pair distances
# (reading the matrix above: d(1,2) = 11, d(1,3) = 16, d(2,3) = 18;
#  the CSV import loses the dist structure, so we record the values directly)
distance_1_2 <- 11
distance_1_3 <- 16
distance_2_3 <- 18

Complete: the resulting distance is based on the maximum.

# Calculate the complete distance between group 1-2 and 3
complete <- max(c(distance_1_3, distance_2_3))
complete
## [1] 18

Single: the resulting distance is based on the minimum.

# Calculate the single distance between group 1-2 and 3
single <- min(c(distance_1_3, distance_2_3))
single
## [1] 16

Average: the resulting distance is based on the average.

# Calculate the average distance between group 1-2 and 3
average <- mean(c(distance_1_3, distance_2_3))
average
## [1] 17

Capturing K clusters

Assign cluster membership

In this exercise you will leverage the hclust() function to calculate the iterative linkage steps and you will use the cutree() function to extract the cluster assignments for the desired number (k) of clusters.

You are given the positions of 12 players at the start of a 6v6 soccer match. This is stored in the lineup data frame.

You know that this match has two teams (k = 2), let’s use the clustering methods you learned to assign which team each player belongs in based on their position.

Notes:

The linkage method can be passed via the method parameter: hclust(distance_matrix, method = "complete")
Remember that in soccer opposing teams start on their half of the field.
Because these positions are measured using the same scale we do not need to re-scale our data.

Exercise:

Calculate the Euclidean distance matrix dist_players among all twelve players

# Calculate the Distance
dist_players <- dist(lineup)

Perform the complete linkage calculation for hierarchical clustering using hclust and store this as hc_players

# Perform the hierarchical clustering using the complete linkage
hc_players <- hclust(dist_players, method = "complete")

Build the cluster assignment vector clusters_k2 using cutree() with a k = 2

# Calculate the assignment vector with a k of 2
clusters_k2 <- cutree(hc_players, k = 2)
clusters_k2
##  [1] 1 1 2 2 1 1 1 2 2 2 1 2

Append the cluster assignments as a column cluster to the lineup data frame and save the results to a new data frame called lineup_k2_complete

# Create a new data frame storing these results
lineup_k2_complete <- mutate(lineup, cluster = clusters_k2)

Exploring the clusters

Because clustering analysis is always in part qualitative, it is incredibly important to have the necessary tools to explore the results of the clustering.

In this exercise you will explore the data frame you created in the previous exercise, lineup_k2_complete.

Reminder: The lineup_k2_complete data frame contains the x & y positions of 12 players at the start of a 6v6 soccer game to which you have added clustering assignments based on the following parameters:

Distance: Euclidean
Number of Clusters (k): 2
Linkage Method: Complete

EXERCISES:

Using count() from dplyr, count the number of players assigned to each cluster.

head(lineup_k2_complete)
##   X   x  y cluster
## 1 1  -1  1       1
## 2 2  -2 -3       1
## 3 3   8  6       2
## 4 4   7 -8       2
## 5 5 -12  8       1
## 6 6 -15  0       1
# Count the cluster assignments
count(lineup_k2_complete, cluster)
##   cluster n
## 1       1 6
## 2       2 6

Using ggplot(), plot the positions of the players and color them by cluster assignment.

# Plot the positions of the players and color them using their cluster
ggplot(lineup_k2_complete, aes(x = x, y = y, color = factor(cluster))) +
  geom_point()

Visualizing the dendrogram

Comparing average, single & complete linkage

You are now ready to analyze the clustering results of the lineup dataset using the dendrogram plot. This will give you a new perspective on the effect the decision of the linkage method has on your resulting cluster analysis.

EXERCISES

# Prepare the Distance Matrix
dist_players <- dist(lineup)

Perform the linkage calculation for hierarchical clustering using the linkages: complete, single and average

# Generate hclust for complete, single & average linkage methods
hc_complete <- hclust(dist_players, method = "complete")
hc_single <- hclust(dist_players, method = "single")
hc_average <- hclust(dist_players, method = "average")

Plot the three dendrograms side by side and review the changes

# Plot & Label the 3 Dendrograms Side-by-Side
# Hint: To see these Side-by-Side run the 4 lines together as one command
par(mfrow = c(1,3))
plot(hc_complete, main = 'Complete Linkage')

plot(hc_single, main = 'Single Linkage')

plot(hc_average, main = 'Average Linkage')

Cutting the tree

Clusters based on height

In previous exercises you have grouped your observations into clusters using a pre-defined number of clusters (k). In this exercise you will leverage the visual representation of the dendrogram in order to group your observations into clusters using a maximum height (h), below which clusters form.

You will work with the color_branches() function from the dendextend library in order to visually inspect the clusters that form at any height along the dendrogram.

The hc_players object has been carried over from your previous work with the soccer line-up data.

EXERCISES:

Create a dendrogram object dend_players from your hclust result using the function as.dendrogram()

library(dendextend)
dist_players <- dist(lineup, method = 'euclidean')
hc_players <- hclust(dist_players, method = "complete")

# Create a dendrogram object from the hclust variable
dend_players <- as.dendrogram(hc_players)

Plot the dendrogram

# Plot the dendrogram
plot(dend_players)

Using the color_branches() function create & plot a new dendrogram with clusters colored by a cut height of 20

# Color branches by cluster formed from the cut at a height of 20 & plot
dend_20 <- color_branches(dend_players, h = 20)

# Plot the dendrogram with clusters colored below height 20
plot(dend_20)

Repeat the above step with a height of 40

# Color branches by cluster formed from the cut at a height of 40 & plot
dend_40 <- color_branches(dend_players, h = 40)

# Plot the dendrogram with clusters colored below height 40
plot(dend_40)

Exploring the branches cut from the tree

The cutree() function you used in exercises 5 & 6 can also be used to cut a tree at a given height by using the h parameter. Take a moment to explore the clusters you have generated from the previous exercises based on the heights 20 & 40.

EXERCISES:

Build the cluster assignment vector clusters_h20 using cutree() with a h = 20

dist_players <- dist(lineup, method = 'euclidean')
hc_players <- hclust(dist_players, method = "complete")

# Calculate the assignment vector with a h of 20
clusters_h20 <- cutree(hc_players, h = 20)

Append the cluster assignments as a column cluster to the lineup data frame and save the results to a new data frame called lineup_h20_complete

# Create a new data frame storing these results
lineup_h20_complete <- mutate(lineup, cluster = clusters_h20)

Repeat the above two steps for a height of 40, generating the variables clusters_h40 and lineup_h40_complete

# Calculate the assignment vector with a h of 40
clusters_h40 <- cutree(hc_players, h = 40)

Use ggplot2 to create a scatter plot, colored by the cluster assignment for both heights

# Create a new data frame storing these results
lineup_h40_complete <- mutate(lineup, cluster = clusters_h40)
lineup_h40_complete
##     X   x   y cluster
## 1   1  -1   1       1
## 2   2  -2  -3       1
## 3   3   8   6       2
## 4   4   7  -8       2
## 5   5 -12   8       1
## 6   6 -15   0       1
## 7   7 -13 -10       1
## 8   8  15  16       2
## 9   9  21   2       2
## 10 10  12 -15       2
## 11 11 -25   1       1
## 12 12  26   0       2
# Plot the positions of the players and color them using their cluster for height = 20
ggplot(lineup_h20_complete, aes(x = x, y = y, color = factor(cluster))) +
  geom_point()

# Plot the positions of the players and color them using their cluster for height = 40
ggplot(lineup_h40_complete, aes(x = x, y = y, color = factor(cluster))) +
  geom_point()
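
To see how the two cuts relate, note that every cluster from the h = 40 cut should contain whole clusters from the h = 20 cut, since raising the cut height only merges branches. A quick cross-tabulation sketch:

# Compare cluster assignments at the two cut heights
table(h20 = clusters_h20, h40 = clusters_h40)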

Making sense of the clusters

Segment wholesale customers

You’re now ready to use hierarchical clustering to perform market segmentation (i.e. use consumer characteristics to group them into subgroups).

In this exercise you are provided with the amount spent by 45 different clients of a wholesale distributor for the food categories of Milk, Grocery & Frozen. This is stored in the data frame customers_spend. Assign these clients into meaningful clusters.

Note: For this exercise you can assume that because the data is all of the same type (amount spent), you will not need to scale it.

customers_spend <- read.csv("./Data/customers_spend.csv", sep = ",")

EXERCISES:

Calculate the Euclidean distance between the customers and store this in dist_customers

# Calculate Euclidean distance between customers
dist_customers <- dist(customers_spend)

Run hierarchical clustering using complete linkage and store in hc_customers

# Generate a complete linkage analysis 
hc_customers <- hclust(dist_customers, method = "complete")

Plot the dendrogram

# Plot the dendrogram
plot(hc_customers)

Create a cluster assignment vector using a height of 15,000 and store it as clust_customers

# Create a cluster assignment vector at h = 15000
clust_customers <- cutree(hc_customers, h = 15000)

Generate a new data frame segment_customers by appending the cluster assignment as the column cluster to the original customers_spend data frame

# Generate the segmented customers data frame
segment_customers <- mutate(customers_spend, cluster = clust_customers)
head(segment_customers)
##   X  Milk Grocery Frozen cluster
## 1 1 11103   12469    902       1
## 2 2  2013    6550    909       2
## 3 3  1897    5234    417       2
## 4 4  1304    3643   3045       2
## 5 5  3199    6986   1455       2
## 6 6  4560    9965    934       2

Explore wholesale customer clusters

Continuing your work on the wholesale dataset you are now ready to analyze the characteristics of these clusters.

Since you are working with more than 2 dimensions, it would be challenging to visualize a scatter plot of the clusters; instead you will rely on summary statistics to explore them. In this exercise you will analyze the mean amount spent in each cluster for all three categories.

EXERCISE:

Calculate the size of each cluster using count().

dist_customers <- dist(customers_spend)
hc_customers <- hclust(dist_customers)
clust_customers <- cutree(hc_customers, h = 15000)
segment_customers <- mutate(customers_spend, cluster = clust_customers)
head(segment_customers)
##   X  Milk Grocery Frozen cluster
## 1 1 11103   12469    902       1
## 2 2  2013    6550    909       2
## 3 3  1897    5234    417       2
## 4 4  1304    3643   3045       2
## 5 5  3199    6986   1455       2
## 6 6  4560    9965    934       2
# Count the number of customers that fall into each cluster
count(segment_customers, cluster)
##   cluster  n
## 1       1  5
## 2       2 29
## 3       3  5
## 4       4  6

Color & plot the dendrogram using the height of 15,000.

# Color the dendrogram based on the height cutoff
dend_customers <- as.dendrogram(hc_customers)
dend_colored <- color_branches(dend_customers, h = 15000)
# Plot the colored dendrogram
plot(dend_colored)

Calculate the average spending for each category within each cluster using the summarise_all() function.

# Calculate the mean for each category
segment_customers %>% 
  group_by(cluster) %>% 
  summarise_all(list(mean))
## # A tibble: 4 x 5
##   cluster     X   Milk Grocery Frozen
##     <int> <dbl>  <dbl>   <dbl>  <dbl>
## 1       1  16.2 16950   12891.   991.
## 2       2  21.9  2513.   5229.  1796.
## 3       3  18.6 10452.  22551.  1355.
## 4       4  37.5  1250.   3917. 10889.

Introduction to K-means

K-means on a soccer field

In the previous chapter you used the lineup dataset to learn about hierarchical clustering; in this chapter you will use the same data to learn about k-means clustering. As a reminder, the lineup data frame contains the positions of 12 players at the start of a 6v6 soccer match.

Just like before, you know that this match has two teams on the field so you can perform a k-means analysis using k = 2 in order to determine which player belongs to which team.

Note that in the kmeans() function k is specified using the centers parameter.

EXERCISES:
head(lineup)
##   X   x  y
## 1 1  -1  1
## 2 2  -2 -3
## 3 3   8  6
## 4 4   7 -8
## 5 5 -12  8
## 6 6 -15  0

Build a k-means model called model_km2 for the lineup data using the kmeans() function with centers = 2

# Build a kmeans model
model_km2 <- kmeans(lineup, centers = 2)
head(model_km2)
## $cluster
##  [1] 1 1 2 2 1 1 1 2 2 2 1 2
## 
## $centers
##          X         x          y
## 1 5.333333 -11.33333 -0.5000000
## 2 7.666667  14.83333  0.1666667
## 
## $totss
## [1] 3632.917
## 
## $withinss
## [1] 636.1667 925.0000
## 
## $tot.withinss
## [1] 1561.167
## 
## $betweenss
## [1] 2071.75

Extract the vector of cluster assignments from the model model_km2$cluster and store this in the variable clust_km2

# Extract the cluster assignment vector from the kmeans model
clust_km2 <- model_km2$cluster

Append the cluster assignments as a column cluster to the lineup data frame and save the results to a new data frame called lineup_km2 Use ggplot to plot the positions of each player on the field and color them by their cluster

# Create a new data frame appending the cluster assignment
lineup_km2 <- mutate(lineup, cluster = model_km2$cluster)
head(lineup_km2)
##   X   x  y cluster
## 1 1  -1  1       1
## 2 2  -2 -3       1
## 3 3   8  6       2
## 4 4   7 -8       2
## 5 5 -12  8       1
## 6 6 -15  0       1
# Plot the positions of the players and color them using their cluster
ggplot(lineup_km2, aes(x = x, y = y, color = factor(cluster))) +
  geom_point()

K-means on a soccer field (part 2)

In the previous exercise you successfully used the k-means algorithm to cluster the two teams from the lineup data frame. This time, let’s explore what happens when you use a k of 3.

You will see that the algorithm will still run, but does it actually make sense in this context…

EXERCISES

head(lineup)
##   X   x  y
## 1 1  -1  1
## 2 2  -2 -3
## 3 3   8  6
## 4 4   7 -8
## 5 5 -12  8
## 6 6 -15  0

Build a k-means model called model_km3 for the lineup data using the kmeans() function with centers = 3

# Build a kmeans model
model_km3 <- kmeans(lineup, 3)
head(model_km3)
## $cluster
##  [1] 3 3 1 2 3 3 3 1 1 2 3 1
## 
## $centers
##          X         x     y
## 1 8.000000  17.50000   6.0
## 2 7.000000   9.50000 -11.5
## 3 5.333333 -11.33333  -0.5
## 
## $totss
## [1] 3632.917
## 
## $withinss
## [1] 375.0000  55.0000 636.1667
## 
## $tot.withinss
## [1] 1066.167
## 
## $betweenss
## [1] 2566.75

Extract the vector of cluster assignments from the model model_km3$cluster and store this in the variable clust_km3

# Extract the cluster assignment vector from the kmeans model
clust_km3 <- model_km3$cluster

Append the cluster assignments as a column cluster to the lineup data frame and save the results to a new data frame called lineup_km3

# Create a new data frame appending the cluster assignment
lineup_km3 <- mutate(lineup, cluster = model_km3$cluster)
head(lineup_km3)
##   X   x  y cluster
## 1 1  -1  1       3
## 2 2  -2 -3       3
## 3 3   8  6       1
## 4 4   7 -8       2
## 5 5 -12  8       3
## 6 6 -15  0       3

Use ggplot to plot the positions of each player on the field and color them by their cluster

# Plot the positions of the players and color them using their cluster
ggplot(lineup_km3, aes(x = x, y = y, color = factor(cluster))) +
  geom_point()

Evaluating different values of K by eye

Many K’s many models

While the lineup dataset clearly has a known value of k, the optimal number of clusters often isn’t known in advance and must be estimated.

In this exercise you will leverage map_dbl() from the purrr library to run k-means using values of k ranging from 1 to 10 and extract the total within-cluster sum of squares metric from each one. This will be the first step towards visualizing the elbow plot.

EXERCISES:

library(purrr)
head(lineup)
##   X   x  y
## 1 1  -1  1
## 2 2  -2 -3
## 3 3   8  6
## 4 4   7 -8
## 5 5 -12  8
## 6 6 -15  0

Use map_dbl() to run kmeans() on the lineup data for k values ranging from 1 to 10 and extract the total within-cluster sum of squares from each model (model$tot.withinss). Store the resulting vector as tot_withinss.

# Use map_dbl to run many models with varying value of k (centers)
tot_withinss <- map_dbl(1:10,  function(k){
  model <- kmeans(x = lineup, centers = k)
  model$tot.withinss
})

Build a new data frame elbow_df containing the values of k and the vector of total within-cluster sum of squares

# Generate a data frame containing both k and tot_withinss
elbow_df <- data.frame(
  k = 1:10 ,
  tot_withinss = tot_withinss
)

Elbow (Scree) plot

In the previous exercises you have calculated the total within-cluster sum of squares for values of k ranging from 1 to 10. You can visualize this relationship using a line plot to create what is known as an elbow plot (or scree plot).

When looking at an elbow plot you want to see a sharp decline from one k to another followed by a more gradual decrease in slope. The last value of k before the slope of the plot levels off suggests a “good” value of k.
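If you want a numeric companion to the visual inspection, one simple sketch (not part of the exercise) is to look at how much tot_withinss drops for each additional cluster; past the elbow the drops become small. This assumes the tot_withinss vector from the previous exercise (recomputed in the code below) is in memory:

# How much does tot_withinss drop when moving from k-1 clusters to k?
data.frame(
  k = 2:10,
  drop_in_withinss = -diff(tot_withinss)
)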

EXERCISES:

Continuing your work from the previous exercise, use the values in elbow_df to plot a line plot showing the relationship between k and total within-cluster sum of squares

# Use map_dbl to run many models with varying value of k (centers)
tot_withinss <- map_dbl(1:10,  function(k){
  model <- kmeans(x = lineup, centers = k)
  model$tot.withinss
})

# Generate a data frame containing both k and tot_withinss
elbow_df <- data.frame(
  k = 1:10,
  tot_withinss = tot_withinss
)
head(elbow_df)
##   k tot_withinss
## 1 1    3632.9167
## 2 2    1561.1667
## 3 3    1066.1667
## 4 4     723.7500
## 5 5     462.2500
## 6 6     321.3333
# Plot the elbow plot
ggplot(elbow_df, aes(x = k, y = tot_withinss)) +
  geom_line() +
  scale_x_continuous(breaks = 1:10)

Silhouette analysis: observation level performance

The logic:

A silhouette width close to 1 means the observation fits well in its assigned cluster; close to 0 means it sits on the border between two clusters; close to -1 means it probably belongs to another cluster.

Silhouette analysis

Silhouette analysis allows you to calculate how similar each observation is to the cluster it is assigned to, relative to the other clusters. This metric (silhouette width) ranges from -1 to 1 for each observation in your data and can be interpreted as follows:

Values close to 1 suggest that the observation is well matched to its assigned cluster. Values close to 0 suggest that the observation is borderline between two clusters. Values close to -1 suggest that the observation may be assigned to the wrong cluster.

In this exercise you will leverage the pam() and silhouette() functions from the cluster library to perform silhouette analysis and compare the results of models with a k of 2 and a k of 3. You’ll continue working with the lineup dataset.

Pay close attention to the silhouette plots: does each observation clearly belong to its assigned cluster for k = 3?

EXERCISES:

library(cluster)
head(lineup)
##   X   x  y
## 1 1  -1  1
## 2 2  -2 -3
## 3 3   8  6
## 4 4   7 -8
## 5 5 -12  8
## 6 6 -15  0

Generate a k-medoids (PAM) model pam_k2 using pam() with k = 2 on the lineup data.

# Generate a k-medoids model using the pam() function with k = 2
pam_k2 <- pam(lineup, k = 2)

Plot the silhouette analysis using plot(silhouette(model)).

# Plot the silhouette visual for the pam_k2 model
plot(silhouette(pam_k2))
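The same information is also available numerically on the model object, if you prefer numbers to the plot. A small sketch; the component names below are those returned by pam() in the cluster package:

# Per-observation silhouette widths (columns: cluster, neighbor, sil_width)
pam_k2$silinfo$widths

# Average silhouette width across all observations
pam_k2$silinfo$avg.width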

Repeat the first two steps for k = 3, saving the model as pam_k3.

# Generate a k-medoids model using the pam() function with k = 3
pam_k3 <- pam(lineup, k = 3)

Make sure to review the differences between the two plots before proceeding, paying special attention to observation 3 in the pam_k3 plot.

# Plot the silhouette visual for the pam_k3 model
plot(silhouette(pam_k3))
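To make the comparison concrete, a quick sketch (not part of the exercise) that contrasts the two models numerically and flags any observation with a negative width, i.e. one that may be assigned to the wrong cluster:

# Compare average silhouette widths for k = 2 and k = 3
pam_k2$silinfo$avg.width
pam_k3$silinfo$avg.width

# Observations with negative silhouette widths in the k = 3 model
pam_k3$silinfo$widths[pam_k3$silinfo$widths[, "sil_width"] < 0, , drop = FALSE]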

Making sense of the K-means clusters

Revisiting wholesale data: “Best” k

At the end of Chapter 2 you explored wholesale distributor data customers_spend using hierarchical clustering. This time you will analyze this data using the k-means clustering tools covered in this chapter.

The first step will be to determine the “best” value of k using average silhouette width.

A refresher about the data: it contains records of the amount spent by 45 different clients of a wholesale distributor for the food categories of Milk, Grocery & Frozen. This is stored in the data frame customers_spend. For this exercise you can assume that, because the data is all of the same type (amount spent), you will not need to scale it.
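If you would rather verify that assumption than take it on faith, a quick sketch (not part of the exercise) is to inspect the spread of the three columns; the shared unit (amount spent) is what makes skipping the scaling step defensible, and checking the spread tells you how much weight each category will carry in the distance calculation:

# Inspect the scale of the three spending columns
summary(customers_spend[, c("Milk", "Grocery", "Frozen")])
sapply(customers_spend[, c("Milk", "Grocery", "Frozen")], sd)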

EXERCISES:

Use map_dbl() to run pam() on the customers_spend data for k values ranging from 2 to 10 and extract the average silhouette width value from each model (model$silinfo$avg.width). Store the resulting vector as sil_width.

head(customers_spend)
##   X  Milk Grocery Frozen
## 1 1 11103   12469    902
## 2 2  2013    6550    909
## 3 3  1897    5234    417
## 4 4  1304    3643   3045
## 5 5  3199    6986   1455
## 6 6  4560    9965    934
# Use map_dbl to run many models with varying value of k
sil_width <- map_dbl(2:10,  function(k){
  model <- pam(x = customers_spend, k = k)
  model$silinfo$avg.width
})

Build a new data frame sil_df containing the values of k and the vector of average silhouette widths

# Generate a data frame containing both k and sil_width
sil_df <- data.frame(
  k = 2:10,
  sil_width = sil_width
)
head(sil_df)
##   k sil_width
## 1 2 0.5842268
## 2 3 0.3741894
## 3 4 0.4225022
## 4 5 0.4397856
## 5 6 0.3609155
## 6 7 0.3270821

Use the values in sil_df to plot a line plot showing the relationship between k and average silhouette width

# Plot the relationship between k and sil_width
ggplot(sil_df, aes(x = k, y = sil_width)) +
  geom_line() +
  scale_x_continuous(breaks = 2:10)

Revisiting wholesale data: Exploration

From the previous analysis you have found that k = 2 has the highest average silhouette width. In this exercise you will continue to analyze the wholesale customer data by building and exploring a kmeans model with 2 clusters.

EXERCISES:

Build a k-means model called model_customers for the customers_spend data using the kmeans() function with centers = 2.

set.seed(42)

# Build a k-means model for the customers_spend with a k of 2
model_customers <- kmeans(customers_spend,  2)
head(model_customers)
## $cluster
##  [1] 2 1 1 1 1 1 1 1 2 1 1 2 2 2 1 1 2 1 1 2 1 1 2 1 1 1 2 1 1 1 1 1 1 1 1 1 1 2
## [39] 1 1 1 1 1 1 1
## 
## $centers
##      X      Milk Grocery   Frozen
## 1 24.6  2296.257    5004 3354.343
## 2 17.4 13701.100   17721 1173.000
## 
## $totss
## [1] 3896613605
## 
## $withinss
## [1] 947228299 642878787
## 
## $tot.withinss
## [1] 1590107086
## 
## $betweenss
## [1] 2306506519

Extract the vector of cluster assignments from the model model_customers$cluster and store this in the variable clust_customers.

# Extract the vector of cluster assignments from the model
clust_customers <- model_customers$cluster

Append the cluster assignments as a column cluster to the customers_spend data frame and save the results to a new data frame called segment_customers.

# Build the segment_customers data frame
segment_customers <- mutate(customers_spend, cluster = clust_customers)
head(segment_customers)
##   X  Milk Grocery Frozen cluster
## 1 1 11103   12469    902       2
## 2 2  2013    6550    909       1
## 3 3  1897    5234    417       1
## 4 4  1304    3643   3045       1
## 5 5  3199    6986   1455       1
## 6 6  4560    9965    934       1

Calculate the size of each cluster using count().

# Calculate the size of each cluster
count(segment_customers, cluster)
##   cluster  n
## 1       1 35
## 2       2 10
# Calculate the mean for each category
segment_customers %>% 
  group_by(cluster) %>% 
  summarise_all(list(mean))
## # A tibble: 2 x 5
##   cluster     X   Milk Grocery Frozen
##     <int> <dbl>  <dbl>   <dbl>  <dbl>
## 1       1  24.6  2296.    5004  3354.
## 2       2  17.4 13701.   17721  1173
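A natural follow-up (not part of the exercise) is to visualize these segment profiles. A minimal sketch, assuming tidyr is also loaded for gather(); the X index column is dropped since it carries no meaning:

# Visualize average spend per category for each customer segment
segment_customers %>% 
  select(-X) %>% 
  group_by(cluster) %>% 
  summarise_all(list(mean)) %>% 
  gather(key = category, value = avg_spend, -cluster) %>% 
  ggplot(aes(x = category, y = avg_spend, fill = factor(cluster))) +
  geom_col(position = "dodge")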

Case Study: National Occupational Mean Wage

Hierarchical clustering: Occupation trees

In the previous exercise you learned that the oes data is ready for hierarchical clustering without any preprocessing. In this exercise you will take the steps necessary to build a dendrogram of occupations based on their yearly average salaries and propose clusters using a cut height of 100,000.

EXERCISE:

library(dendextend)  # provides color_branches() used below
oes <- as.data.frame(readRDS("./Data/oes.rds"))
head(oes)
##                            2001  2002  2003  2004  2005  2006  2007   2008
## Management                70800 78870 83400 87090 88450 91930 96150 100310
## Business Operations       50580 53350 56000 57120 57930 60000 62410  64720
## Computer Science          60350 61630 64150 66370 67100 69240 72190  74500
## Architecture/Engineering  56330 58020 60390 63060 63910 66190 68880  71430
## Life/Physical/Social Sci. 49710 52380 54930 57550 58030 59660 62020  64280
## Community Services        34190 34630 35800 37050 37530 39000 40540  41790
##                             2010   2011   2012   2013   2014   2015   2016
## Management                105440 107410 108570 110550 112490 115020 118020
## Business Operations        67690  68740  69550  71020  72410  73800  75070
## Computer Science           77230  78730  80180  82010  83970  86170  87880
## Architecture/Engineering   75550  77120  79000  80100  81520  82980  84300
## Life/Physical/Social Sci.  66390  67470  68360  69400  70070  71220  72930
## Community Services         43180  43830  44240  44710  45310  46160  47200

Calculate the Euclidean distance between the occupations and store this in dist_oes

# Calculate Euclidean distance between the occupations
dist_oes <- dist(oes, method = "euclidean")

Run hierarchical clustering using average linkage and store in hc_oes

# Generate an average linkage analysis 
hc_oes <- hclust(dist_oes, method = "average")

Create a dendrogram object dend_oes from your hclust result using the function as.dendrogram()

# Create a dendrogram object from the hclust variable
dend_oes <- as.dendrogram(hc_oes)

Plot the dendrogram

# Plot the dendrogram
plot(dend_oes)

Using the color_branches() function, create and plot a new dendrogram with clusters colored by a cut height of 100,000

# Color branches by cluster formed from the cut at a height of 100000
dend_colored <- color_branches(dend_oes, h = 100000)

# Plot the colored dendrogram
plot(dend_colored)

Hierarchical clustering: Preparing for exploration

You have now created a potential clustering for the oes data. Before you can explore these clusters with ggplot2, you will need to process the oes data matrix into a tidy data frame with each occupation assigned its cluster.

EXERCISES

library(tibble)
library(tidyr)
dist_oes <- dist(oes, method = 'euclidean')
hc_oes <- hclust(dist_oes, method = 'average')

Create the df_oes data frame from the oes data matrix, making sure to store the row names as a column (use rownames_to_column() from the tibble library)

# Use rownames_to_column to move the rownames into a column of the data frame
df_oes <- rownames_to_column(as.data.frame(oes), var = 'occupation')

Build the cluster assignment vector cut_oes using cutree() with h = 100,000

# Create a cluster assignment vector at h = 100,000
cut_oes <- cutree(hc_oes, h = 100000)

Append the cluster assignments as a column cluster to the df_oes data frame and save the results to a new data frame called clust_oes

# Generate the segmented oes data frame
clust_oes <- mutate(df_oes, cluster = cut_oes)

Use the gather() function from the tidyr library to reshape the data into a format amenable to ggplot2 analysis, and save the tidied data frame as gathered_oes

# Create a tidy data frame by gathering the year and values into two columns
gathered_oes <- gather(data = clust_oes, 
                       key = year, 
                       value = mean_salary, 
                       -occupation, -cluster)
head(gathered_oes)
##                  occupation cluster year mean_salary
## 1                Management       1 2001       70800
## 2       Business Operations       2 2001       50580
## 3          Computer Science       2 2001       60350
## 4  Architecture/Engineering       2 2001       56330
## 5 Life/Physical/Social Sci.       2 2001       49710
## 6        Community Services       3 2001       34190
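Note that gather() has since been superseded in tidyr; if you are running tidyr 1.0 or later, a roughly equivalent call (a sketch, same column names assumed) is pivot_longer():

# Equivalent reshaping with pivot_longer() (tidyr >= 1.0)
gathered_oes <- pivot_longer(clust_oes,
                             cols = -c(occupation, cluster),
                             names_to = "year",
                             values_to = "mean_salary")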

Hierarchical clustering: Plotting occupational clusters

You have successfully created all the parts necessary to explore the results of this hierarchical clustering work. In this exercise you will leverage the named assignment vector cut_oes and the tidy data frame gathered_oes to analyze the resulting clusters.

EXERCISES:

View the cluster assignment of each occupation by sorting the cut_oes vector using sort()

# View the clustering assignments by sorting the cluster assignment vector
sort(cut_oes)
##                 Management                      Legal 
##                          1                          1 
##        Business Operations           Computer Science 
##                          2                          2 
##   Architecture/Engineering  Life/Physical/Social Sci. 
##                          2                          2 
##   Healthcare Practitioners         Community Services 
##                          2                          3 
## Education/Training/Library  Arts/Design/Entertainment 
##                          3                          3 
##         Healthcare Support         Protective Service 
##                          3                          3 
##           Food Preparation  Grounds Cleaning & Maint. 
##                          3                          3 
##              Personal Care                      Sales 
##                          3                          3 
##      Office Administrative   Farming/Fishing/Forestry 
##                          3                          3 
##               Construction Installation/Repair/Maint. 
##                          3                          3 
##                 Production      Transportation/Moving 
##                          3                          3

Use ggplot2 to plot each occupation’s average income by year and color the lines by the occupation’s assigned cluster.

# Plot the relationship between mean_salary and year and color the lines by the assigned cluster
ggplot(gathered_oes, aes(x = year, y = mean_salary, color = factor(cluster))) + 
    geom_line(aes(group = occupation))
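One caveat: after gather(), the year column is a character vector, so the x-axis above is treated as discrete. If you would rather have a continuous time axis, a small sketch (same data frame assumed):

# Convert year to numeric for a continuous x-axis
gathered_oes <- mutate(gathered_oes, year = as.numeric(year))

ggplot(gathered_oes, aes(x = year, y = mean_salary, color = factor(cluster))) + 
    geom_line(aes(group = occupation))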

Reviewing the HC results

K-means: Elbow analysis

In the previous exercises you used the dendrogram to propose a clustering that generated 3 clusters. In this exercise you will leverage the k-means elbow plot to propose the “best” number of clusters.

EXERCISE:

head(oes)
##                            2001  2002  2003  2004  2005  2006  2007   2008
## Management                70800 78870 83400 87090 88450 91930 96150 100310
## Business Operations       50580 53350 56000 57120 57930 60000 62410  64720
## Computer Science          60350 61630 64150 66370 67100 69240 72190  74500
## Architecture/Engineering  56330 58020 60390 63060 63910 66190 68880  71430
## Life/Physical/Social Sci. 49710 52380 54930 57550 58030 59660 62020  64280
## Community Services        34190 34630 35800 37050 37530 39000 40540  41790
##                             2010   2011   2012   2013   2014   2015   2016
## Management                105440 107410 108570 110550 112490 115020 118020
## Business Operations        67690  68740  69550  71020  72410  73800  75070
## Computer Science           77230  78730  80180  82010  83970  86170  87880
## Architecture/Engineering   75550  77120  79000  80100  81520  82980  84300
## Life/Physical/Social Sci.  66390  67470  68360  69400  70070  71220  72930
## Community Services         43180  43830  44240  44710  45310  46160  47200

Use map_dbl() to run kmeans() using the oes data for k values ranging from 1 to 10 and extract the total within-cluster sum of squares value from each model: model$tot.withinss

# Use map_dbl to run many models with varying value of k (centers)
tot_withinss <- map_dbl(1:10,  function(k){
  model <- kmeans(x = oes, centers = k)
  model$tot.withinss
})

Store the resulting vector as tot_withinss. Then build a new data frame elbow_df containing the values of k and the vector of total within-cluster sum of squares.

# Generate a data frame containing both k and tot_withinss
elbow_df <- data.frame(
  k = 1:10,
  tot_withinss = tot_withinss
)
head(elbow_df)
##   k tot_withinss
## 1 1 161931542045
## 2 2  36135001888
## 3 3  22076716754
## 4 4   6527999979
## 5 5   5294165573
## 6 6   2496109577

Use the values in elbow_df to plot a line plot showing the relationship between k and total within-cluster sum of squares

# Plot the elbow plot
ggplot(elbow_df, aes(x = k, y = tot_withinss)) +
  geom_line() +
  scale_x_continuous(breaks = 1:10)

K-means: Average Silhouette Widths

So hierarchical clustering resulted in 3 clusters and the elbow method suggests 2. In this exercise, use average silhouette widths to explore what the “best” value of k should be.

EXERCISE:

Use map_dbl() to run pam() on the oes data for k values ranging from 2 to 10 and extract the average silhouette width value from each model (model$silinfo$avg.width). Store the resulting vector as sil_width.

# Use map_dbl to run many models with varying value of k
sil_width <- map_dbl(2:10,  function(k){
  model <- pam(oes, k = k)
  model$silinfo$avg.width
})

Build a new data frame sil_df containing the values of k and the vector of average silhouette widths

# Generate a data frame containing both k and sil_width
sil_df <- data.frame(
  k = 2:10,
  sil_width = sil_width
)
head(sil_df)
##   k sil_width
## 1 2 0.6809574
## 2 3 0.5665783
## 3 4 0.6300043
## 4 5 0.6637260
## 5 6 0.6361872
## 6 7 0.6915586

Use the values in sil_df to plot a line plot showing the relationship between k and average silhouette width

# Plot the relationship between k and sil_width
ggplot(sil_df, aes(x = k, y = sil_width)) +
  geom_line() +
  scale_x_continuous(breaks = 2:10)

It seems that this analysis points to yet another value of k; this time 7 is the top contender (although 2 comes very close).
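If you want to read the winner off the data rather than the plot, a one-line sketch using the sil_df just built:

# k with the highest average silhouette width
sil_df[which.max(sil_df$sil_width), ]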