Exploratory Data Analysis - Course Project 2

This is an R Markdown for my second Course Project in the Coursera class Exploratory Data Analysis (from the Johns Hopkins University). Code for this project is kept in my GitHub repo here.

Course Project Prompt:

You must address the following questions and tasks in your exploratory analysis. For each question/task you will need to make a single plot. Unless specified, you can use any plotting system in R to make your plot.

Have total emissions from PM2.5 decreased in the United States from 1999 to 2008? Using the base plotting system, make a plot showing the total PM2.5 emission from all sources for each of the years 1999, 2002, 2005, and 2008.
Have total emissions from PM2.5 decreased in Baltimore City, Maryland (fips == “24510”) from 1999 to 2008? Use the base plotting system to make a plot answering this question.
Of the four types of sources indicated by the ðð¢ðð (point, nonpoint, onroad, nonroad) variable, which of these four sources have seen decreases in emissions from 1999â2008 for Baltimore City? Which have seen increases in emissions from 1999â2008? Use the ggplot2 plotting system to make a plot answer this question.
Across the United States, how have emissions from coal combustion-related sources changed from 1999â2008?
How have emissions from motor vehicle sources changed from 1999â2008 in Baltimore City?
Compare emissions from motor vehicle sources in Baltimore City with emissions from motor vehicle sources in Los Angeles County, California (ðððð == “ð¶ð¼ð¶ð¹ð½”). Which city has seen greater changes over time in motor vehicle emissions?

Load R Packages

We begin by loading R packages to simplify the task.

library(dplyr)
library(ggplot2)
library(scales)
library(data.table)

Data Input

The .rds files are located here. Files were unzipped and placed in a folder in my working directory. First step is to read .rds files in as data.frames using base function readRDS.

setwd("~/Documents/Dokumente - MacBook Air/Data Science with R - Johns Hopkins University/C4")
SCC <- readRDS("./exdata-data-NEI_data/Source_Classification_Code.rds")
NEI <- readRDS("./exdata-data-NEI_data/summarySCC_PM25.rds")

Initial Data Exploration

Let’s start with the NEI data. This file contains a data frame with all of the PM2.5 emissions data for 1999, 2002, 2005 and 2008. For each year, the table contains number of tons of PM2.5 emitted from a specific type of source for the entire year. Here are the first few rows.

head(NEI)

##     fips      SCC Pollutant Emissions  type year
## 4  09001 10100401  PM25-PRI    15.714 POINT 1999
## 8  09001 10100404  PM25-PRI   234.178 POINT 1999
## 12 09001 10100501  PM25-PRI     0.128 POINT 1999
## 16 09001 10200401  PM25-PRI     2.036 POINT 1999
## 20 09001 10200504  PM25-PRI     0.388 POINT 1999
## 24 09001 10200602  PM25-PRI     1.490 POINT 1999

Calling str on NEI reveals the types of data in each column.

str(NEI)

## 'data.frame':    6497651 obs. of  6 variables:
##  $ fips     : chr  "09001" "09001" "09001" "09001" ...
##  $ SCC      : chr  "10100401" "10100404" "10100501" "10200401" ...
##  $ Pollutant: chr  "PM25-PRI" "PM25-PRI" "PM25-PRI" "PM25-PRI" ...
##  $ Emissions: num  15.714 234.178 0.128 2.036 0.388 ...
##  $ type     : chr  "POINT" "POINT" "POINT" "POINT" ...
##  $ year     : int  1999 1999 1999 1999 1999 1999 1999 1999 1999 1999 ...

Nearly 6.5M observations! Here are some descriptions of each column:

fips: A five-digit number (represented as a string) indicating the U.S. county
SCC: The name of the source as indicated by a digit string (see source code classification table)
Pollutant: A string indicating the pollutant
Emission: Amount of PM2.5 emitted, in tons
type: The type of source (point, non-point, on-road, or non-road)
year: The year of emissions recorded

Let’s take a look at the other data set, Source Classification Code (SCC) table. This table provides a mapping from the SCC digit strings in the Emissions table to the actual name of the PM2.5 source. The sources are categorized in a few different ways from more general to more specific and you may choose to explore whatever categories you think are most useful. For example, source â10100101â is known as âExt Comb /Electric Gen /Anthracite Coal /Pulverized Coalâ.

head(SCC)

##        SCC Data.Category
## 1 10100101         Point
## 2 10100102         Point
## 3 10100201         Point
## 4 10100202         Point
## 5 10100203         Point
## 6 10100204         Point
##                                                                   Short.Name
## 1                   Ext Comb /Electric Gen /Anthracite Coal /Pulverized Coal
## 2 Ext Comb /Electric Gen /Anthracite Coal /Traveling Grate (Overfeed) Stoker
## 3       Ext Comb /Electric Gen /Bituminous Coal /Pulverized Coal: Wet Bottom
## 4       Ext Comb /Electric Gen /Bituminous Coal /Pulverized Coal: Dry Bottom
## 5                   Ext Comb /Electric Gen /Bituminous Coal /Cyclone Furnace
## 6                   Ext Comb /Electric Gen /Bituminous Coal /Spreader Stoker
##                                EI.Sector Option.Group Option.Set
## 1 Fuel Comb - Electric Generation - Coal                        
## 2 Fuel Comb - Electric Generation - Coal                        
## 3 Fuel Comb - Electric Generation - Coal                        
## 4 Fuel Comb - Electric Generation - Coal                        
## 5 Fuel Comb - Electric Generation - Coal                        
## 6 Fuel Comb - Electric Generation - Coal                        
##                 SCC.Level.One       SCC.Level.Two               SCC.Level.Three
## 1 External Combustion Boilers Electric Generation               Anthracite Coal
## 2 External Combustion Boilers Electric Generation               Anthracite Coal
## 3 External Combustion Boilers Electric Generation Bituminous/Subbituminous Coal
## 4 External Combustion Boilers Electric Generation Bituminous/Subbituminous Coal
## 5 External Combustion Boilers Electric Generation Bituminous/Subbituminous Coal
## 6 External Combustion Boilers Electric Generation Bituminous/Subbituminous Coal
##                                  SCC.Level.Four Map.To Last.Inventory.Year
## 1                               Pulverized Coal     NA                  NA
## 2             Traveling Grate (Overfeed) Stoker     NA                  NA
## 3 Pulverized Coal: Wet Bottom (Bituminous Coal)     NA                  NA
## 4 Pulverized Coal: Dry Bottom (Bituminous Coal)     NA                  NA
## 5             Cyclone Furnace (Bituminous Coal)     NA                  NA
## 6             Spreader Stoker (Bituminous Coal)     NA                  NA
##   Created_Date Revised_Date Usage.Notes
## 1                                      
## 2                                      
## 3                                      
## 4                                      
## 5                                      
## 6

Calling str on SCC gives an overview of the data.frame structure.

str(SCC)

## 'data.frame':    11717 obs. of  15 variables:
##  $ SCC                : Factor w/ 11717 levels "10100101","10100102",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Data.Category      : Factor w/ 6 levels "Biogenic","Event",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ Short.Name         : Factor w/ 11238 levels "","2,4-D Salts and Esters Prod /Process Vents, 2,4-D Recovery: Filtration",..: 3283 3284 3293 3291 3290 3294 3295 3296 3292 3289 ...
##  $ EI.Sector          : Factor w/ 59 levels "Agriculture - Crops & Livestock Dust",..: 18 18 18 18 18 18 18 18 18 18 ...
##  $ Option.Group       : Factor w/ 25 levels "","C/I Kerosene",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Option.Set         : Factor w/ 18 levels "","A","B","B1A",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ SCC.Level.One      : Factor w/ 17 levels "Brick Kilns",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ SCC.Level.Two      : Factor w/ 146 levels "","Agricultural Chemicals Production",..: 32 32 32 32 32 32 32 32 32 32 ...
##  $ SCC.Level.Three    : Factor w/ 1061 levels "","100% Biosolids (e.g., sewage sludge, manure, mixtures of these matls)",..: 88 88 156 156 156 156 156 156 156 156 ...
##  $ SCC.Level.Four     : Factor w/ 6084 levels "","(NH4)2 SO4 Acid Bath System and Evaporator",..: 4455 5583 4466 4458 1341 5246 5584 5983 4461 776 ...
##  $ Map.To             : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Last.Inventory.Year: int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Created_Date       : Factor w/ 57 levels "","1/27/2000 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Revised_Date       : Factor w/ 44 levels "","1/27/2000 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Usage.Notes        : Factor w/ 21 levels ""," ","includes bleaching towers, washer hoods, filtrate tanks, vacuum pump exhausts",..: 1 1 1 1 1 1 1 1 1 1 ...

Question #1

The first question is: Have total emissions from PM2.5 decreased in the United States from 1999 to 2008? Using the base plotting system, make a plot showing the total PM2.5 emission from all sources for each of the years 1999, 2002, 2005, and 2008.

-Yes, total PM2.5 emissions decreased in the United States from 1999 to 2008.

To answer this, we first make a data.frame called aggregateByYear that uses dplyr group_by, filter, and summarize to add up the tons of emissions for each of the four years. Then we make a numeric vector called pts by dividing the annual emissions by 1,000,000 so that we can plot the annual totals in millions of tons. We then call plot with the arguments below. Of note, we divide the y-value by 1,000,000 to be consistent with the y-axis labels. Also, we set axes = FALSE the call to plot because we are going to manually plot them later. The calls to axis(1) and axis(2...) plots the x- and y-axes respectively and specifies the locations of the tick marks (equal to yrs and pts) and for the y-axis creates the labels by appending an M (for millions) using paste.

A note about labels in base plots: you can create subscripts or superscripts using expression in the label call (ie, main, ylab, etc) with square brackets or a caret, respectively. This is what the [2.5] is doing in the ylab argument in the R chunk below. Since this subscript occurs in the middle of the y-axis label, we use an asterisk to break up the label before the subscript and the label after the subscript.

aggregateByYear <- aggregate(NEI$Emissions, by=list(Category=NEI$year), FUN=sum)
barplot(aggregateByYear$x, names.arg=aggregateByYear$Category, ylab="Total Tons of PM2.5 Emissions", xlab="Year",
        main="Total Tons of PM2.5 Emissions in the United States")

As we can see from this graph, there has been a decline in total tons of PM2.5 emissions in the United States over these 4 years.

Question #2

Have total emissions from PM2.5 decreased in Baltimore City, Maryland (fips == “24510”) from 1999 to 2008? Use the base plotting system to make a plot answering this question.

Yes, as we can see from the graph below, there is an absolute decrease from the year 1999 to the year 2008 in total PM2.5 emissions in Baltimore City; however, it is not a consistently negative trend as indicated by the increase from 2002 to 2005.

NEI_BC <- NEI[NEI$fips == "24510",]
aggregateByYearBaltimoreCity <- aggregate(NEI_BC$Emissions, by=list(Category=NEI_BC$year), FUN=sum)
barplot(aggregateByYearBaltimoreCity$x, names.arg=aggregateByYearBaltimoreCity$Category,
        ylab="Total PM2.5 emission in Baltimore City", xlab="Year", main="Emissions per year in Baltimore City")

Question #3

Of the four types of sources indicated by the ðð¢ðð (point, nonpoint, onroad, nonroad) variable, which of these four sources have seen decreases in emissions from 1999â2008 for Baltimore City? Which have seen increases in emissions from 1999â2008? Use the ggplot2 plotting system to make a plot answer this question.

My answer to this question is: on-road, non-road, and non-point sources have all seen decreases in PM2.5 emissions from 1999-2008 whereas point sources have seen an increase over the same time period.

aYBC_Type <- NEI_BC %>% group_by(year, type) %>% summarise(Emissions = sum(Emissions))

## `summarise()` regrouping output by 'year' (override with `.groups` argument)

ggplot(aYBC_Type, aes(x=type, y=Emissions, fill=factor(year))) +
       geom_bar(stat="identity", position="dodge") +
       scale_fill_discrete((name="Year")) +
       xlab("Type") +
       ylab("Emissions") + 
       ggtitle("Emissions per year and Type in Baltimore City [1999 to 2008]")

Question #4

Across the United States, how have emissions from coal combustion-related sources changed from 1999â2008?

Emissions from coal combustion-relates sources have decreased from 1999-2008, especially from 2005 to 2008.

To show this graphically, we must first subset the SCC data.frame by using grep to pattern match on the string “Fuel Comb”, followed by any number of characters, followed by “Coal” in the EI.Sector column. To determine this was the right pattern, we use unique(SCC$EI.Sector) to give us all 59 levels of this factor (not shown in R chunk below). Subsetting SCC in this way will give us a smaller data.frame with only coal combustion-related source classification codes called scc.coal. From here, we make a list of unique coal-related source classification codes called scc.coal.list. This will be our lookup list to subset the NEI data.frame using the %in% operator, which we save in a data.frame called nei.coal.

From here, we use facets to show coal combustion-related emissions by type. We create a new type called “TOTAL” which is simply an annual total of all coal combustion-related emissions. This way we can show the annual total in the leftmost facet, then the breakout by type in the middle and right facets. Similar to Question #3, we had to re-order the factor levels so that the facets would plot in the order we wish.

coalCombustion <- filter(SCC, grepl("Fuel Comb.*Coal", EI.Sector))
coalCombustion.list <- unique(coalCombustion$SCC)
NEI.coalCombustion <- subset(NEI, SCC %in% coalCombustion.list)
NEI.coalCombustion <- NEI.coalCombustion %>% group_by(type, year) %>% summarize(Annual.Total = sum(Emissions))

## `summarise()` regrouping output by 'type' (override with `.groups` argument)

NEI.coalCombustion.total <- NEI.coalCombustion %>% group_by(year) %>% summarize(Annual.Total = sum(Annual.Total)) %>%
        mutate(type = "TOTAL")

## `summarise()` ungrouping output (override with `.groups` argument)

NEI.coalCombustion <- NEI.coalCombustion %>% select(Annual.Total, type, year)
NEI.coalCombustion <- bind_rows(NEI.coalCombustion, NEI.coalCombustion.total)
NEI.coalCombustion$type <- factor(NEI.coalCombustion$type, levels = c("TOTAL", "ON-ROAD", "NON-ROAD", "POINT", "NONPOINT"))
ggplot(NEI.coalCombustion, aes(x = factor(year), y = Annual.Total, fill = type)) +
        geom_bar(stat = "identity") +
        facet_grid(. ~ type) +
        xlab("year") +
        ylab(expression("Total Tons of PM2.5 Emissions")) +
        ggtitle(expression(atop("Total Tons of PM2.5 Emissions in the US", paste("from Coal Combustion-Related Sources")))) +
        theme(plot.margin = unit(c(1, 1, 1, 1), "cm")) +
        scale_y_continuous(labels = comma) +
        scale_fill_brewer(palette = "Dark2") +
        guides(fill = FALSE)

Question #5

How have emissions from motor vehicle sources changed from 1999â2008 in Baltimore City?

My answer to this question is: emissions from motor vehicle sources have declined from 1999-2008 in Baltimore City.

To arrive at this conclusion, we must first subset the NEI data.frame by the source classification codes for motor vehicle sources. We do this by using grep to pattern match the string “Mobile” followed by any number of characters, followed by the string “Vehicles”.

scc.vehicles <- SCC[grep("Mobile.*Vehicles", SCC$EI.Sector),  ]; # Pattern match mobile vehicles in SCC description
scc.vehicles.list <- unique(scc.vehicles$SCC); # Create motor vehicle lookup list by SCC
nei.vehicles <- subset(NEI, SCC %in% scc.vehicles.list); # Filter for motor vehicle sources
nei.vehicles <- nei.vehicles %>% filter(fips == "24510") # Filter for Baltimore
nei.vehicles <- merge(x = nei.vehicles, y = scc.vehicles[, c("SCC", "SCC.Level.Two", "SCC.Level.Three")], by = "SCC") # Join in descriptive data on SCC codes
nei.vehicles <- nei.vehicles %>% group_by(year, SCC.Level.Two, SCC.Level.Three) %>% summarize(Annual.Total = sum(Emissions))
nei.vehicles.total <- nei.vehicles %>% group_by(year) %>% summarize(Annual.Total = sum(Annual.Total)) %>% mutate(SCC.Level.Two = "Total")
nei.vehicles <- bind_rows(nei.vehicles, nei.vehicles.total);
nei.vehicles$SCC.Level.Two <- factor(nei.vehicles$SCC.Level.Two, levels = c("Total", "Highway Vehicles - Diesel", "Highway Vehicles - Gasoline"));
ggplot(nei.vehicles, aes(x = factor(year), y = Annual.Total, fill = SCC.Level.Two)) +
  geom_bar(stat = "identity") +
  facet_grid(. ~ SCC.Level.Two) +
  xlab("Year") +
  ylab(expression("Total Tons of PM"[2.5]*" Emissions")) + 
  ggtitle(expression(atop("Total Tons of PM"[2.5]*" Emissions in Baltimore City", paste("from Motor Vehicle Sources")))) +
  theme(plot.title = element_text(hjust = 0.5)) + # Center the plot title
  theme(plot.margin = unit(c(1,1,1,1), "cm")) + # Adjust plot margins
  scale_fill_brewer(palette = "Set1") +
  guides(fill = FALSE)

Question #6

Compare emissions from motor vehicle sources in Baltimore City with emissions from motor vehicle sources in Los Angeles County, California (ðððð == “ð¶ð¼ð¶ð¹ð½”). Which city has seen greater changes over time in motor vehicle emissions?

My answer to this question is: Baltimore City has seen a negative 8% compound annual growth rate (CAGR) in emissions from motor vehicles over the 9 years of data whereas Los Angeles has experienced a positive 0.5% CAGR over the same period. The data indicate that motor vehicle emissions have declined in Baltimore City and increased in Los Angeles.

scc.vehicles <- SCC[grep("Mobile.*Vehicles", SCC$EI.Sector),  ]; # Pattern match mobile vehicles in SCC description
scc.vehicles.list <- unique(scc.vehicles$SCC); # Create motor vehicle lookup list by SCC
nei.vehicles <- subset(NEI, SCC %in% scc.vehicles.list); # Filter for motor vehicle sources
nei.vehicles <- nei.vehicles %>% filter(fips == "24510"| fips == "06037"); # Filter for Baltimore City or Los Angeles County
nei.vehicles$fips[nei.vehicles$fips == "24510"] <- "Baltimore";
nei.vehicles$fips[nei.vehicles$fips == "06037"] <- "Los Angeles";
nei.vehicles <- merge(x = nei.vehicles, y = scc.vehicles[, c("SCC", "SCC.Level.Two")], by = "SCC"); # Join in descriptive data on SCC codes
nei.vehicles <- nei.vehicles %>% group_by(fips, year, SCC.Level.Two) %>% summarize(Annual.Total = sum(Emissions));
nei.vehicles.total <- nei.vehicles %>% group_by(fips, year) %>% summarize(Annual.Total = sum(Annual.Total)) %>% mutate(SCC.Level.Two = "Total");
nei.vehicles <- bind_rows(nei.vehicles, nei.vehicles.total);
nei.vehicles$SCC.Level.Two <- factor(nei.vehicles$SCC.Level.Two, levels = c("Total", "Highway Vehicles - Diesel", "Highway Vehicles - Gasoline"));
ggplot(nei.vehicles, aes(x = factor(year), y = Annual.Total, fill = SCC.Level.Two)) +
  geom_bar(stat = "identity") +
  facet_grid(fips ~ SCC.Level.Two) + 
  xlab("Year") +
  ylab(expression("Total Tons of PM"[2.5]*" Emissions")) + 
  ggtitle(expression(atop("Total Tons of PM"[2.5]*" Emissions from Motor Vehicle Sources", paste("in Baltimore City, MD and Los Angeles County, CA")))) +
  theme(plot.title = element_text(hjust = 0.5)) + # Center the plot title
  theme(plot.margin = unit(c(1,1,1,1), "cm")) + # Adjust plot margins
  scale_fill_brewer(palette = "Set1") +
  guides(fill = FALSE)

scc.vehicles <- SCC[grep("Mobile.*Vehicles", SCC$EI.Sector),  ]; # Pattern match mobile vehicles in SCC description
scc.vehicles.list <- unique(scc.vehicles$SCC); # Create motor vehicle lookup list by SCC
nei.vehicles <- subset(NEI, SCC %in% scc.vehicles.list); # Filter for motor vehicle sources
nei.vehicles <- nei.vehicles %>% filter(fips == "24510"| fips == "06037"); # Filter for Baltimore City or Los Angeles County
nei.vehicles$fips[nei.vehicles$fips == "24510"] <- "Baltimore";
nei.vehicles$fips[nei.vehicles$fips == "06037"] <- "Los Angeles";
nei.vehicles <- merge(x = nei.vehicles, y = scc.vehicles[, c("SCC", "SCC.Level.Two")], by = "SCC"); # Join in descriptive data on SCC codes
nei.vehicles <- nei.vehicles %>% group_by(fips, year, SCC.Level.Two) %>% summarize(Annual.Total = sum(Emissions));
nei.vehicles.total <- nei.vehicles %>% group_by(fips, year) %>% summarize(Annual.Total = sum(Annual.Total)) %>% mutate(SCC.Level.Two = "Total");
nei.vehicles <- bind_rows(nei.vehicles, nei.vehicles.total);
nei.vehicles$SCC.Level.Two <- factor(nei.vehicles$SCC.Level.Two, levels = c("Total", "Highway Vehicles - Diesel", "Highway Vehicles - Gasoline"));
ggplot(nei.vehicles, aes(x = factor(year), y = Annual.Total, fill = SCC.Level.Two)) +
  geom_bar(stat = "identity") +
  facet_grid(fips ~ SCC.Level.Two, scales = "free") + # Setup facets and allow scales to adjust to data in each location
  xlab("Year") +
  ylab(expression("Total Tons of PM"[2.5]*" Emissions")) + 
  ggtitle(expression(atop("Total Tons of PM"[2.5]*" Emissions from Motor Vehicle Sources", paste("in Baltimore City, MD and Los Angeles County, CA")))) +
  theme(plot.title = element_text(hjust = 0.5)) + # Center the plot title
  theme(plot.margin = unit(c(1,1,1,1), "cm")) + # Adjust plot margins
  scale_fill_brewer(palette = "Set1") +
  guides(fill = FALSE)

It’s still not very clear which has a greater change in percentage terms. We can use data.table function called shift to help us calculate percent change between measurement years.

nei.vehicles.DT <- data.table(nei.vehicles)
yoyFunc <- function(x) {x/shift(x)}
yoy.cols <- c("Annual.Total")
nei.vehicles.DT <- nei.vehicles.DT[, paste0("Percent.Change.", yoy.cols) := lapply(.SD, yoyFunc), by = "fips,SCC.Level.Two", .SDcols = yoy.cols]
nei.vehicles.DT <- mutate(nei.vehicles.DT, Percent.Change.Annual.Total = Percent.Change.Annual.Total - 1)
ggplot(nei.vehicles.DT, aes(x = factor(year), y = Percent.Change.Annual.Total, fill = SCC.Level.Two)) +
  geom_bar(stat = "identity") +
  facet_grid(fips ~ SCC.Level.Two) +
  xlab("Year") +
  ylab(expression("% Change From Prior Measurement")) + 
  ggtitle(expression(atop("Percentage Change in Total Tons of PM"[2.5]*" Emissions from Motor Vehicle", paste("Sources in Baltimore City, MD and Los Angeles County, CA")))) +
  theme(plot.title = element_text(hjust = 0.5)) + # Center the plot title
  theme(plot.margin = unit(c(1,1,1,1), "cm")) + # Adjust plot margins
  scale_fill_brewer(palette = "Set1") +
  guides(fill = FALSE)

As we can see from this bar chart, Baltimore City appears to have a more meaningful decline in PM2.5 emissions from motor vehicle sources than Los Angeles County. It is tempting to use use summary to look at the Percent.Change.Annual.Total in each location to find the average Percent.Change.Annual.Total; however, this is not correct. The correct way to do this is to calculate the growth rate, defined as:
\[\left[\left[\frac{\textrm{Ending Quantity - Beginning Quantity}}{\textrm{Number of Years}}\right]\div\textrm{Beginning Quantity}\right]\times\textrm{100}\]
We can do this easily using dplyr.

CAGR.df <- nei.vehicles.DT %>% 
  group_by(fips, SCC.Level.Two) %>% 
  summarize(N.Years = max(year) - min(year), 
            Beginning.Qty = Annual.Total[which(year==min(year))],
            Ending.Qty = Annual.Total[which(year==max(year))],
            CAGR = ((Ending.Qty-Beginning.Qty)/N.Years)/Beginning.Qty);
CAGR.df;

## # A tibble: 6 x 6
## # Groups:   fips [2]
##   fips        SCC.Level.Two            N.Years Beginning.Qty Ending.Qty     CAGR
##   <chr>       <fct>                      <int>         <dbl>      <dbl>    <dbl>
## 1 Baltimore   Total                          9          347.       88.3 -0.0828 
## 2 Baltimore   Highway Vehicles - Dies…       9          216.       40.6 -0.0902 
## 3 Baltimore   Highway Vehicles - Gaso…       9          131.       47.7 -0.0707 
## 4 Los Angeles Total                          9         3931.     4101.   0.00481
## 5 Los Angeles Highway Vehicles - Dies…       9         1639.     2109.   0.0319 
## 6 Los Angeles Highway Vehicles - Gaso…       9         2292.     1992.  -0.0145

As we can see from using summary below, the annual growth rate is overstated by taking a simple mean of the Percent.Change.Annual.Total. First for Baltimore:

summary(nei.vehicles.DT$Percent.Change.Annual.Total[nei.vehicles.DT$fips=="Baltimore"]);

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
## -0.65100 -0.54984 -0.32320 -0.31266 -0.09681  0.03171        3

Now for Los Angeles:

summary(nei.vehicles.DT$Percent.Change.Annual.Total[nei.vehicles.DT$fips=="Los Angeles"]);

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
## -0.17953 -0.10868  0.04257  0.03217  0.08723  0.46027        3

This is why we must use CAGR when evaluating year-over-year growth (or decline).