Introduction

This report was prepared as the final project for the Data Visualization Capstone course in the Coursera and Johns Hopkins University specialization, Data Visualization and Dashboarding in R. The project requirements are as follows:

The final project is the culmination of everything you have learned in this specialization. Students will create an R Markdown file that includes a series of compelling visualizations.

The final project should contain eight polished and complete data-driven graphics based on the topic chosen by the learner in the portfolio. There should be at least three different types of figure. At least one figure should be either animated or interactive, using gganimate, shiny, or plotly.

The primary emphasis is on the quality of the individual graphics. The most important features of a final published graphic are its clarity in delivering a rich amount of information.

Make sure that your graphics:

  1. Include a title that describes the figure

  2. Have appropriate and clear axis labels and legends

  3. Include a variety of figure types. Use at least 3 figure types (e.g., some combination of box plots, line plots, bar charts, heatmaps, spatial figures, dumbbell charts, etc.).

For this report, I download commercial fishing vessel license data from the State of Alaska’s Commercial Fisheries Entry Commission (covering 1978-2021), combine it into a single data set, remove non-US home port locations (where the vessel is stored), and fix known errors in the state data (where Alaska is abbreviated AL instead of AK). Behind the scenes, my R code checks to see if the data folder already has the cleaned data and, if not, performs all of this processing. With this file check, the data processing only needs to occur on the first knit.
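The report does not show the processing code, so the following is only a minimal sketch of the file-check pattern described above. The file path, the column names (`year`, `home_port_state`), and the toy rows are all hypothetical stand-ins, and the blanket AL-to-AK recode is an assumption that would also relabel any genuine Alabama records.

```r
# Hypothetical location for the cleaned file; the report's real path is not shown.
clean_path <- file.path(tempdir(), "vessels_clean.csv")

# Illustrative stand-in for the combined raw license files (not real data).
raw <- data.frame(
  year            = c(2020, 2020, 2021, 2021),
  home_port_state = c("AL", "WA", "AK", "BC"),
  stringsAsFactors = FALSE
)

clean_vessels <- function(df) {
  # Fix the known error where Alaska was entered as AL instead of AK.
  # Assumption: every AL in this data really means Alaska.
  df$home_port_state[df$home_port_state == "AL"] <- "AK"
  # Keep only US home ports, here taken to be the 50 states in base R's state.abb.
  df[df$home_port_state %in% state.abb, , drop = FALSE]
}

if (file.exists(clean_path)) {
  # Later knits: reuse the cleaned file.
  vessels <- read.csv(clean_path, stringsAsFactors = FALSE)
} else {
  # First knit: do the processing once and cache the result.
  vessels <- clean_vessels(raw)
  write.csv(vessels, clean_path, row.names = FALSE)
}
```

Caching the cleaned file keeps repeated knits fast, at the cost of having to delete the file by hand whenever the cleaning rules change.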

This report includes 8 figures covering 7 figure types.

Each figure was formatted to match the others as closely as possible, though the packages used imposed some limitations. Please note that ggplotly and gganimate override some options in ggplot2, which makes it harder to produce consistent figures. This is a lesson learned, and a good reason not to vary the packages you use too much. Figure 4 is also an exception: its shading was selected to work with the color scheme, but no other figure uses those colors. The figures use papayawhip for the panel background, tan4 for the main data color, and tan2 as a secondary color and fill. The tan color scheme is intended to have a feel similar to that of some business publications.

This report was generated in R Markdown using RStudio. The project allowed students to publish either a report on RPubs or a Shiny app. I chose a traditional report format, since it is more consistent with the types of visualizations I produce in my work. For this assignment, I created a new look for the graphics rather than reuse the code I’ve been developing for my work.

Vessel Counts, Ownership, and Home Ports

The first set of figures looks at vessel counts, ownership, and home ports. A vessel’s home port is technically where the vessel is moored when not in use, though sometimes a symbolic home port is reported.

Number of vessels over time, by owner residency state

Figure 1 looks at vessel participation over time for the top 9 states by vessel owner residency. As we see in the figure, Alaska residents owned the most vessels, followed by Washington, Oregon, and California residents.

Tech Talk: Behind the scenes, I summarize the data to get the top 9 states by owner residency, filter the data set to only include those states, and then plot the data using ggplot and a facet_wrap. I used a free_y scale since otherwise Alaska would dwarf the other states.
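As a rough sketch of the summarize, filter, and facet steps, assuming dplyr and ggplot2 and a hypothetical data frame with one row per licensed vessel per year (the columns `year` and `owner_state` and the random toy data are illustrative, and the line geometry is a guess, not the report's actual code):

```r
library(dplyr)
library(ggplot2)

# Toy stand-in data: one row per licensed vessel per year (hypothetical columns).
set.seed(1)
vessels <- data.frame(
  year        = sample(1978:2021, 2000, replace = TRUE),
  owner_state = sample(state.abb[1:12], 2000, replace = TRUE),
  stringsAsFactors = FALSE
)

# Step 1: find the 9 states with the most vessel records.
top_states <- vessels %>%
  count(owner_state, sort = TRUE) %>%
  head(9) %>%
  pull(owner_state)

# Step 2: keep only those states and count vessels per year and state.
yearly <- vessels %>%
  filter(owner_state %in% top_states) %>%
  count(year, owner_state, name = "vessels")

# Step 3: one panel per state; free_y gives each panel its own y-axis range
# so the largest state does not flatten the others.
p <- ggplot(yearly, aes(x = year, y = vessels)) +
  geom_line() +
  facet_wrap(~ owner_state, scales = "free_y") +
  labs(title = "Vessels over time by owner state (toy data)",
       x = "Year", y = "Number of vessels")
```

The trade-off with `scales = "free_y"` is that panel heights are no longer comparable across states, so readers need to check each panel's axis.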

Number of vessels over time, by home port state

Figure 2 looks at vessel participation over time for the top 9 states by home port. As we see in the figure, most vessels were home ported in Alaska, followed by Washington. In contrast to ownership, home port locations drop off faster with distance, since cruising to and from Alaska in a vessel is more costly than flying crew in a plane.

Tech Talk: Behind the scenes, I summarize the data to get the top 9 states by home port, filter the data set to only include those states, and then plot the data using ggplot and a facet_wrap. I used a free_y scale since otherwise Alaska would dwarf the other states.

Comparison of ownership and home port states, 2021

Figure 3 is a bubble chart “matrix” that compares the owner’s residency with the vessel’s home port. As we see in the figure, most vessels were home ported in Alaska, regardless of residency, followed by Washington. The line of bubbles along the bottom represents vessels home ported in Alaska but owned anywhere, and the line of bubbles just below the top represents vessels home ported in Washington but owned anywhere. This is an interactive chart; hovering over the points will tell you the owner state, home port state, and vessel count.

Tech Talk: Behind the scenes, I change variable names a bit so that they look nice in the interactive chart and then plot the data using ggplot.

Map of owners by state, 2021

Figure 4 is a map of ownership by state. The lack of geographical diversity is quite apparent!

Tech Talk: Behind the scenes, I summarize the data by state, add in missing states with values of 0 (since they otherwise would not appear in the map), and plot the map using geom_map via the fiftystater package. This was the simplest option for creating a U.S. map that would include Alaska and Hawaii.
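The zero-filling step can be sketched in base R as follows. The column names and counts are illustrative only, and the map-drawing call itself is left out because the report does not show it.

```r
# Illustrative per-state owner counts (not real figures); states with no
# owners are simply absent, as they would be after summarizing.
owners <- data.frame(
  state   = c("AK", "WA", "OR"),
  vessels = c(120, 45, 10),
  stringsAsFactors = FALSE
)

# Left-join onto the full list of 50 state abbreviations from base R.
all_states <- data.frame(state = state.abb, stringsAsFactors = FALSE)
full <- merge(all_states, owners, by = "state", all.x = TRUE)

# States missing from the summary come back as NA; treat them as zero
# so every state still gets a fill color on the map.
full$vessels[is.na(full$vessels)] <- 0
```

Without this step, states that are missing from the summary have no row to join to the map polygons and would be dropped or left unfilled.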

Vessel Age

The next chart looks at the age of commercial fishing vessels operating in Alaska.

Aging of the fleet over time

Figure 5 is a ridgeline plot of vessel age distribution over time. The fairly consistent ridges running from the bottom left to the top right show that there has been little retirement of older vessels or entry of new vessels; vessels have mostly just aged over time. There was some reduction in age in the early years of the chart (1978 to about 1990), along with some entry of younger vessels in 1990. There were a few other newer entrants around 2000 and 2010-2020, but those were in the minority.

Tech Talk: Behind the scenes, I filter out vessels with missing year_built, calculate the age, and remove any ages that were 0 or lower. I then plot the data using the ggridges package.
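A small sketch of the age calculation and ridgeline plot, assuming age is the license year minus `year_built`. The `year` column name and the toy rows are hypothetical, and the plot uses `geom_density_ridges()` from ggridges with default settings rather than whatever options the report used.

```r
library(ggplot2)
library(ggridges)

# Toy stand-in data (hypothetical values).
vessels <- data.frame(
  year       = c(1980, 1980, 1980, 2000, 2000, 2000, 2020, 2020, 2020, 2020),
  year_built = c(1960, 1975,   NA, 1975, 1990, 2000, 1975, 1990, 2015, 2021)
)

# Drop rows with a missing build year, then compute age in the license year.
aged <- vessels[!is.na(vessels$year_built), ]
aged$age <- aged$year - aged$year_built

# Remove ages of 0 or lower (build year equal to or after the license year).
aged <- aged[aged$age > 0, ]

# One ridge per license year, showing the distribution of vessel age.
p <- ggplot(aged, aes(x = age, y = factor(year))) +
  geom_density_ridges() +
  labs(title = "Vessel age by year (toy data)",
       x = "Vessel age (years)", y = "Year")
```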

Vessel Characteristics

The final set of charts looks at other vessel characteristics.

Vessel lengths over time

Figure 6 is an animated plot that combines a histogram and a density plot of vessel length over time from 1978 to 2021. The figure shows two common vessel lengths emerging over time, with 32-foot vessels dominating after 2000.

Tech Talk: Behind the scenes, I filter out vessels with lengths of 1 or lower, plot the data using geom_histogram and geom_density, and animate it by cycling through the years using gganimate.

Vessel size by owner location

Figure 7 looks at vessel size (length) distribution by the owner’s residence, using 2021 data. It is interesting to note that the median size is virtually identical (at 32 feet) but there are differences in the distribution by state. Washington owners have the largest vessels, while Alaska residents have more vessels up to the 58-foot limit for salmon seiners.

Tech Talk: Behind the scenes, I create a list of states with at least 10 vessel owners that I use to filter the data. This reduces the data to 13 states, which makes for a busy but readable chart. I use geom_jitter first, followed by geom_boxplot, to keep the boxes clear but show the underlying distribution. I turned off the outliers on the boxplot.
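The filter-then-layer approach might look roughly like the sketch below. The column names and toy data are hypothetical, and hiding outliers with `outlier.shape = NA` and making the boxes semi-transparent are assumptions about how to get the described effect, not the report's confirmed settings.

```r
library(dplyr)
library(ggplot2)

# Toy 2021 data: one row per vessel (hypothetical columns and values).
set.seed(42)
v2021 <- data.frame(
  owner_state = c(rep("AK", 40), rep("WA", 15), rep("OR", 10), rep("MT", 3)),
  length_ft   = round(runif(68, min = 18, max = 58)),
  stringsAsFactors = FALSE
)

# States with at least 10 vessel owners.
keep_states <- v2021 %>%
  count(owner_state) %>%
  filter(n >= 10) %>%
  pull(owner_state)

plot_data <- filter(v2021, owner_state %in% keep_states)

# Jittered points first, box plot drawn on top. The jitter layer already
# shows every vessel, so the box plot's own outlier points are suppressed
# to avoid plotting those vessels twice.
p <- ggplot(plot_data, aes(x = owner_state, y = length_ft)) +
  geom_jitter(width = 0.2, alpha = 0.4) +
  geom_boxplot(outlier.shape = NA, alpha = 0.5) +
  labs(title = "Vessel length by owner state (toy data)",
       x = "Owner state", y = "Length (feet)")
```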

Vessel engine characteristics

Figure 8, our final figure, looks at the relationship between vessel horsepower and fuel capacity, also adding in the engine type. Diesel engines are more prevalent and for the most part produce more horsepower than gasoline engines. Diesel engines also have higher fuel capacity in general. The two regression lines on the chart show the relationship between horsepower and fuel capacity by engine type. This is an interactive chart; hovering over a point will tell you its horsepower, fuel capacity, and engine type, and hovering over a line will show the estimated values along it.

Tech Talk: Behind the scenes, I filter out 0 or 1 values for horsepower and fuel capacity, rename variables for better display in the interactive chart, plot the data using geom_point and geom_smooth, and display it with ggplotly.
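A hedged sketch of these four steps, using ggplot2 and plotly. The column names, the toy values, the choice of horsepower on the x-axis, and the use of a linear fit (`method = "lm"`) for the regression lines are all assumptions, since the report does not show them.

```r
library(ggplot2)
library(plotly)

# Toy stand-in data (hypothetical values); 0 and 1 act as placeholder entries.
engines <- data.frame(
  horsepower    = c(0, 1, 150, 300, 450, 90, 115, 200, 600, 250, 150),
  fuel_capacity = c(100, 50, 200, 400, 800, 0, 60, 90, 1200, 1, 75),
  engine_type   = c("Diesel", "Gas", "Diesel", "Diesel", "Diesel", "Gas",
                    "Gas", "Gas", "Diesel", "Gas", "Gas"),
  stringsAsFactors = FALSE
)

# Drop rows where either measure is a 0 or 1 placeholder.
clean <- engines[!(engines$horsepower %in% c(0, 1)) &
                 !(engines$fuel_capacity %in% c(0, 1)), ]

# Rename columns so hover labels read more naturally.
names(clean) <- c("Horsepower", "Fuel Capacity", "Engine Type")

# Scatter plot with one fitted line per engine type (linear fit assumed).
p <- ggplot(clean, aes(x = Horsepower, y = `Fuel Capacity`,
                       color = `Engine Type`)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(title = "Fuel capacity vs. horsepower (toy data)")

# Convert the static ggplot into an interactive plotly widget.
fig <- ggplotly(p)
```

Printing `fig` in an R Markdown chunk renders the interactive version in HTML output; printing `p` gives the static equivalent.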

References

There are too many references to count, but the following sites were helpful for some more specialized aspects of my code.