There are thousands of museums, aquariums, and zoos across the United States. In this project, I’ll take a look at the distribution of these institutions by geographic region, type, and revenue.
My data was compiled from administrative records from the Institute of Museum and Library Services, IRS records, and private foundation grantmaking records. The data reflects the status of each institution as of 2013. For each institution, I had information on its name, type, and location. Each institution also has a parent organization: for example, if a museum is housed at a university, its parent organization is that university. Financial data on annual revenue was available at the parent organization level.
library(dplyr)
library(ggplot2)
library(stringr)
library(tidyr)
library(plotrix)
library(readr)
I started by loading my dataset, a file named museums.csv, into a data frame named museums_df.
# Load file as data frame
museums_df <- read_csv('museums.csv')
# Inspect data frame
head(museums_df, 10)
## # A tibble: 10 × 8
## Museum.Name Legal…¹ Museu…² State…³ Regio…⁴ Is.Mu…⁵ Tax.Y…⁶ Annua…⁷
## <chr> <chr> <chr> <chr> <dbl> <lgl> <dbl> <dbl>
## 1 ALASKA AVIATION HERI… ALASKA… HISTOR… AK 6 TRUE 2013 1100472
## 2 ALASKA BOTANICAL GAR… ALASKA… ARBORE… AK 6 FALSE 2013 1323742
## 3 ALASKA CHALLENGER CE… ALASKA… SCIENC… AK 6 TRUE 2013 729080
## 4 ALASKA HERITAGE MUSE… ALASKA… HISTOR… AK 6 TRUE 2013 1100472
## 5 ALASKA JEWISH MUSEUM ALASKA… GENERA… AK 6 TRUE 2013 68748
## 6 ALASKA LIGHTHOUSE AS… ALASKA… HISTOR… AK 6 FALSE 2013 16500
## 7 ALASKA SPORTS HALL O… ALASKA… GENERA… AK 6 TRUE 2013 615273
## 8 ALASKA STATE MUSEUM FRIEND… GENERA… AK 6 TRUE 2013 129192
## 9 ALASKA TROOPER MUSEUM FRATER… GENERA… AK 6 TRUE 2013 221807
## 10 ALASKA WILDLIFE CONS… ALASKA… ARBORE… AK 6 FALSE 2013 2793744
## # … with abbreviated variable names ¹Legal.Name, ²Museum.Type,
## # ³State..Administrative.Location., ⁴Region.Code..AAM., ⁵Is.Museum,
## # ⁶Tax.Year, ⁷Annual.Revenue
colnames(museums_df)
## [1] "Museum.Name" "Legal.Name"
## [3] "Museum.Type" "State..Administrative.Location."
## [5] "Region.Code..AAM." "Is.Museum"
## [7] "Tax.Year" "Annual.Revenue"
The Museum.Name column represents the name of each individual institution, while the Legal.Name column represents the name of each institution’s parent entity.
In this section, I explored the distribution of institutions in my data set by type. My data frame contained a column called Museum.Type describing what kind of museum each location is – a history museum, a zoo, an aquarium, etc.
I created and printed a bar plot called museum_type that maps Museum.Type to the x axis and counts the frequency of each type on the y axis.
Which category is most common?
# Create and print bar plot by type
museum_type <-
ggplot(data = museums_df, aes(x = Museum.Type)) +
geom_bar()
museum_type
The plot was hard to read because the category names were long. I added a scale_x_discrete() layer to customize my x axis, using the function scales::wrap_format(8) to reformat my labels.
wrap_format() is a function from the scales package, which is installed as a dependency of ggplot2. By passing 8 to wrap_format(), I was telling it to wrap each label at word boundaries, aiming for lines no more than 8 characters wide. Words longer than that are left unbroken.
museum_type <-
ggplot(museums_df, aes(x = Museum.Type)) +
geom_bar() +
scale_x_discrete(
labels = scales::wrap_format(8))
museum_type
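To see what the labeller does outside of a plot, it can be called directly on a character vector. The snippet below is only a small sketch: the label text is a made-up example rather than a value taken from the dataset, and wrap_labels is a name introduced here for illustration.

```r
library(scales)

# wrap_format(8) returns a function; calling that function on a label
# inserts line breaks at word boundaries (words themselves are not split).
wrap_labels <- wrap_format(8)
wrapped <- wrap_labels("NATURAL HISTORY MUSEUM")  # hypothetical label

# cat() prints the embedded line breaks as actual new lines
cat(wrapped, "\n")
```

This is the same function that scale_x_discrete() applies to every x axis label when it is passed as the labels argument.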
There is a boolean (TRUE or FALSE) column in the data frame called Is.Museum. The TRUE category includes typical museums like art, history, and science museums. The FALSE category includes zoos, aquariums, nature preserves, and historic sites, which are included in this data but aren’t what most people think of when they hear the word “museum.”
I created a new bar plot called museum_class, mapping Is.Museum to the x axis. Since “TRUE” and “FALSE” weren’t very descriptive, I used scale_x_discrete() to rename the x axis labels to more easily understood terms: “Museum” vs “Non-Museum”.
# Create and print bar plot by museum vs non-museum
museum_class <-
ggplot(museums_df, aes(x = Is.Museum)) +
geom_bar() +
scale_x_discrete(
labels = c(
"TRUE" = "Museum",
"FALSE" = "Non-Museum"))
museum_class
Instead of looking at the distribution across the entire United States, I filtered museums_df to include a few states I was interested in, using the State..Administrative.Location. column; I chose IL, CA, and NY. I called this filtered data frame museums_states.
After creating museums_states, I recreated the bar plot showing the distribution of museums vs non-museums and I used facet_grid() to display each state’s distribution in a separate panel. I called this plot museum_facet.
How does the distribution of museums vs non-museums vary across the chosen states?
# Filter data frame to select states
museums_states <- museums_df %>%
filter(
# %in% matches the state codes exactly instead of as a regex pattern
State..Administrative.Location. %in% c('IL', 'CA', 'NY'))
# Create and print bar plot with facets
museum_facet <-
ggplot(museums_states, aes(x = Is.Museum)) +
geom_bar() +
scale_x_discrete(
labels = c(
"TRUE" = "Museum",
"FALSE" = "Non-Museum")) +
facet_grid(
cols = vars(State..Administrative.Location.))
museum_facet
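The counts behind each facet can also be tabulated directly, which makes the comparison across states easier to read off. The following is a minimal sketch on a toy data frame; toy_df, toy_counts, and all of their values are made up for illustration, and the filter uses %in% for exact matching of state codes.

```r
library(dplyr)

# Toy stand-in for museums_df (values are invented, not from the dataset)
toy_df <- data.frame(
  State..Administrative.Location. = c("IL", "IL", "CA", "NY", "NY", "TX"),
  Is.Museum = c(TRUE, FALSE, TRUE, TRUE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Keep the three states of interest, then count rows per state and class
toy_counts <- toy_df %>%
  filter(State..Administrative.Location. %in% c("IL", "CA", "NY")) %>%
  count(State..Administrative.Location., Is.Museum)

toy_counts
```

Each row of toy_counts corresponds to one bar in one facet panel; applying the same pipeline to museums_df should give the bar heights shown in museum_facet.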
The data also contains information on each museum’s region, where each region represents a group of states. I created a stacked bar plot using museums_df showing the count of museums by region (Region.Code..AAM.), mapping Is.Museum to the fill aesthetic. I converted Region.Code..AAM. to a factor (e.g. factor(Region.Code..AAM.)) so that ggplot2 plots its levels as discrete rather than continuous values. I called this plot museum_stacked.
museum_stacked <-
ggplot(museums_df,
aes(
x = factor(Region.Code..AAM.),
fill = Is.Museum)) +
geom_bar()
museum_stacked
The plot was hard to read because it was not clear what the region numbers corresponded to. I used scale_x_discrete() to rename the numeric labels to text according to the following table.
| Code | Region |
|---|---|
| 1 | New England |
| 2 | Mid-Atlantic |
| 3 | Southeastern |
| 4 | Midwest |
| 5 | Mountain Plains |
| 6 | Western |
Similarly, I added a scale_fill_discrete() layer to relabel the “TRUE” and “FALSE” labels in the legend to “Museum” and “Non-Museum”.
museum_stacked <-
ggplot(museums_df,
aes(
x = factor(Region.Code..AAM.),
fill = Is.Museum)) +
geom_bar() +
scale_x_discrete(
labels = c(
"1"="New England",
"2"="Mid-Atlantic",
"3"="Southeastern",
"4"="Midwest",
"5"="Mountain Plains",
"6"="Western")) +
scale_fill_discrete(
labels = c(
"TRUE" = "Museum",
"FALSE" = "Non-Museum"))
museum_stacked
I transformed the plot to a stacked bar plot showing values out of 100% by passing position = "fill" to the geom_bar() layer. I applied the scales::percent_format() function to transform the y axis labels into percentage values.
museum_stacked <-
ggplot(museums_df,
aes(
x = factor(Region.Code..AAM.),
fill = Is.Museum)) +
geom_bar(position = "fill") +
scale_x_discrete(
labels = c(
"1"="New England",
"2"="Mid-Atlantic",
"3"="Southeastern",
"4"="Midwest",
"5"="Mountain Plains",
"6"="Western")) +
scale_fill_discrete(
labels = c(
"TRUE" = "Museum",
"FALSE" = "Non-Museum")) +
scale_y_continuous(
labels = scales::percent_format())
museum_stacked
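With position = "fill", each bar is rescaled so that its segments show within-region proportions rather than counts. The same proportions can be computed numerically; the sketch below does this on a toy data frame, where toy_df, toy_shares, the Museum.Share column, and all values are hypothetical.

```r
library(dplyr)

# Toy stand-in for museums_df (values are invented, not from the dataset)
toy_df <- data.frame(
  Region.Code..AAM. = c(1, 1, 1, 1, 2, 2),
  Is.Museum = c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE)
)

# The mean of a logical vector is the proportion of TRUE values,
# so this gives the share of museums within each region.
toy_shares <- toy_df %>%
  group_by(Region.Code..AAM.) %>%
  summarize(Museum.Share = mean(Is.Museum))

toy_shares
```

In the filled plot, Museum.Share would be the height of the TRUE segment for each region, and one minus that value the height of the FALSE segment.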
The graph looked pretty good at this point! However, the axis titles were not very descriptive. Using the labs() layer, I added the title “Museum Types by Region” and relabeled the x axis title as “Region”, the y axis title as “Percentage of Total”, and the fill legend title as “Type”.
museum_stacked <-
ggplot(museums_df,
aes(
x = factor(Region.Code..AAM.),
fill = Is.Museum)) +
geom_bar(position = "fill") +
scale_x_discrete(
labels = c(
"1"="New England",
"2"="Mid-Atlantic",
"3"="Southeastern",
"4"="Midwest",
"5"="Mountain Plains",
"6"="Western")) +
scale_fill_discrete(
labels = c(
"TRUE" = "Museum",
"FALSE" = "Non-Museum")) +
scale_y_continuous(
labels = scales::percent_format()) +
labs(
title = "Museum Types by Region",
x = "Region",
y = "Percentage of Total",
fill = "Type"
)
museum_stacked
## Data Exploration
Explore Institutions by Revenue
In this part I looked at how much money each institution brought in and how that varied across geographies. Because I only had revenue data at the parent organization level, I first filtered the data set so that each parent organization appeared only once. Next, I created a few data frames from the starting data to look at different groups of museums by how much money they brought in.
I created a new data frame called museums_revenue_df that retains only unique values of Legal.Name in museums_df. Additionally, I filtered this data frame to include only entities with Annual.Revenue greater than 0.
I created a second data frame from museums_revenue_df called museums_small_df that retains only museums with Annual.Revenue less than $1,000,000.
I created a third data frame from museums_revenue_df called museums_large_df that retains only museums with Annual.Revenue greater than $1,000,000,000.
# Filter data frame
# I included the argument .keep_all = TRUE so that all columns are retained, rather than just the one column I wanted to check uniqueness for.
museums_revenue_df <- museums_df %>%
distinct(Legal.Name, .keep_all = TRUE) %>%
filter(Annual.Revenue > 0)
# Filter for only small museums
museums_small_df <- museums_revenue_df %>%
filter(Annual.Revenue < 1e6)
# Filter for only large museums
museums_large_df <- museums_revenue_df %>%
filter(Annual.Revenue > 1e9)
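To make the de-duplication step concrete, here is a small sketch of how distinct() with .keep_all = TRUE behaves when two institutions share a parent organization. The data frame toy_df and every name and value in it are made up for illustration.

```r
library(dplyr)

# Two institutions share one parent, so the parent's revenue appears twice
toy_df <- data.frame(
  Museum.Name = c("ART GALLERY", "SCIENCE CENTER", "TOWN MUSEUM"),
  Legal.Name = c("STATE UNIVERSITY", "STATE UNIVERSITY",
                 "TOWN HISTORICAL SOCIETY"),
  Annual.Revenue = c(5e6, 5e6, 0),
  stringsAsFactors = FALSE
)

# distinct() keeps the first row for each Legal.Name;
# filter() then drops parents with no positive revenue.
toy_revenue <- toy_df %>%
  distinct(Legal.Name, .keep_all = TRUE) %>%
  filter(Annual.Revenue > 0)

toy_revenue
```

Note that the surviving row carries the Museum.Name of whichever institution came first, so after this step Museum.Name should not be read as describing the whole parent organization.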
In this part I visualized the distribution of annual revenue for the small museums data set. I created a histogram called revenue_histogram using museums_small_df with Annual.Revenue mapped to the x axis.
revenue_histogram <-
ggplot(museums_small_df,
aes(x = Annual.Revenue)) +
geom_histogram(binwidth = 1e4)
revenue_histogram
The x axis was a little hard to read, so I added a scale_x_continuous() layer applying the function scales::dollar_format() to the x axis labels. dollar_format() is a function from the scales package, which is installed alongside ggplot2, that adds dollar signs and commas to monetary data.
revenue_histogram <-
ggplot(museums_small_df,
aes(x = Annual.Revenue)) +
geom_histogram(binwidth = 1e4) +
scale_x_continuous(
labels = scales::dollar_format())
revenue_histogram
Then I looked at the variation in revenue for large museums by region. I created a boxplot called revenue_boxplot using museums_large_df, mapping Region.Code..AAM. to the x axis and Annual.Revenue to the y axis. I converted Region.Code..AAM. to a factor so ggplot2 plots its levels as discrete rather than continuous values. I used scale_x_discrete() to rename the numeric region codes to their text equivalents.
revenue_boxplot <-
ggplot(museums_large_df,
aes(
x = factor(Region.Code..AAM.),
y = Annual.Revenue)) +
geom_boxplot() +
scale_x_discrete(
labels = c(
"1"="New England",
"2"="Mid-Atlantic",
"3"="Southeastern",
"4"="Midwest",
"5"="Mountain Plains",
"6"="Western"))
revenue_boxplot
The plot was a little hard to read, since one outlier sat far above all the other data points. This one institution brought in a lot of money in 2013! So I zoomed in to see the rest of the boxes more clearly. I added a coord_cartesian() layer, setting ylim to c(1e9, 3e10), which zooms the y axis to the range between $1,000,000,000 and $30,000,000,000. Unlike setting limits on the scale, coord_cartesian() does not drop any data, so the boxes are still computed from all of the points.
revenue_boxplot <-
ggplot(museums_large_df,
aes(
x = factor(Region.Code..AAM.),
y = Annual.Revenue)) +
geom_boxplot() +
scale_x_discrete(
labels = c(
"1"="New England",
"2"="Mid-Atlantic",
"3"="Southeastern",
"4"="Midwest",
"5"="Mountain Plains",
"6"="Western")) +
coord_cartesian(ylim = c(1e9, 3e10))
revenue_boxplot
Though I could see the distribution by region more clearly now, the y axis labels were hard to understand. I reformatted the y axis labels as billions of dollars.
revenue_boxplot <-
ggplot(museums_large_df,
aes(
x = factor(Region.Code..AAM.),
y = Annual.Revenue)) +
geom_boxplot() +
scale_x_discrete(
labels = c(
"1"="New England",
"2"="Mid-Atlantic",
"3"="Southeastern",
"4"="Midwest",
"5"="Mountain Plains",
"6"="Western")) +
coord_cartesian(ylim = c(1e9, 3e10)) +
scale_y_continuous(
labels = function(x) paste0("$", x/1e9, "B"))
revenue_boxplot
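The labels argument accepts any function that takes the numeric break values and returns one string per break. As a quick check of the formatter used above, the same logic can be pulled out into a named function and called on a few values. The name billions_label is hypothetical and introduced only for this sketch.

```r
# Same labelling logic as the anonymous function in the plot,
# given a name so it can be tested outside of ggplot2.
billions_label <- function(x) paste0("$", x / 1e9, "B")

labels <- billions_label(c(1e9, 2.5e9, 3e10))
labels
# → "$1B" "$2.5B" "$30B"
```

Defining the formatter once like this also means it could be reused across several plots instead of being repeated inline.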
Next I looked at revenue across all museums in the dataset. Using museums_revenue_df, I created a bar plot called revenue_barplot mapping Region.Code..AAM. to the x axis and Annual.Revenue to the y axis.
I used stat = "summary" and fun = "mean" to calculate and display the mean revenue by region. I applied the same x and y axis label transformations as before to make the labels clearer.
Once again, I used the labs() layer to make the plot clearer. I titled the plot “Mean Annual Revenue by Region” and relabeled the y axis title to “Mean Annual Revenue” and the x axis title to “Region”.
revenue_barplot <-
ggplot(museums_revenue_df,
aes(
x = factor(Region.Code..AAM.),
y = Annual.Revenue)) +
geom_bar(stat = "summary", fun = "mean") +
scale_y_continuous(
labels = scales::dollar_format()) +
scale_x_discrete(
labels = c(
"1"="New England",
"2"="Mid-Atlantic",
"3"="Southeastern",
"4"="Midwest",
"5"="Mountain Plains",
"6"="Western")) +
labs(
title = "Mean Annual Revenue by Region",
y = "Mean Annual Revenue",
x = "Region")
revenue_barplot
Finally, I added some error bars to the means. I needed to calculate standard errors before creating the plot.
I added error bars to the mean revenue by region bar plot using the geom_errorbar() layer.
I called this new plot revenue_errorbar. The new y variable is the calculated Mean.Revenue column.
I changed the stat to "identity", since I was displaying the precomputed means as they were rather than calculating them while building the plot.
# Calculate means and standard errors
museums_error_df <- museums_revenue_df %>%
group_by(Region.Code..AAM.) %>%
summarize(
Mean.Revenue = mean(Annual.Revenue),
Mean.SE = std.error(Annual.Revenue)) %>%
mutate(
SE.Min = Mean.Revenue - Mean.SE,
SE.Max = Mean.Revenue + Mean.SE)
# Create and print bar plot with means and standard errors
revenue_errorbar <-
ggplot(museums_error_df,
aes(
x = factor(Region.Code..AAM.),
y = Mean.Revenue)) +
geom_bar(stat = "identity") +
geom_errorbar(
aes(ymin = SE.Min, ymax=SE.Max), width=0.2) +
scale_y_continuous(
labels = scales::dollar_format()) +
scale_x_discrete(
labels = c(
"1"="New England",
"2"="Mid-Atlantic",
"3"="Southeastern",
"4"="Midwest",
"5"="Mountain Plains",
"6"="Western")) +
labs(
title = "Mean Annual Revenue by Region",
y = "Mean Annual Revenue",
x = "Region")
revenue_errorbar
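The error bars above span one standard error of the mean on either side of each bar. As a rough sketch of what that quantity is, for a vector with no missing values the standard error of the mean is the sample standard deviation divided by the square root of the sample size. The helper se_mean() and the values below are hypothetical and use base R only; they are not taken from the dataset or from plotrix.

```r
# Standard error of the mean for a vector without missing values
se_mean <- function(x) sd(x) / sqrt(length(x))

revenues <- c(2, 4, 4, 4, 5, 5, 7, 9)  # made-up values for illustration
m <- mean(revenues)
se <- se_mean(revenues)

# Lower and upper ends of the error bar, as in SE.Min and SE.Max
bounds <- c(lower = m - se, upper = m + se)
bounds
```

Because the standard error shrinks as the group size grows, regions with few parent organizations or with extreme outliers will tend to show wider error bars.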