Homework 5 DACSS 601

Ongoing statistical analysis of Museums

Steph
2022-06-28

Homework goals

Introduction

Museums are an important part of community, culture, and an educational environment. The American Alliance of Museums (AAM)reports that museums support over 726,000 American jobs and contribute $50 billion to the US economy each year. The economic activity of museums generates over $12 billion in tax revenue, one-third of it going to state and local governments. Each job created by the museum sector results in $16,495 in additional tax revenue. Additionally, multiple studies and surveys about the best places to like in the US incorporate concentrations of accredited museums.

Places that both serve the public and partner with schools, Museums provide an educational environment for entire communities. Museums help teach the state and local curricula, tailoring their programs in math, science, art, literacy, language arts, history, civics and government, economics and financial literacy, geography, and social studies. AAM reports 97% of Americans believe that museums are educational assets for their communities. There are endless benefits to supporting museums in the US. One such way is the compilation and analysis of data to ensure grants can be given, provide tax-exemptions, and gain a better understanding of the locations and museum types that produce the highest revenue.

The following is an analysis of Museums, Aquariums, Zoos, and related organizations in the United State. The data includes information about each organization such as name, address, income, revenue, and discipline type. The discipline type is based on the National Taxonomy of Exempt Entities, which the National Center for Charitable Statistics and IRS use to classify nonprofit organizations. The data set is available on Kaggle and is under public domain. The data was originally compiled in 2014. The museum data was compiled from IMLS administrative records for discretionary grant recipients, IRS records for tax-exempt organizations, and private foundation grant recipients.

The research questions for which this analysis aims to provide insight include:
Question 1: What is the most common type of museum in the United States?
Question 2: Which states have the highest average revenue from museums?
Question 3: What types of museums produce the highest and lowest average revenue?
Question 4: What type of museum had the highest total income in 2014?
Question 5: What is the most common type of museum in each state?
Question 6: How does income relate to revenue among U.S. museums?
Question 7: What is the museum type with the highest total revenue?

Read in Data

First, R packages are loaded to aid in analysis.
To import the data, we can use read_csv to upload the file and assign it the name “mus.”

knitr::opts_chunk$set(echo = TRUE)

library(tidyverse)
library(dplyr)
library(usmap)
library(ggplot2)
library(plotly)
library(mapview)
library(forcats)
library(tigris, options(tigris_refresh=TRUE))
library(tidycensus)

#Import dataset
mus <- read_csv("museums.csv")

#View data frame
head(mus)
# A tibble: 6 × 25
  `Museum ID` `Museum Name`              `Legal Name` `Alternate Name`
        <dbl> <chr>                      <chr>        <chr>           
1  8400200098 ALASKA AVIATION HERITAGE … ALASKA AVIA… <NA>            
2  8400200117 ALASKA BOTANICAL GARDEN    ALASKA BOTA… <NA>            
3  8400200153 ALASKA CHALLENGER CENTER … ALASKA CHAL… <NA>            
4  8400200143 ALASKA EDUCATORS HISTORIC… ALASKA EDUC… <NA>            
5  8400200027 ALASKA HERITAGE MUSEUM     ALASKA AVIA… <NA>            
6  8400200096 ALASKA HISTORICAL MUSEUM   ALASKA HIST… <NA>            
# … with 21 more variables: `Museum Type` <chr>,
#   `Institution Name` <chr>,
#   `Street Address (Administrative Location)` <chr>,
#   `City (Administrative Location)` <chr>,
#   `State (Administrative Location)` <chr>,
#   `Zip Code (Administrative Location)` <dbl>,
#   `Street Address (Physical Location)` <chr>, …

Data cleaning

Edit columns

#Determine column names
colnames(mus)
 [1] "Museum ID"                               
 [2] "Museum Name"                             
 [3] "Legal Name"                              
 [4] "Alternate Name"                          
 [5] "Museum Type"                             
 [6] "Institution Name"                        
 [7] "Street Address (Administrative Location)"
 [8] "City (Administrative Location)"          
 [9] "State (Administrative Location)"         
[10] "Zip Code (Administrative Location)"      
[11] "Street Address (Physical Location)"      
[12] "City (Physical Location)"                
[13] "State (Physical Location)"               
[14] "Zip Code (Physical Location)"            
[15] "Phone Number"                            
[16] "Latitude"                                
[17] "Longitude"                               
[18] "Locale Code (NCES)"                      
[19] "County Code (FIPS)"                      
[20] "State Code (FIPS)"                       
[21] "Region Code (AAM)"                       
[22] "Employer ID Number"                      
[23] "Tax Period"                              
[24] "Income"                                  
[25] "Revenue"                                 

Rename

The stock column names include spaces and capital letters. For simplicity, we will amend the the column names to snake case.

#Rename columns
colnames(mus) <- c("museum_id", "museum_name", "legal_name", "alt_name", "museum_type", "inst_name", "admin_address", "admin_city", "admin_state", "admin_zip_code", "physical_address", "physical_city", "physical_state", "physical_zip_code", "phone_number", "latitude", "longitude", "locale_code", "county_code", "state_code", "region_code", "employer_id", "tax_period", "income", "revenue")

Eliminate

The data set includes many lines that are not needed for analysis, such as phone number and county code. We can drop the unneeded columns by selecting the ones we want to keep and then assigning to the dataframe.

keep_col <- c("museum_id","legal_name","museum_type","admin_city","admin_state","income","revenue")
mus <-mus[keep_col]

colnames(mus)
[1] "museum_id"   "legal_name"  "museum_type" "admin_city" 
[5] "admin_state" "income"      "revenue"    

Mutate

Museum type is an import column for our analysis, but it includes some very long character strings. We can edit this by using mutate and choosing new type names.

#Simplify museum type labels
mus <- mus%>%
  mutate(mus_type = case_when(
    (museum_type == "ARBORETUM, BOTANICAL GARDEN, OR NATURE CENTER") ~ "Garden",
    museum_type == "ART MUSEUM" ~ "Art",
    museum_type == "CHILDREN'S MUSEUM" ~ "Children",
    museum_type == "GENERAL MUSEUM" ~ "General",
    museum_type == "HISTORIC PRESERVATION" ~ "Preservation",
    museum_type == "HISTORY MUSEUM" ~ "History",
    museum_type == "NATURAL HISTORY MUSEUM" ~ "Natrual History",
    museum_type == "SCIENCE & TECHNOLOGY MUSEUM OR PLANETARIUM" ~ "Science",
    museum_type == "ZOO, AQUARIUM, OR WILDLIFE CONSERVATION " ~ "Zoo",
  )) 

#Check that museum types were edited
mus %>% 
  select(mus_type)%>%
  slice(1:10)
# A tibble: 10 × 1
   mus_type    
   <chr>       
 1 History     
 2 Garden      
 3 Science     
 4 Preservation
 5 History     
 6 Preservation
 7 General     
 8 Preservation
 9 General     
10 History     

Missing values

The data set has quite a few missing values that will interfere with any analysis. To rectify that, we can find and omit the rows with missing values.

#Identifying missing values
sum(is.na(mus))
[1] 21457
#Remove missing values
mus <- na.omit(mus)

#Verify missing values are omitted
sum(is.na(mus))
[1] 0

Multiple rows of data include income or revenue values less than 0. The negative value data may be erroneous, or at least interfere with statistical analysis. We can remove those rows by filtering only the data we want to keep.

#Removing all rows with income and revenue less than 0
mus <- mus %>%
  filter((income >=0) & (revenue >=0))

#Verify all rows with negative revenue are eliminated
sum(mus$revenue <0, na.rm=TRUE)
[1] 0

Exploration and Visualizations of the Data to answer key questions

Question 1: What is the most common type of museum in the United States?

#To find the most common museum type in the US we essential want the mode, but of a categorical variable
#Let's create a tibble of the museum type counts and frequencies 
mus %>%
  group_by(mus_type) %>%
  summarise(count=n()) %>%
  mutate(frequency = count/sum(count))
# A tibble: 8 × 3
  mus_type        count frequency
  <chr>           <int>     <dbl>
1 Art              1843    0.0839
2 Children          325    0.0148
3 Garden            741    0.0337
4 General          4015    0.183 
5 History          2023    0.0921
6 Natrual History   235    0.0107
7 Preservation    12367    0.563 
8 Science           409    0.0186

As we can see, Preservation, which stands for “HISTORIC PRESERVATION” in the original data, appears to be the most frequent museum type. However, we can also see that it may be skewed data because that category has a much higher value than the others. The relative frequency shows us that Preservations account for over 56% of all museums accounted for in our data. Let’s visualize these categories.

#To Visualize the frequency of museum types, a single categorical variable, we can create a bar chart

mus%>%
  drop_na(mus_type)%>%
  ggplot(mus, mapping = aes(x=mus_type, fill=mus_type))+
  geom_bar()+
  labs (title="Total U.S. Museum Types", fill = "Museum Type")+
  theme(axis.title.x=element_blank(),
        axis.text.x=element_blank(),
      axis.ticks.x=element_blank())

The bar graph visually verifies that (historic) preservations are the most common type of museum in the US. However, it does also reveal that Preservations are disproportionately the most common. The term Historic Preservation refers to a location or building(s) that preserves the history of the area and can often be viewed or explored by the public. That could be due to the fact that there are truly many more historic prevervations, or due to the data collection process, which may have grouped many organizations into one group if they did not fit into other categories.

To investigate our research question slightly further, let us ask a related question. What are the top 3 most common museum types in the U.S.?

#First, we can create a new column with counts of each museum type category
mus$mus_type <- as.character(mus$mus_type)
mus$count <- as.numeric(ave(mus$mus_type, mus$mus_type, FUN = length))

#Then, we can group and arrange to find the top 3
(mt <- mus %>%
  select(count, mus_type)%>%
  group_by(mus_type) %>%
  arrange(desc(count)) %>%
  slice(1)%>%
  ungroup()%>%
  arrange(desc(count))%>%
  slice(1:3))
# A tibble: 3 × 2
  count mus_type    
  <dbl> <chr>       
1 12367 Preservation
2  4015 General     
3  2023 History     
mt%>%
  mutate(mus_type = fct_reorder(mus_type, count)) %>% 
  ggplot(., mapping = aes(x=mus_type, y=count, color=mus_type))+
  geom_bar(stat='identity', fill= "white")+
  labs (x="Museum Type", y = "Number of Museums", title="Top 3 Most Common U.S. Museum Types")+
  theme(legend.position="none")  

This shows us the General and History museum categories are the second and third most common, respectively, in the U.S. after Preservations.

Question 2: Which states have the highest average revenue from museums?

#Find mean revenue of each state
(state_rev <- mus %>%
  group_by(admin_state) %>%
  select(revenue) %>%
  summarize_all(mean, na.rm = TRUE)%>%
  arrange(desc(revenue)))
# A tibble: 51 × 2
   admin_state    revenue
   <chr>            <dbl>
 1 DC          225036990.
 2 MA          172576161.
 3 CT           72599918.
 4 MD           59126876.
 5 DE           47295384.
 6 RI           41564597.
 7 TN           38247536.
 8 GA           33161716.
 9 CA           28167741.
10 IL           27979106.
# … with 41 more rows

The tibble indicated DC, MA, and CT are the locations with the highest revenue in the U.s.. To visualize revenue by state, which is a continuous variable by a discrete, we can create a map.

#Pull and rename state map data
usmap <- map_data("state")

#Create a column to match map data
mus$region <- mus$admin_state

#Join map data with museum data
museum_map <- full_join(mus, usmap, by = "region")

#Create new data frame
state_rev <- as.data.frame(state_rev)

#Create new column names "state"
state_rev$state <- state_rev$admin_state

#Create a map
plot_usmap(
  regions = "states", data = state_rev, values = "revenue")+
  scale_fill_continuous(name= "US Museum Revenue in 2014", label = scales::comma)+
  theme(legend.position = "right")

A limitation of this visual is the disproportionate sizes of the states. Particularly with DC being the top of our list and the smallest possible area to be highlighted on this map. You can clearly see that MA is toward the top of this list, however, with CT and MD not far behind.

#Create a column bar graph
ggplot(state_rev, mapping = aes(x = state, y = revenue, color = state))+
  geom_col(width = 0.75, position = position_dodge(0.7), fill= "white")+
  labs (x="State", y = "Revenue", title = "Average Revenue by State")+
  theme (legend.position = "none", axis.text.x = element_text(margin = unit(c(0, 0, 0, 0), "mm"), angle = 90))

Te column bar graph can illustrate state-by-state averages more clearly. We can see that DC and MA are have the highest average revenues at a quick glance.

Question 3: What types of museums produce the highest and lowest average revenue?

#Find mean revenue by museum type
top_rev <- mus %>%
  group_by(mus_type) %>%
  select(revenue, mus_type) %>%
  summarize_all(mean, na.rm = TRUE) %>%
  ungroup() %>%
  arrange(desc(revenue)) %>%
  slice(1:3)

top_rev
# A tibble: 3 × 2
  mus_type           revenue
  <chr>                <dbl>
1 Science         115153455.
2 Art             100842181.
3 Natrual History  85498324.

The museum types with the highest average revenue are Science, Art, and Natural History museums, respectively.

low_rev <- mus %>%
  group_by(mus_type) %>%
  select(revenue, mus_type) %>%
  summarize_all(mean) %>%
  ungroup() %>%
  arrange(revenue) %>%
  slice(1:3)

low_rev
# A tibble: 3 × 2
  mus_type      revenue
  <chr>           <dbl>
1 Children     1355364.
2 Preservation 2038106.
3 History      9430677.

The museum types with the lowest average revenue are Children, Preservation, and History museums, respectively.

#Create a new column of revenue averages by museum type
mus <- mus %>%
  group_by (mus_type) %>%
  mutate (ave_rev = mean(revenue))
  

#Create a boxplot visual
mus%>%
  drop_na (mus_type) %>%
  ggplot (., mapping = aes(x = revenue, y = mus_type))+
  geom_boxplot()+
  labs (title = "Revenue by Museum type", x = "Revenue", y = "Museum Type")+
  theme (legend.position="none")+
  facet_wrap (vars(mus_type), scales = "free_y")

*More work needed to find the best visualization. I am struggling with this one, but still working on it.

Question 4: What type of museum had the highest total income in 2014?

#Create a data frame to work with
type_income <- data.frame(mus %>% 
  select(mus_type, income) %>%
  group_by(mus_type) %>% 
  mutate(total_income = sum(income))%>% #New column with total income
  arrange(desc(total_income)))  #Arrange by highest total income

#Confirm the new object is a data frame and not a tibble
is.data.frame(type_income)
[1] TRUE
#Create a pie chart  
ggplot(type_income, aes(x="", y = total_income, fill = mus_type)) +
  geom_bar(stat="identity", width=1) +
  coord_polar("y", start=0)+
  labs (x = "", y = "Income", title = "Total Income per Museum Type", legend = "Museum Type")

Art museums had the highest total income in 2014 in the U.S..

Question 5: What is the most common type of museum in each state?

#Find most common museum type by state
state_type <- mus %>% 
    group_by(admin_state) %>% 
    count(mus_type) %>% 
    slice(which.max(n)) %>% 
    select(-n)

#Create data frame
state_type_df <- as.data.frame(state_type)
head(state_type_df)
  admin_state     mus_type
1          AK Preservation
2          AL Preservation
3          AR Preservation
4          AZ Preservation
5          CA Preservation
6          CO Preservation
#Confirm the data includes DC as a state by a count
length(state_type_df$admin_state)
[1] 51
#Count number of states in which Preservation is the most common
length(which(state_type_df$mus_type == "Preservation"))
[1] 49

As expected, given our understanding of the most common type of museum, the Preservation museum type is the most common in all but 2 states.

Question 6: How does income relate to revenue among U.S. museums?

#Create a scatterplot of museum income vs. revenue.
mus %>% 
  ggplot(.,aes(x = income, y = revenue))+
  geom_point(size = 3, shape = 23)+
  geom_smooth()+
  labs (x="Income", y = "Revenue", title = "U.S. Museum 2014 Income vs. Revenue")

The above scatterplot shows a strong positive, linear association between the museums’ incomes and revenues with a few potential outliers.

#We can remove the far most outlier to zoom in a bit
scatter_plot_crop <- mus %>% 
  ggplot(.,aes(x = income, y = revenue))
scatter_plot_crop+
  geom_point(size = 3, shape = 23, outlier.shape = NA)+
  coord_cartesian(ylim =  c(0,5300000000))+
  geom_smooth()+
  labs (x="Income", y = "Revenue", title = "Outlier Removed - U.S. Museum Income vs. Revenue")

Excluding our major outlier, we can see there is a strong correlation between income and revenue.

Question 7: What is the museum type with the highest total revenue?

#Create data frame with summarized revenue
total_rev <- as.data.frame(mus %>%
  select(mus_type, admin_state, revenue, admin_state) %>%
  group_by(mus_type) %>%
  mutate(total_revenue = sum(revenue)) %>% 
  arrange(desc(total_revenue))) %>% 
  ungroup()%>%
  arrange(desc(total_revenue))

#Column chart with total revenue by museum type
total_rev %>% 
  mutate(mus_type = fct_reorder(mus_type, total_revenue)) %>%
  ggplot(aes(mus_type, total_revenue, fill = mus_type))+
  geom_col()+
  theme(axis.title.x = element_blank())+
  labs (x="Museum Type", y = "Total Revenue", title = "Total Revenue by of U.S. Museum Types", fill = "Museum Type")+
  theme(axis.text.x=element_blank(),
      axis.ticks.x=element_blank())
#Verify the top three types by revenue
total_rev %>% 
  group_by(mus_type) %>%
  summarize(tot_rev = sum(revenue)) %>% 
  arrange(desc(tot_rev)) %>% 
  slice (1:3)
# A tibble: 3 × 2
  mus_type      tot_rev
  <chr>           <dbl>
1 Art      185852139342
2 General  110410839110
3 Garden    57786384393

The museum type with the highest total revenue from 2014 is Art Museums, followed by General Museums and Gardens (Arboretum, Botanical Garden, or Nature Center).

Reflection

The research conclusions have been added as the analysis and visualizations are presented. The naive reader will be able to follow along by the notes and labels added to each code and visual. There are two questions I would like to answer but have yet to accomplish; 1. Which US cities have more than 100 museums?, and 2. What type of museum has more museums that are accessible for free? Question 1 is likely possible to answer with the given data. The limiting factor is my coding skill. I will continue trying. Question 2 may not be possible to answer given the dataset. The data could be used to count how many museums of each type has an income or revenue of zero. However, that might not indicate free admission. Also, some organizations may have to include donations as income, which would unnecessarily exclude them from that analysis.

Coming soon…
“Describe your process/experience with the project, including the decisions you made, what was most challenging, and what you wish you would have known. You can also discuss what the next steps would be, were you to continue with the project.”

Conclusion

Coming soon…
“Explain what conclusions you can draw from your work, and what questions are still left unanswered.”

Bibliography

Coming soon…
“Include all relevant citations (Hint, at minimum you should cite the source of the dataset, R as a programming language, and the course textbook).”

museum information: https://www.aam-us.org/programs/about-museums/museum-facts-data/
kaggle: https://www.kaggle.com/code/priankravichandar/museums-data-analysis
R: R Core Team (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.
Course textbook: Grolemund, G., & Wickham, H. (2017). R for Data Science. O’Reilly Media.