Homework 6 DACSS 601

Final Project Rough Draft

Steph
2022-07-01

Introduction

Museums are an important part of community, culture, and an educational environment. The American Alliance of Museums (AAM)reports that museums support over 726,000 American jobs and contribute $50 billion to the US economy each year. The economic activity of museums generates over $12 billion in tax revenue, one-third of it going to state and local governments. Each job created by the museum sector results in $16,495 in additional tax revenue. Additionally, multiple studies and surveys about the best places to like in the US incorporate concentrations of accredited museums.

Places that both serve the public and partner with schools, Museums provide an educational environment for entire communities. Museums help teach the state and local curricula, tailoring their programs in math, science, art, literacy, language arts, history, civics and government, economics and financial literacy, geography, and social studies. AAM reports 97% of Americans believe that museums are educational assets for their communities. There are endless benefits to supporting museums in the US. One such way is the compilation and analysis of data to ensure grants can be given, provide tax-exemptions, and gain a better understanding of the locations and museum types that produce the highest revenue.

The following is an analysis of Museums, Aquariums, Zoos, and related organizations in the United State. The data includes information about each organization such as name, address, income, revenue, and discipline type. The discipline type is based on the National Taxonomy of Exempt Entities, which the National Center for Charitable Statistics and IRS use to classify nonprofit organizations. The data set is available on Kaggle and is under public domain. The data was originally compiled in 2014. The museum data was compiled from IMLS administrative records for discretionary grant recipients, IRS records for tax-exempt organizations, and private foundation grant recipients.

The research questions for which this analysis aims to provide insight include:
Question 1: What is the most common type of museum in the United States?
Question 2: Which states have the highest average revenue from museums?
Question 3: What types of museums produce the highest and lowest average revenue?
Question 4: What type of museum had the highest total income in 2014?
Question 5: What is the most common type of museum in each state?
Question 6: How does income relate to revenue among U.S. museums?
Question 7: What is the museum type with the highest total revenue?

Read in Data

First, R packages are loaded to aid in analysis.
To import the data, we can use read_csv to upload the file and assign it the name “mus.”

knitr::opts_chunk$set(echo = TRUE)

library(tidyverse)
library(dplyr)
library(usmap)
library(ggplot2)
library(plotly)
library(mapview)
library(forcats)
library(tigris, options(tigris_refresh=TRUE))
library(tidycensus)
library(colorspace)
library(scales)

#Import dataset
mus <- read_csv("museums.csv")

#View data frame
head(mus)
# A tibble: 6 × 25
  `Museum ID` `Museum Name`              `Legal Name` `Alternate Name`
        <dbl> <chr>                      <chr>        <chr>           
1  8400200098 ALASKA AVIATION HERITAGE … ALASKA AVIA… <NA>            
2  8400200117 ALASKA BOTANICAL GARDEN    ALASKA BOTA… <NA>            
3  8400200153 ALASKA CHALLENGER CENTER … ALASKA CHAL… <NA>            
4  8400200143 ALASKA EDUCATORS HISTORIC… ALASKA EDUC… <NA>            
5  8400200027 ALASKA HERITAGE MUSEUM     ALASKA AVIA… <NA>            
6  8400200096 ALASKA HISTORICAL MUSEUM   ALASKA HIST… <NA>            
# … with 21 more variables: `Museum Type` <chr>,
#   `Institution Name` <chr>,
#   `Street Address (Administrative Location)` <chr>,
#   `City (Administrative Location)` <chr>,
#   `State (Administrative Location)` <chr>,
#   `Zip Code (Administrative Location)` <dbl>,
#   `Street Address (Physical Location)` <chr>, …

Data cleaning

Edit columns

#Determine column names
colnames(mus)
 [1] "Museum ID"                               
 [2] "Museum Name"                             
 [3] "Legal Name"                              
 [4] "Alternate Name"                          
 [5] "Museum Type"                             
 [6] "Institution Name"                        
 [7] "Street Address (Administrative Location)"
 [8] "City (Administrative Location)"          
 [9] "State (Administrative Location)"         
[10] "Zip Code (Administrative Location)"      
[11] "Street Address (Physical Location)"      
[12] "City (Physical Location)"                
[13] "State (Physical Location)"               
[14] "Zip Code (Physical Location)"            
[15] "Phone Number"                            
[16] "Latitude"                                
[17] "Longitude"                               
[18] "Locale Code (NCES)"                      
[19] "County Code (FIPS)"                      
[20] "State Code (FIPS)"                       
[21] "Region Code (AAM)"                       
[22] "Employer ID Number"                      
[23] "Tax Period"                              
[24] "Income"                                  
[25] "Revenue"                                 

Rename

The stock column names include spaces and capital letters. For simplicity, we will amend the the column names to snake case.

#Rename columns
colnames(mus) <- c("museum_id", "museum_name", "legal_name", "alt_name", "museum_type", "inst_name", "admin_address", "admin_city", "admin_state", "admin_zip_code", "physical_address", "physical_city", "physical_state", "physical_zip_code", "phone_number", "latitude", "longitude", "locale_code", "county_code", "state_code", "region_code", "employer_id", "tax_period", "income", "revenue")

Eliminate

The data set includes many lines that are not needed for analysis, such as phone number and county code. We can drop the unneeded columns by selecting the ones we want to keep and then assigning to the dataframe.

keep_col <- c("museum_id","legal_name","museum_type","admin_city","admin_state","income","revenue")
mus <-mus[keep_col]

colnames(mus)
[1] "museum_id"   "legal_name"  "museum_type" "admin_city" 
[5] "admin_state" "income"      "revenue"    

Mutate

Museum type is an import column for our analysis, but it includes some very long character strings. We can edit this by using mutate and choosing new type names.

#Simplify museum type labels
mus <- mus%>%
  mutate(mus_type = case_when(
    (museum_type == "ARBORETUM, BOTANICAL GARDEN, OR NATURE CENTER") ~ "Garden",
    museum_type == "ART MUSEUM" ~ "Art",
    museum_type == "CHILDREN'S MUSEUM" ~ "Children",
    museum_type == "GENERAL MUSEUM" ~ "General",
    museum_type == "HISTORIC PRESERVATION" ~ "Preservation",
    museum_type == "HISTORY MUSEUM" ~ "History",
    museum_type == "NATURAL HISTORY MUSEUM" ~ "Natrual History",
    museum_type == "SCIENCE & TECHNOLOGY MUSEUM OR PLANETARIUM" ~ "Science",
    museum_type == "ZOO, AQUARIUM, OR WILDLIFE CONSERVATION " ~ "Zoo",
  )) 

#Check that museum types were edited
mus %>% 
  select(mus_type)%>%
  slice(1:10)
# A tibble: 10 × 1
   mus_type    
   <chr>       
 1 History     
 2 Garden      
 3 Science     
 4 Preservation
 5 History     
 6 Preservation
 7 General     
 8 Preservation
 9 General     
10 History     

Missing values

The data set has quite a few missing values that will interfere with any analysis. The reason for missing data may be linked to the process of data collection, which was conducted by a third party. It may be that some organizations were reported in a way that excluded pertinent information to this data, but fulfilled the necessary requirements for submission. To rectify the dataset, we can find and omit the rows with missing values.

#Identifying missing values
sum(is.na(mus))
[1] 21457
which(colSums(is.na(mus))>0)
  income  revenue mus_type 
       6        7        8 
#Remove missing values
mus <- na.omit(mus)

#Verify missing values are omitted
sum(is.na(mus))
[1] 0

There were 21457 missing or “NA” values, which were distributed among the income, revenue, and museum type. It could be that income and revenue for some organizations may not have been calculated yet for the year or data collection. The missing museum types could be due to that inability to categorically label some organizations.

Multiple rows of data include income or revenue values less than 0. The negative value data may be due to non-profit public exhibits operating at a loss for the year of 2014. Negative values do not account for a huge portion of our data, but may be erroneous and cannot be verified. Unverified data may interfere with statistical analysis. We can remove those rows by filtering only the data we want to keep.

#Count how many observations include a negative revenue
mus %>% 
  count(income < 0 | revenue < 0)
# A tibble: 2 × 2
  `income < 0 | revenue < 0`     n
  <lgl>                      <int>
1 FALSE                      21958
2 TRUE                          40
#Investigate which rows include negative revenue
mus %>%
  select(mus_type, admin_state, income, revenue) %>% 
  filter(income < 0 | revenue < 0)
# A tibble: 40 × 4
   mus_type     admin_state income revenue
   <chr>        <chr>        <dbl>   <dbl>
 1 Preservation AL           64294   -4221
 2 General      AR           56592   -2579
 3 Preservation AR           24000   -6000
 4 Art          AR           99811 -106157
 5 Preservation CA           60174   -3455
 6 General      CT            3733  -25400
 7 Science      DC           63401   -4997
 8 Science      FL          194240 -598813
 9 Preservation FL           39874   -8973
10 Preservation FL          165338  -14156
# … with 30 more rows
#Removing all rows with income and revenue less than 0
mus <- mus %>%
  filter((income >=0) & (revenue >=0))

#Verify all rows with negative revenue are eliminated
sum(mus$revenue <0, na.rm=TRUE)
[1] 0

Exploration and Visualizations of the Data to answer key questions

Question 1: What is the most common type of museum in the United States?

#To find the most common museum type in the US we essential want the mode, but of a categorical variable
#Let's create a tibble of the museum type counts and frequencies 
mus %>%
  group_by(mus_type) %>%
  summarise(count=n()) %>%
  mutate(frequency = count/sum(count))
# A tibble: 8 × 3
  mus_type        count frequency
  <chr>           <int>     <dbl>
1 Art              1843    0.0839
2 Children          325    0.0148
3 Garden            741    0.0337
4 General          4015    0.183 
5 History          2023    0.0921
6 Natrual History   235    0.0107
7 Preservation    12367    0.563 
8 Science           409    0.0186

As we can see, Preservation, which stands for “HISTORIC PRESERVATION” in the original data, appears to be the most frequent museum type. However, we can also see that it may be skewed data because that category has a much higher value than the others. The relative frequency shows us that Preservations account for over 56% of all museums accounted for in our data. Let’s visualize these categories.

#To Visualize the frequency of museum types, a single categorical variable, we can create a bar chart

mus%>%
  drop_na(mus_type)%>%
  ggplot(mus, mapping = aes(x=mus_type, fill=mus_type))+
  geom_bar()+
  labs (title="Total U.S. Museum Types", fill = "Museum Type")+
  theme(axis.title.x=element_blank(),
        axis.text.x=element_blank(),
      axis.ticks.x=element_blank())

The bar graph visually verifies that (historic) preservations are the most common type of museum in the US. However, it does also reveal that Preservations are disproportionately the most common. The term Historic Preservation refers to a location or building(s) that preserves the history of the area and can often be viewed or explored by the public. That could be due to the fact that there are truly many more historic prevervations, or due to the data collection process, which may have grouped many organizations into one group if they did not fit into other categories.

To investigate our research question slightly further, let us ask a related question. What are the top 3 most common museum types in the U.S.?

#First, we can create a new column with counts of each museum type category
mus$mus_type <- as.character(mus$mus_type)
mus$count <- as.numeric(ave(mus$mus_type, mus$mus_type, FUN = length))

#Then, we can group and arrange to find the top 3
(mt <- mus %>%
  select(count, mus_type)%>%
  group_by(mus_type) %>%
  arrange(desc(count)) %>%
  slice(1)%>%
  ungroup()%>%
  arrange(desc(count))%>%
  slice(1:3))
# A tibble: 3 × 2
  count mus_type    
  <dbl> <chr>       
1 12367 Preservation
2  4015 General     
3  2023 History     
mt%>%
  mutate(mus_type = fct_reorder(mus_type, count)) %>% 
  ggplot(., mapping = aes(x=mus_type, y=count, color=mus_type))+
  geom_bar(stat='identity', fill= "white")+
  labs (x="Museum Type", y = "Number of Museums", title="Top 3 Most Common U.S. Museum Types")+
  theme(legend.position="none")  

This shows us the General and History museum categories are the second and third most common, respectively, in the U.S. after Preservations.

Question 2: Which states have the highest average revenue from museums?

#Find mean revenue of each state
(state_rev <- mus %>%
  group_by(admin_state) %>%
  select(revenue) %>%
  summarize_all(mean, na.rm = TRUE)%>%
  arrange(desc(revenue)))
# A tibble: 51 × 2
   admin_state    revenue
   <chr>            <dbl>
 1 DC          225036990.
 2 MA          172576161.
 3 CT           72599918.
 4 MD           59126876.
 5 DE           47295384.
 6 RI           41564597.
 7 TN           38247536.
 8 GA           33161716.
 9 CA           28167741.
10 IL           27979106.
# … with 41 more rows

The tibble indicated DC, MA, and CT are the locations with the highest revenue in the U.s.. To visualize this, we can create a histogram of the average revenue to see the distribution.

#Create a histogram
ggplot(state_rev, aes(revenue))+
  geom_histogram(position="identity", color = 4, fill = "white", bins = 50)+
  labs (x="Revenue", y = "Density", title="Average Museum Revenues of U.S. States")+
  theme(legend.position="none")

The histogram shows us there are two major outliers. Let’s find out what locations they are and then remove them to visualize the remaining states better.

#Recall the revenues of the top 3 states
state_rev[1:3,]
# A tibble: 3 × 2
  admin_state    revenue
  <chr>            <dbl>
1 DC          225036990.
2 MA          172576161.
3 CT           72599918.

This shows us DC has the highest average revenue for museums, which means it is represented in the farthest right histogram column in our last visual. MA, being the second highest in that category, is the next farthest outlier. Though we should keep these two observations in mind, it will be helpful to remove them for a zoomed-in look at the remaining data. We can use the revenue from our 3rd highest, CT at 72.5 million, to help filter our data for an improved visual.

#Filter only the states with an average revenue of under 75,000,000.
state_rev_fil <- filter(state_rev, revenue < 7.5e+07)

#Remake a histogram with the filtered data
ggplot(state_rev_fil, aes(revenue))+
  geom_histogram(position="identity", color = 2, fill = "orange", bins = 50)+
  labs (x="Revenue", y = "Density", title="Average Museum Revenues of U.S. States")+
  theme(legend.position="none")

That is already much better. But since the data includes locations, it may be best visualized another way. To visualize revenue by state, which is a continuous variable by a discrete, we can create a map.

#Pull and rename state map data
usmap <- map_data("state")

#Create a column to match map data
mus$region <- mus$admin_state

#Join map data with museum data
museum_map <- full_join(mus, usmap, by = "region")

#Create new data frame
state_rev <- as.data.frame(state_rev)

#Create new column names "state"
state_rev$state <- state_rev$admin_state

#Calculate percentiles for defining legend
x <- state_rev_fil$revenue
quantile(x, probs = seq(0, 1, 0.25))
         0%         25%         50%         75%        100% 
   78231.96  2662174.21  5610428.49 13609961.02 72599918.01 
#Create a vector to define legend
rev_breaks <- c(2.6e+06, 5.6e+06, 1.3e+07)

#Create a map
plot_usmap(
  regions = "states", data = state_rev, values = "revenue")+
  theme(panel.background = element_rect(colour = "black"))+
  scale_fill_continuous(low = "white", high ="darkblue", 
                          name = "Mean Revenue",label = scales::comma,
                          limits = c(0,7.5e+07)) + 
  theme(legend.position = "right")+
  labs (title="Average Museum Revenues of U.S. States")

A limitation of this visual lies in out outliers being extracted from the color gradient. As it may be obvious with MA, and less obvious with DC, the top two locations with the highest average revenue are not highlighted in the color scale because their numbers skew the data in a way that obscures the visualization of the revenue in all other states. This map visual does, however, allow us to quickly assess which the distribution of average museum revenue by state.

#Create a column bar graph
ggplot(state_rev, mapping = aes(x = state, y = revenue, color = state))+
  geom_col(width = 0.75, position = position_dodge(0.7), fill= "white")+
  labs (x="State", y = "Revenue", title = "Average Revenue by State")+
  theme (legend.position = "none", axis.text.x = element_text(margin = unit(c(0, 0, 0, 0), "mm"), angle = 90))

Te column bar graph can illustrate state-by-state averages more clearly. We can see that DC and MA are have the highest average revenues at a quick glance.

Question 3: What types of museums produce the highest and lowest average revenue?

#Find mean revenue by museum type
top_rev <- mus %>%
  group_by(mus_type) %>%
  select(revenue, mus_type) %>%
  summarize_all(mean, na.rm = TRUE) %>%
  ungroup() %>%
  arrange(desc(revenue)) %>%
  slice(1:3)

top_rev
# A tibble: 3 × 2
  mus_type           revenue
  <chr>                <dbl>
1 Science         115153455.
2 Art             100842181.
3 Natrual History  85498324.

The museum types with the highest average revenue are Science, Art, and Natural History museums, respectively.

low_rev <- mus %>%
  group_by(mus_type) %>%
  select(revenue, mus_type) %>%
  summarize_all(mean) %>%
  ungroup() %>%
  arrange(revenue) %>%
  slice(1:3)

low_rev
# A tibble: 3 × 2
  mus_type      revenue
  <chr>           <dbl>
1 Children     1355364.
2 Preservation 2038106.
3 History      9430677.

The museum types with the lowest average revenue are Children, Preservation, and History museums, respectively.

#Create a new column of revenue averages by museum type
mus <- mus %>%
  group_by (mus_type) %>%
  mutate (ave_rev = mean(revenue))
  

#Create a boxplot visual
mus%>%
  drop_na (mus_type) %>%
  ggplot (., mapping = aes(x = revenue, y = mus_type))+
  geom_boxplot()+
  labs (title = "Revenue by Museum type", x = "Revenue", y = "Museum Type")+
  theme (legend.position="none")+
  facet_wrap (vars(mus_type), scales = "free_y")

*More work needed to find the best visualization. I am struggling with this one, but still working on it.

Question 4: What type of museum had the highest total income in 2014?

#Create a data frame to work with
type_income <- data.frame(mus %>% 
  select(mus_type, income) %>%
  group_by(mus_type) %>% 
  mutate(total_income = sum(income))%>% #New column with total income
  arrange(desc(total_income)))  #Arrange by highest total income

#Confirm the new object is a data frame and not a tibble
is.data.frame(type_income)
[1] TRUE
#Create a pie chart  
#ggplot(type_income, aes(x="", y = total_income, fill = mus_type)) +
  #geom_bar(stat="identity", width=1) +
  #coord_polar("y", start=0)+
  #labs (x = "", y = "Income", title = "Total Income per Museum Type", legend = "Museum Type")

Art museums had the highest total income in 2014 in the U.S..

Question 5: What is the most common type of museum in each state?

#Find most common museum type by state
state_type <- mus %>% 
    group_by(admin_state) %>% 
    count(mus_type) %>% 
    slice(which.max(n)) %>% 
    select(-n)

#Create data frame
state_type_df <- as.data.frame(state_type)
head(state_type_df)
  admin_state     mus_type
1          AK Preservation
2          AL Preservation
3          AR Preservation
4          AZ Preservation
5          CA Preservation
6          CO Preservation
#Confirm the data includes DC as a state by a count
length(state_type_df$admin_state)
[1] 51
#Count number of states in which Preservation is the most common
length(which(state_type_df$mus_type == "Preservation"))
[1] 49

As expected, given our understanding of the most common type of museum, the Preservation museum type is the most common in all but 2 states.

#Identify the states where Preservation is not the most common
state_type_df %>% 
  select(admin_state, mus_type) %>% 
  filter(mus_type != "Preservation")
  admin_state mus_type
1          DC  General
2          WY  General

Wyoming and DC are the “states” that do not have Preservations are their most common museum. Interestingly, they share the most common type, which is General museums.

Question 6: How does income relate to revenue among U.S. museums?

#Create a scatterplot of museum income vs. revenue.
mus %>% 
  ggplot(.,aes(x = income, y = revenue))+
  geom_point(size = 3, shape = 23)+
  geom_smooth()+
  labs (x="Income", y = "Revenue", title = "U.S. Museum 2014 Income vs. Revenue")

The above scatterplot shows a strong positive, linear association between the museums’ incomes and revenues with a number of potential outliers.

#We can remove the far most outlier to zoom in a bit
#Compute interquartile range to identify outliers
Q1 <- quantile(mus$income, .25)
Q3 <- quantile(mus$income, .75)
(IQR <- IQR(mus$income))
[1] 195713.2
(low_lim = Q1 - 1.5 * IQR)
      25% 
-293569.9 
(up_lim = Q3 + 1.5 * IQR)
     75% 
489283.1 
#Create tibble to view the outliers
outliers <- mus %>% 
  select(mus_type, admin_state, income) %>%
  filter(income > up_lim | income < low_lim) %>% 
  arrange(income)
outliers
# A tibble: 3,823 × 3
# Groups:   mus_type [8]
   mus_type     admin_state income
   <chr>        <chr>        <dbl>
 1 Preservation VA          489376
 2 Children     KS          489569
 3 Preservation NY          489898
 4 General      FL          490033
 5 Preservation ND          490797
 6 Science      LA          492389
 7 General      NY          493409
 8 Preservation PA          493734
 9 Preservation NY          494792
10 Preservation NY          494959
# … with 3,813 more rows
#Calculate percent of data we would drop if we extract outliers
percent((nrow(outliers))/(nrow(mus)), accuracy = 0.01)
[1] "17.41%"
#Remove the outliers
income_fil <- as.data.frame (mus %>% 
  filter(income < up_lim & income > low_lim))

income_fil %>% 
  ggplot(.,aes(x = income, y = revenue))+
  geom_point(size = 2, shape = 23, color = 6)+
  geom_smooth()+
  labs (x="Income", y = "Revenue", title = "Outlier Removed - U.S. Museum Income vs. Revenue")

Extracing the statistical outliers has returned a subset of the data that has a very strong positive correlation between income and revenue.

Excluding our major outlier, we can see there is a strong correlation between income and revenue.

Question 7: What is the museum type with the highest total revenue?

#Create data frame with summarized revenue
total_rev <- as.data.frame(mus %>%
  select(mus_type, admin_state, revenue, admin_state) %>%
  group_by(mus_type) %>%
  mutate(total_revenue = sum(revenue)) %>% 
  arrange(desc(total_revenue))) %>% 
  ungroup()%>%
  arrange(desc(total_revenue))

#Column chart with total revenue by museum type
total_rev %>% 
  mutate(mus_type = fct_reorder(mus_type, total_revenue)) %>%
  ggplot(aes(mus_type, total_revenue, fill = mus_type))+
  geom_col()+
  theme(axis.title.x = element_blank())+
  labs (x="Museum Type", y = "Total Revenue", title = "Total Revenue by of U.S. Museum Types", fill = "Museum Type")+
  theme(axis.text.x=element_blank(),
      axis.ticks.x=element_blank())
#Verify the top three types by revenue
total_rev %>% 
  group_by(mus_type) %>%
  summarize(tot_rev = sum(revenue)) %>% 
  arrange(desc(tot_rev)) %>% 
  slice (1:3)
# A tibble: 3 × 2
  mus_type      tot_rev
  <chr>           <dbl>
1 Art      185852139342
2 General  110410839110
3 Garden    57786384393

The museum type with the highest total revenue from 2014 is Art Museums, followed by General Museums and Gardens (Arboretum, Botanical Garden, or Nature Center).

Reflection

A significant amount of time was spent analyzing question 6 much deeper. Namely, the statistics which were used to identify the outliers was a learning curve. However, the end result satisfies the goal well. Additionally, a substantial amount of time was spent find the best way to adjust the range on the map to properly show the distribution of values after figuring the best way to filter out the top most outlier, which was skewing the original scale. The 2 states whose most common museum is not preservation was identified. Additional notes were made regarding missing values, including theorizing the cause and effect of whether to leave them, as well as adding computations to support the ideas.

The plans going forward include finding a replacement visual for the pie chart in question 4, adjust or replace visual in question 3, and add statistics to question 5. The statistics may include things like frequency and/or prop tables to show frequency of museum type across the country. Then, the data gathered from each question can be summarized in an organized way in the conlusion.

Conclusion

Coming soon…
“Explain what conclusions you can draw from your work, and what questions are still left unanswered.”

Bibliography

R: R Core Team (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.

museum information: https://www.aam-us.org/programs/about-museums/museum-facts-data/

kaggle: https://www.kaggle.com/code/priankravichandar/museums-data-analysis

Course textbook: Grolemund, G., & Wickham, H. (2017). R for Data Science. O’Reilly Media.