Cleaning and Analyzing Data on Museums in the U.S.

Final Project DACSS 601

Steph Roberts
2022-07-03

Introduction

Museums are an important part of community, culture, and an educational environment. The American Alliance of Museums (AAM) reports that museums support over 726,000 American jobs and contribute $50 billion to the US economy each year. The economic activity of museums generates over $12 billion in tax revenue, one-third of it going to state and local governments. Each job created by the museum sector results in $16,495 in additional tax revenue. Additionally, multiple studies and surveys about the best places to like in the US incorporate concentrations of accredited museums.

Places that both serve the public and partner with schools, Museums provide an educational environment for entire communities. Museums help teach the state and local curricula, tailoring their programs in math, science, art, literacy, language arts, history, civics and government, economics and financial literacy, geography, and social studies. AAM reports 97% of Americans believe that museums are educational assets for their communities. There are endless benefits to supporting museums in the US. One such way is the compilation and analysis of data to ensure grants can be given, provide tax-exemptions, and gain a better understanding of the locations and museum types that produce the highest revenue.

The following is an analysis of Museums, Aquariums, Zoos, and related organizations in the United State. The data includes information about each organization such as name, address, income, revenue, and discipline type. The discipline type is based on the National Taxonomy of Exempt Entities, which the National Center for Charitable Statistics and IRS use to classify nonprofit organizations. The data set is available on Kaggle and is under public domain. The data was originally compiled in 2014. The museum data was compiled from IMLS administrative records for discretionary grant recipients, IRS records for tax-exempt organizations, and private foundation grant recipients.

The research questions for which this analysis aims to provide insight include:
Question 1: What is the most common type of museum in the United States?
Question 2: Which states have the highest average revenue from museums?
Question 3: What types of museums produce the highest and lowest average revenue?
Question 4: What type of museum had the highest total income in 2014?
Question 5: What is the most common type of museum in each state?
Question 6: How does income relate to revenue among U.S. museums?
Question 7: What is the museum type with the highest total revenue?

Read in Data

First, R packages are loaded to aid in analysis.
To import the data, we can use read_csv to upload the file and assign it the name “mus.”

knitr::opts_chunk$set(echo = TRUE)

library(tidyverse)
library(dplyr)
library(usmap)
library(ggplot2)
library(plotly)
library(mapview)
library(forcats)
library(tigris, options(tigris_refresh=TRUE))
library(tidycensus)
library(colorspace)
library(scales)

#Import dataset
mus <- read_csv("museums.csv")

#View data frame
head(mus)
# A tibble: 6 × 25
  `Museum ID` `Museum Name`              `Legal Name` `Alternate Name`
        <dbl> <chr>                      <chr>        <chr>           
1  8400200098 ALASKA AVIATION HERITAGE … ALASKA AVIA… <NA>            
2  8400200117 ALASKA BOTANICAL GARDEN    ALASKA BOTA… <NA>            
3  8400200153 ALASKA CHALLENGER CENTER … ALASKA CHAL… <NA>            
4  8400200143 ALASKA EDUCATORS HISTORIC… ALASKA EDUC… <NA>            
5  8400200027 ALASKA HERITAGE MUSEUM     ALASKA AVIA… <NA>            
6  8400200096 ALASKA HISTORICAL MUSEUM   ALASKA HIST… <NA>            
# … with 21 more variables: `Museum Type` <chr>,
#   `Institution Name` <chr>,
#   `Street Address (Administrative Location)` <chr>,
#   `City (Administrative Location)` <chr>,
#   `State (Administrative Location)` <chr>,
#   `Zip Code (Administrative Location)` <dbl>,
#   `Street Address (Physical Location)` <chr>, …

Data cleaning

Edit columns

#Determine column names
colnames(mus)
 [1] "Museum ID"                               
 [2] "Museum Name"                             
 [3] "Legal Name"                              
 [4] "Alternate Name"                          
 [5] "Museum Type"                             
 [6] "Institution Name"                        
 [7] "Street Address (Administrative Location)"
 [8] "City (Administrative Location)"          
 [9] "State (Administrative Location)"         
[10] "Zip Code (Administrative Location)"      
[11] "Street Address (Physical Location)"      
[12] "City (Physical Location)"                
[13] "State (Physical Location)"               
[14] "Zip Code (Physical Location)"            
[15] "Phone Number"                            
[16] "Latitude"                                
[17] "Longitude"                               
[18] "Locale Code (NCES)"                      
[19] "County Code (FIPS)"                      
[20] "State Code (FIPS)"                       
[21] "Region Code (AAM)"                       
[22] "Employer ID Number"                      
[23] "Tax Period"                              
[24] "Income"                                  
[25] "Revenue"                                 

Rename

The stock column names include spaces and capital letters. For simplicity, we will amend the the column names to snake case.

#Rename columns
colnames(mus) <- c("museum_id", "museum_name", "legal_name", "alt_name", "museum_type", "inst_name", "admin_address", "admin_city", "admin_state", "admin_zip_code", "physical_address", "physical_city", "physical_state", "physical_zip_code", "phone_number", "latitude", "longitude", "locale_code", "county_code", "state_code", "region_code", "employer_id", "tax_period", "income", "revenue")

Eliminate

The data set includes many lines that are not needed for analysis, such as phone number and county code. We can drop the unneeded columns by selecting the ones we want to keep and then assigning to the dataframe.

keep_col <- c("museum_id","legal_name","museum_type","admin_city","admin_state","income","revenue")
mus <-mus[keep_col]

colnames(mus)
[1] "museum_id"   "legal_name"  "museum_type" "admin_city" 
[5] "admin_state" "income"      "revenue"    

Mutate

Museum type is an import column for our analysis, but it includes some very long character strings. We can edit this by using mutate and choosing new type names.

#Simplify museum type labels
mus <- mus%>%
  mutate(mus_type = case_when(
    (museum_type == "ARBORETUM, BOTANICAL GARDEN, OR NATURE CENTER") ~ "Garden",
    museum_type == "ART MUSEUM" ~ "Art",
    museum_type == "CHILDREN'S MUSEUM" ~ "Children",
    museum_type == "GENERAL MUSEUM" ~ "General",
    museum_type == "HISTORIC PRESERVATION" ~ "Preservation",
    museum_type == "HISTORY MUSEUM" ~ "History",
    museum_type == "NATURAL HISTORY MUSEUM" ~ "Natrual History",
    museum_type == "SCIENCE & TECHNOLOGY MUSEUM OR PLANETARIUM" ~ "Science",
    museum_type == "ZOO, AQUARIUM, OR WILDLIFE CONSERVATION " ~ "Zoo",
  )) 

#Check that museum types were edited
mus %>% 
  select(mus_type)%>%
  slice(1:10)
# A tibble: 10 × 1
   mus_type    
   <chr>       
 1 History     
 2 Garden      
 3 Science     
 4 Preservation
 5 History     
 6 Preservation
 7 General     
 8 Preservation
 9 General     
10 History     

During analysis of question 3, it was discovered that there are duplicate entries in our data. Before removing duplicates, Science was the museum with the highest average revenue, but only because the observation with its highest revenue was repeated 3 times. To ensure we are only analying one row for each museum, we can use eliminate duplicates.

#Remove duplicate rows
mus <- mus %>% 
  distinct(legal_name, .keep_all = TRUE)

#Verify the max revenue for science now only appears 1 time rather than 4
mus %>% 
  filter(mus_type == "Science") %>% 
  filter(revenue == max(revenue, na.rm = TRUE)) %>% 
  select(mus_type, revenue, legal_name)
# A tibble: 1 × 3
  mus_type    revenue legal_name                              
  <chr>         <dbl> <chr>                                   
1 Science  5840349457 PRESIDENT AND FELLOWS OF HARVARD COLLEGE
mus %>% 
  count(mus_type == "Science" & revenue == "5840349457")
# A tibble: 3 × 2
  `mus_type == "Science" & revenue == "5840349457"`     n
  <lgl>                                             <int>
1 FALSE                                             29616
2 TRUE                                                  1
3 NA                                                  813

Now the data frame contains only one observation for each museum.

Missing values

The data set has quite a few missing values that will interfere with any analysis. The reason for missing data may be linked to the process of data collection, which was conducted by a third party. It may be that some organizations were reported in a way that excluded pertinent information to this data, but fulfilled the necessary requirements for submission. To rectify the dataset, we can find and omit the rows with missing values.

#Identifying missing values
sum(is.na(mus))
[1] 20251
# A tibble: 1 × 8
  museum_id legal_name museum_type admin_city admin_state income
      <int>      <int>       <int>      <int>       <int>  <int>
1         0          0           0          0           0   9533
# … with 2 more variables: revenue <int>, mus_type <int>
#Remove missing values
mus <- mus %>% 
  drop_na(income, revenue, museum_type)

#Verify missing values are omitted
sum(is.na(mus))
[1] 287

There were 20251 missing or “NA” values, which were distributed among the income and revenue columns. It could be that income and revenue for some organizations may not have been calculated yet for the year or data collection.

Multiple rows of data include income or revenue values less than 0. The negative value data may be erroneous or due to non-profit public exhibits operating at a loss for the year of 2014. Negative values do not account for a huge portion of our data, but may be erroneous and cannot be verified. Unverified data may interfere with statistical analysis.

#Count number of rows in data frame
(row <- nrow(mus))
[1] 20263
#Count how many observations include a negative revenue
mus %>% 
  count(income < 0 | revenue < 0)
# A tibble: 2 × 2
  `income < 0 | revenue < 0`     n
  <lgl>                      <int>
1 FALSE                      20227
2 TRUE                          36
#Investigate which rows include negative revenue
mus %>%
  select(mus_type, admin_state, income, revenue) %>% 
  filter(income < 0 | revenue < 0)
# A tibble: 36 × 4
   mus_type     admin_state income revenue
   <chr>        <chr>        <dbl>   <dbl>
 1 Preservation AL           64294   -4221
 2 General      AR           56592   -2579
 3 Preservation AR           24000   -6000
 4 Art          AR           99811 -106157
 5 Preservation CA           60174   -3455
 6 General      CT            3733  -25400
 7 Science      DC           63401   -4997
 8 Science      FL          194240 -598813
 9 Preservation FL           39874   -8973
10 Preservation FL          165338  -14156
# … with 26 more rows

There are 36 observations where the revenue is listed as a negative number. We can remove those rows by filtering only the data we want to keep and creating a separate dataframe to work with for comparison purposes.

#Removing all rows with income and revenue less than 0
mus2 <- mus %>%
  filter((income >=0) & (revenue >=0))

#Verify all rows with negative revenue are eliminated
sum(mus2$revenue <0 | mus2$income <0, na.rm=TRUE)
[1] 0
#Verify only 36 rows were removed
row2 <- nrow(mus2)
row-row2
[1] 36

We now have a data frame names “mus” that has cleaned data and another dataframe called “mus2” that has the same data, but with no observations with negative income or revenue.

Exploration and Visualizations of the Data to answer key questions

Question 1: What is the most common type of museum in the United States?

#To find the most common museum type in the US we essential want the mode, but of a categorical variable
#Let's create a tibble of the museum type counts and frequencies 

mus %>%
  group_by(mus_type) %>%
  summarise(count=n()) %>%
  mutate(frequency = count/sum(count)) %>% 
  arrange(desc(count))
# A tibble: 9 × 3
  mus_type        count frequency
  <chr>           <int>     <dbl>
1 Preservation    11316    0.558 
2 General          3662    0.181 
3 History          1880    0.0928
4 Art              1565    0.0772
5 Garden            686    0.0339
6 Science           352    0.0174
7 Children          310    0.0153
8 <NA>              287    0.0142
9 Natrual History   205    0.0101

As we can see, Preservation, which stands for “HISTORIC PRESERVATION” in the original data, appears to be the most frequent museum type. The relative frequency shows us that Preservations account for over 56% of all museums accounted for in our data. Let’s visualize these categories.

#Let's check the same stats on the data frame that excludes negative income/revenue
mus2 %>%
  group_by(mus_type) %>%
  summarise(count=n()) %>%
  mutate(frequency = count/sum(count)) %>% 
  arrange(desc(count))
# A tibble: 9 × 3
  mus_type        count frequency
  <chr>           <int>     <dbl>
1 Preservation    11305    0.559 
2 General          3648    0.180 
3 History          1874    0.0926
4 Art              1564    0.0773
5 Garden            685    0.0339
6 Science           349    0.0173
7 Children          310    0.0153
8 <NA>              287    0.0142
9 Natrual History   205    0.0101

The frequency of museum type is barely affected by the removal of the negative numbers. Preservation museuems remains roughly 56% of the data. Perhaps removing the negative values is unnecessary if it is a small number of observations and the values may be true data. We can explore this more further on.

#To Visualize the frequency of museum types, a single categorical variable, we can create a bar chart
mus%>%
  drop_na(mus_type)%>%
  ggplot(mus, mapping = aes(x=mus_type, fill=mus_type))+
  geom_bar()+
  labs (title="Total U.S. Museum Types", fill = "Museum Type")+
  theme(axis.title.x=element_blank(),
        axis.text.x=element_blank(),
      axis.ticks.x=element_blank())

The bar graph visually verifies that (historic) preservations are the most common type of museum in the US. However, it does also reveal that Preservations are disproportionately the most common. The term Historic Preservation refers to a location or building(s) that preserves the history of the area and can often be viewed or explored by the public. That could be due to the fact that there are truly many more historic prevervations, or due to the data collection process, which may have grouped many organizations into one group if they did not fit into other categories.

To investigate our research question slightly further, let us ask a related question. What are the top 3 most common museum types in the U.S.?

#First, we can create a new column with counts of each museum type category
mus$mus_type <- as.character(mus$mus_type)
mus$count <- as.numeric(ave(mus$mus_type, mus$mus_type, FUN = length))

#Then, we can group and arrange to find the top 3
(mt <- mus %>%
  select(count, mus_type)%>%
  group_by(mus_type) %>%
  arrange(desc(count)) %>%
  slice(1)%>%
  ungroup()%>%
  arrange(desc(count))%>%
  slice(1:3))
# A tibble: 3 × 2
  count mus_type    
  <dbl> <chr>       
1 11316 Preservation
2  3662 General     
3  1880 History     
mt%>%
  mutate(mus_type = fct_reorder(mus_type, count)) %>% 
  ggplot(., mapping = aes(x=mus_type, y=count, color=mus_type))+
  geom_bar(stat='identity', fill= "white")+
  labs (x="Museum Type", y = "Number of Museums", title="Top 3 Most Common U.S. Museum Types")+
  theme(legend.position="none")  

This shows us the General and History museum categories are the second and third most common, respectively, in the U.S. after Preservations.

Question 2: Which states have the highest average revenue from museums?

#Find mean revenue of each state
(state_rev <- mus %>%
  group_by(admin_state) %>%
  select(revenue) %>%
  summarize_all(mean, na.rm = TRUE)%>%
  arrange(desc(revenue)))
# A tibble: 51 × 2
   admin_state   revenue
   <chr>           <dbl>
 1 DC          90133011.
 2 AZ          25296830.
 3 MD          18554123.
 4 RI          16997959.
 5 TN          16456641.
 6 CT          14957386.
 7 NY          14839249.
 8 DE          14606788.
 9 IL          14208026.
10 CA          13510308.
# … with 41 more rows

The tibble indicates DC, AZ, and MD are the locations with the highest revenue in the U.s.. To visualize this, we can create a histogram of the average revenue to see the distribution. Will the removal of negative values in “mus2” drastically affect the average revenue by states?

#Investigate mus2 with this question
mus2 %>%
  group_by(admin_state) %>%
  select(revenue) %>%
  summarize_all(mean, na.rm = TRUE)%>%
  arrange(desc(revenue))
# A tibble: 51 × 2
   admin_state   revenue
   <chr>           <dbl>
 1 DC          91134544.
 2 AZ          25296830.
 3 MD          18604679.
 4 RI          16997959.
 5 TN          16456641.
 6 CT          14999591.
 7 NY          14882038.
 8 DE          14606788.
 9 IL          14225965.
10 CA          13518775.
# … with 41 more rows

As noted earlier, the absence of negative income and revenue did not affected the revenue average distributions among states in any significant way. From this point on, we will stick with our original cleaned data and use only “mus” as our data frame.

#Create a histogram
ggplot(state_rev, aes(revenue))+
  geom_histogram(position="identity", color = 4, fill = "white", bins = 50)+
  labs (x="Revenue", y = "Density", title="Average Museum Revenues of U.S. States")+
  theme(legend.position="none")

The histogram shows us there are two major outliers. Let’s find out what locations they are and then remove them to visualize the remaining states better.

#Recall the revenues of the top 3 states
state_rev[1:3,]
# A tibble: 3 × 2
  admin_state   revenue
  <chr>           <dbl>
1 DC          90133011.
2 AZ          25296830.
3 MD          18554123.

This shows us DC has the highest average revenue for museums, at $90 milion, which means it is represented in the farthest right histogram column in our last visual. Though we should keep this observation in mind, it will be helpful to remove it for a zoomed-in look at the remaining data. We can use the revenue from our 2nd highest, AZ at 2.6e+07, to help filter our data for an improved visual.

#Filter only the states with an average revenue of under 75,000,000.
state_rev_fil <- filter(state_rev, revenue < 3e+07)

#Remake a histogram with the filtered data
ggplot(state_rev_fil, aes(revenue))+
  geom_histogram(position="identity", color = 2, fill = "orange", bins = 50)+
  labs (x="Revenue", y = "Density", title="Average Museum Revenues of U.S. States")+
  theme(legend.position="none")

That is already much better. But since the data includes locations, it may be best visualized another way. To visualize revenue by state, which is a continuous variable by a discrete, we can create a map.

#Pull and rename state map data
usmap <- map_data("state")

#Create a column to match map data
mus$region <- mus$admin_state

#Join map data with museum data
museum_map <- full_join(mus, usmap, by = "region")

#Create new data frame
state_rev <- as.data.frame(state_rev)

#Create new column names "state"
state_rev$state <- state_rev$admin_state

#Calculate percentiles for defining legend
x <- state_rev_fil$revenue
quantile(x, probs = seq(0, 1, 0.25))
        0%        25%        50%        75%       100% 
  110453.5  1891862.5  4063450.9  9040423.3 25296830.4 
#Create a vector to define legend
rev_breaks <- c(1.8e+06, 3.9e+06, 9e+06, 2.6e+07)

#Create a map
plot_usmap(
  regions = "states", data = state_rev, values = "revenue")+
  theme(panel.background = element_rect(colour = "black"))+
  scale_fill_continuous(low = "blue", high ="red", 
                          name = "Mean Revenue",label = scales::comma,
                          limits = c(0,3e+07)) + 
  theme(legend.position = "right")+
  labs (title="Average Museum Revenues of U.S. States")

One limitation of this visual lies in the outlier being extracted from the color gradient. DC, the location with the highest average revenue is not highlighted in the color scale because the number skews the data in a way that obscures the visualization of the revenue in all other states. This map visual does, however, allow us to quickly assess which the distribution of average museum revenue by state.

#Create a column bar graph
ggplot(state_rev, mapping = aes(x = state, y = revenue, color = state))+
  geom_col(width = 0.75, position = position_dodge(0.7), fill= "white")+
  labs (x="State", y = "Revenue", title = "Average Revenue by State")+
  theme (legend.position = "none", axis.text.x = element_text(margin = unit(c(0, 0, 0, 0), "mm"), angle = 90))

The column bar graph can illustrate state-by-state averages more clearly. We can see that DC and has the highest average revenue by far at a quick glance, with AZ next behind it.

An attempt to create a boxplot or violin plot was made with this data grouping. However, the state_rev data group contains only one numeric value - the mean revenue of each state. While box and violin plots show a range of values, we can investigate total revenue by state rather than the average.

#Investigate total revenue by state
state_total <- as.data.frame(mus %>%
  group_by (admin_state) %>%
  mutate (total_rev = sum(revenue)) %>%
  distinct (admin_state, total_rev) %>% 
  select (admin_state, total_rev) %>% 
  arrange(desc(total_rev)))
head(state_total)
  admin_state   total_rev
1          CA 21575961438
2          NY 21324000468
3          IL 11266964329
4          PA 10187718826
5          DC  8202103987
6          MD  6827917194

This shows the total revenue of CA, NY, and IL are the highest states for revenue production, respectively, from museums in 2014. Let’s visualie this.

#Create new column to match map column
state_total$state <- state_total$admin_state

#Create a map of TOTAL revenue by state
plot_usmap(
  regions = "states", data = state_total, values = "total_rev")+
  theme(panel.background = element_rect(colour = "black"))+
  scale_fill_continuous(low = "blue", high ="red", 
                          name = "Total Revenue",label = scales::comma,) + 
  theme(legend.position = "right")+
  labs (title="Total Revenues of Museums by U.S. States")

This visual may be more practical in terms of learning about states across the country and their museum-to-community interaction. A higher total revenue may indicate higher turnout on big exhibit events, more interationc with the community on a daily basis, or perhaps more donations from from the public and coorporations.

#Create a boxplot of all revenue observations by state
ggplot(mus, mapping = aes(x = admin_state, y = revenue, color = admin_state))+
  geom_boxplot()+
  labs (x="State", y = "Revenue", title = "Revenue by State")+
  theme (legend.position = "none", axis.text.x = element_text(margin = unit(c(0, 0, 0, 0), "mm"), angle = 90))

This visual shows that AZ has few points, but the highest outlier, which could explain why it is one of the highest mean revenue. It also shows that CA has several high value outliers that explain its position as the highest total revenue state. But why are the boxes flat?

#Count number of "0" values are in revenue
mus %>% 
  count(revenue == "0")
# A tibble: 2 × 2
  `revenue == "0"`     n
  <lgl>            <int>
1 FALSE            10115
2 TRUE             10148

This explains why the boxes in the boxplot are flat. There are more rows with 0 revenue than with a positive number.

This may raise a new question. How is museum type distributed among the top 3 states with the highest total revenue?

#Count types of museums in the top revenue states
mus %>%
  select (admin_state, mus_type) %>% 
  filter(admin_state == "CA" | admin_state == "NY" | admin_state == "IL") %>% 
  count(mus_type)
# A tibble: 9 × 2
  mus_type            n
  <chr>           <int>
1 Art               359
2 Children           72
3 Garden            145
4 General           706
5 History           339
6 Natrual History    43
7 Preservation     2047
8 Science            64
9 <NA>               52
#Assign total revenue column to data frame
mus_rev <- mus %>%
  group_by (admin_state) %>%
  mutate (total_rev = sum(revenue))

#Filter top 3 states, graph by revenue and museum type  
mus_rev %>%
  select (admin_state, revenue, total_rev, mus_type) %>%
  filter(admin_state == "CA" | admin_state == "NY" | admin_state == "IL") %>% 
  group_by (admin_state) %>%
  arrange(desc(total_rev)) %>% 
  ggplot( mapping = aes(x = admin_state, y = revenue, fill = mus_type))+
  geom_col()+
  labs (x="State", y = "Revenue", title = "Highest total Revenue States and Musteum Type")+
  theme (axis.text.x = element_text(margin = unit(c(0, 0, 0, 0), "mm"), angle = 90))

This visual shows that Art museums account for the highest amount of revenue in CA, IL, and NY. Interestingly we can now see that history museums contribute a large amount of revenue for NY compared with the other two states. The number of history museums in CA, IL, and NY are 151, 66, and 122, respectively. Which means NY has fewer history museums, but generates a higher revenue from them than CA. Perhaps NY charges more for admission or is given more donations that are counted as revenue.

Question 3: What types of museums produce the highest and lowest average revenue?

#Find mean revenue by museum type
top_rev <- mus %>%
  group_by(mus_type) %>%
  select(revenue, mus_type) %>%
  summarize_all(mean, na.rm = TRUE) %>%
  ungroup() %>%
  arrange(desc(revenue)) %>%
  slice(1:3)

top_rev
# A tibble: 3 × 2
  mus_type   revenue
  <chr>        <dbl>
1 Art      58330122.
2 Science  49744546.
3 Garden   21137443.

The museum types with the highest average revenue are Art, Science, and Garden museums, respectively.

low_rev <- mus %>%
  group_by(mus_type) %>%
  select(revenue, mus_type) %>%
  summarize_all(mean) %>%
  ungroup() %>%
  arrange(revenue) %>%
  slice(1:3)

low_rev
# A tibble: 3 × 2
  mus_type      revenue
  <chr>           <dbl>
1 Preservation  842426.
2 Children     1249144.
3 General      4505061.

The museum types with the lowest average revenue are Preservation, Children, and General museums, respectively.

#Create a boxplot visual
mus%>%
  drop_na (mus_type) %>%
  ggplot (., mapping = aes(x = revenue, y = mus_type, fill = mus_type))+
  geom_boxplot(outlier.colour= NULL)+
  labs (title = "Revenue by Museum type", x = "Revenue", y = "Museum Type")+
  theme (legend.position="none")
#Investigate how many museums have a revenue of 0
(zero <- sum(mus$revenue == "0"))
[1] 10148
#COunt number of total observations
nrow(mus)
[1] 20263
#Calculate percent of observations with a "0" revenue
percent((zero)/(nrow(mus)), accuracy = 0.01)
[1] "50.08%"

Art museums have the highest average revenue, while Preservation museums have the lowest. This boxplot accurately portrays that relationship, with more high outliers in Art than any other category and fewer in Preservation. The appearance of flattened boxes is explained by the high volume of museums, 10,148 to be exact, that reported a revenue of “0”.

Question 4: What type of museum had the highest total income in 2014?

#Create a data frame to work with
type_income <- data.frame(mus %>% 
  select(mus_type, income) %>%
  group_by(mus_type) %>% 
  mutate(total_income = sum(income))%>% #New column with total income
  arrange(desc(total_income)))  #Arrange by highest total income

#Confirm the new object is a data frame and not a tibble
is.data.frame(type_income)
[1] TRUE
#Create visual of total income vs museum type
type_income %>% 
  mutate(mus_type = fct_reorder(mus_type, total_income)) %>% 
  ggplot(mapping = aes(x = mus_type, y = total_income, color = mus_type))+
  geom_col(width = 0.75, position = position_dodge(0.7), fill= "white")+
  labs (x="State", y = "Revenue", title = "Total income by Museum type")+
  theme (legend.position = "none", axis.text.x = element_text(margin = unit(c(0, 0, 0, 0), "mm"), angle = 90))

Art museums had the highest total income in 2014 in the U.S..

Question 5: What is the most common type of museum in each state?

#Find most common museum type by state
state_type <- mus %>% 
    group_by(admin_state) %>% 
    count(mus_type) %>% 
    slice(which.max(n)) %>% 
    select(-n)

#Create data frame
state_type_df <- as.data.frame(state_type)
head(state_type_df)
  admin_state     mus_type
1          AK Preservation
2          AL Preservation
3          AR Preservation
4          AZ Preservation
5          CA Preservation
6          CO Preservation
#Confirm the data includes DC as a state by a count
length(state_type_df$admin_state)
[1] 51
#Count number of states in which Preservation is the most common
length(which(state_type_df$mus_type == "Preservation"))
[1] 49

As expected, given our understanding of the most common type of museum, the Preservation museum type is the most common in all but 2 states.

#Identify the states where Preservation is not the most common
state_type_df %>% 
  select(admin_state, mus_type) %>% 
  filter(mus_type != "Preservation")
  admin_state mus_type
1          UT  General
2          WY  General

Wyoming and DC are the “states” that do not have Preservations are their most common museum. Interestingly, they share the most common type, which is General museums.

Question 6: How does income relate to revenue among U.S. museums?

#Create a scatterplot of museum income vs. revenue.
mus %>% 
  ggplot(.,aes(x = income, y = revenue))+
  geom_point(size = 3, shape = 23)+
  geom_smooth()+
  labs (x="Income", y = "Revenue", title = "U.S. Museum 2014 Income vs. Revenue")

The above scatterplot shows a strong positive, linear association between the museums’ incomes and revenues with a number of potential outliers.

#We can remove the far most outlier to zoom in a bit
#Compute interquartile range to identify outliers
Q1 <- quantile(mus$income, .25)
Q3 <- quantile(mus$income, .75)
(IQR <- IQR(mus$income))
[1] 171649
(low_lim = Q1 - 1.5 * IQR)
      25% 
-257473.5 
(up_lim = Q3 + 1.5 * IQR)
     75% 
429122.5 
#Create tibble to view the outliers
outliers <- mus %>% 
  select(mus_type, admin_state, income) %>%
  filter(income > up_lim | income < low_lim) %>% 
  arrange(income)
outliers
# A tibble: 3,459 × 3
   mus_type     admin_state income
   <chr>        <chr>        <dbl>
 1 History      MA          429295
 2 History      DC          429387
 3 Preservation MI          429604
 4 General      OH          429694
 5 Children     TN          429888
 6 History      OK          430316
 7 General      MN          430400
 8 General      MA          430419
 9 History      TX          430483
10 General      NJ          430567
# … with 3,449 more rows
#Calculate percent of data we would drop if we extract outliers
percent((nrow(outliers))/(nrow(mus)), accuracy = 0.01)
[1] "17.07%"
#Remove the outliers
income_fil <- as.data.frame (mus %>% 
  filter(income < up_lim & income > low_lim))

income_fil %>% 
  ggplot(.,aes(x = income, y = revenue))+
  geom_point(size = 2, shape = 23, color = 6)+
  geom_smooth()+
  labs (x="Income", y = "Revenue", title = "Outlier Removed - U.S. Museum Income vs. Revenue")

Extracing the statistical outliers has returned a subset of the data that has a very strong linear and slightly positive correlation between income and revenue.

Question 7: What is the museum type with the highest total revenue?

#Create data frame with summarized revenue
total_rev <- as.data.frame(mus %>%
  select(mus_type, admin_state, revenue, admin_state) %>%
  group_by(mus_type) %>%
  mutate(total_revenue = sum(revenue)) %>% 
  arrange(desc(total_revenue))) %>% 
  ungroup()%>%
  arrange(desc(total_revenue))

#Column chart with total revenue by museum type
total_rev %>% 
  mutate(mus_type = fct_reorder(mus_type, total_revenue)) %>%
  ggplot(aes(mus_type, total_revenue, fill = mus_type))+
  geom_col()+
  theme(axis.title.x = element_blank())+
  labs (x="Museum Type", y = "Total Revenue", title = "Total Revenue by of U.S. Museum Types", fill = "Museum Type")+
  theme(axis.text.x=element_blank(),
      axis.ticks.x=element_blank())
#Verify the top three types by revenue
total_rev %>% 
  group_by(mus_type) %>%
  summarize(tot_rev = sum(revenue)) %>% 
  arrange(desc(tot_rev)) %>% 
  slice (1:3)
# A tibble: 3 × 2
  mus_type     tot_rev
  <chr>          <dbl>
1 Art      91286640832
2 Science  17510080041
3 General  16497532030

The museum type with the highest total revenue from 2014 is Art Museums, followed by Science Museums and General museums.

Reflection

A significant amount of time was spent analyzing question 6 much deeper. Namely, the statistics which were used to identify the outliers was a learning curve. However, the end result satisfies the goal well. Additionally, a substantial amount of time was spent find the best way to adjust the range on the map to properly show the distribution of values after figuring the best way to filter out the top most outlier, which was skewing the original scale. The 2 states whose most common museum is not preservation was identified. Additional notes were made regarding missing values, including theorizing the cause and effect of whether to leave them, as well as adding computations to support the ideas.

During a deeper analysis of question 3, it was discovered that there were duplicate entries in the orinigal data. Before removing duplicates, Science was the museum with the highest average revenue, but only because the observation with its highest revenue was repeated 3 times. After removing duplicates, Art was found to be the hightest average revenu generator. This problem fix also changed answers to other questions, including question 2 which asked which states have the highest average revenue. With duplicates, MA was the second highest by a far margin outside the rest of the data. After removing duplicates, it fell to 13th. So, removing duplicates has a significant effect on mean calculations. It also meant there was only 1 outlier to adjust for when visualizing questions 2, 3, and 6. Removing the duplicates was a game-changer. It also drastically altered the final scatter plot in question 6. While the trend is similar (linear and positve), the degree to which the relationship is positive is much less steep than before the data was fully cleaned.

Going forward, checking for and removing true duplicates will be a part of every data cleaning project.

Conclusion

Keep in mind DC is counted as a state for purposes of this data analysis.

Most common type of museum in the U.S.: Preservations, with a total count of 11,316, accounting for 56% of all U.S. museums.

2nd and 3rd most common museum types in the U.S.: General museums are 2nd most common with 3,662 in total and 18% of all museums. History museums are in 3rd with 1,880 total museums and accounting for 9% of all U.S. museums.

States with the highest AVERAGE revenue: DC is 1st with $90,133,011, Arionza is 2nd with $25,296,830, and Maryland is 3rd with $18,554,123.

States with ther highest TOTAL revenue: CA is 1st with $21,575,961,438, NY is 2nd with $21,324,000,468, and IL is 3rd with $11,266,964,329.

Museum type with the highest average revenue: Art Museums. Type of museum with the highest total income: Art Museums. Type of museum with the highest total revenue: Art Museums.

Museum type with the lowest average revenue: Historic Preservation Museums.

Number of museums with a revenue of “0”: 10,148 out of 20,263 museums, or 50.08% of the data.

Most common type of museum in each state: Historic Preservation is the most common type in 9 out of the 51 (including DC) states. Utah and Wyoming’s most common museum was General museums.

Percent of outliers calculated by IQR: 17.07% This amount of data was removed to visualize income vs. revenue, which is a weakly positive, but strongly linear correlation.

Bibliography

R: R Core Team (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.

museum information: https://www.aam-us.org/programs/about-museums/museum-facts-data/

kaggle: https://www.kaggle.com/code/priankravichandar/museums-data-analysis

Course textbook: Grolemund, G., & Wickham, H. (2017). R for Data Science. O’Reilly Media.