Final Project DACSS 601
Museums are an important part of community, culture, and an educational environment. The American Alliance of Museums (AAM) reports that museums support over 726,000 American jobs and contribute $50 billion to the US economy each year. The economic activity of museums generates over $12 billion in tax revenue, one-third of it going to state and local governments. Each job created by the museum sector results in $16,495 in additional tax revenue. Additionally, multiple studies and surveys about the best places to like in the US incorporate concentrations of accredited museums.
Places that both serve the public and partner with schools, Museums provide an educational environment for entire communities. Museums help teach the state and local curricula, tailoring their programs in math, science, art, literacy, language arts, history, civics and government, economics and financial literacy, geography, and social studies. AAM reports 97% of Americans believe that museums are educational assets for their communities. There are endless benefits to supporting museums in the US. One such way is the compilation and analysis of data to ensure grants can be given, provide tax-exemptions, and gain a better understanding of the locations and museum types that produce the highest revenue.
The following is an analysis of Museums, Aquariums, Zoos, and related organizations in the United State. The data includes information about each organization such as name, address, income, revenue, and discipline type. The discipline type is based on the National Taxonomy of Exempt Entities, which the National Center for Charitable Statistics and IRS use to classify nonprofit organizations. The data set is available on Kaggle and is under public domain. The data was originally compiled in 2014. The museum data was compiled from IMLS administrative records for discretionary grant recipients, IRS records for tax-exempt organizations, and private foundation grant recipients.
The research questions for which this analysis aims to
provide insight include:
Question 1: What is the most common type of museum in the United
States?
Question 2: Which states have the highest average revenue from
museums?
Question 3: What types of museums produce the highest and lowest average
revenue?
Question 4: What type of museum had the highest total income in
2014?
Question 5: What is the most common type of museum in each state?
Question 6: How does income relate to revenue among U.S. museums?
Question 7: What is the museum type with the highest total revenue?
First, R packages are loaded to aid in analysis.
To import the data, we can use read_csv to upload the file and assign it
the name “mus.”
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(dplyr)
library(usmap)
library(ggplot2)
library(plotly)
library(mapview)
library(forcats)
library(tigris, options(tigris_refresh=TRUE))
library(tidycensus)
library(colorspace)
library(scales)
#Import dataset
mus <- read_csv("museums.csv")
#View data frame
head(mus)
# A tibble: 6 × 25
`Museum ID` `Museum Name` `Legal Name` `Alternate Name`
<dbl> <chr> <chr> <chr>
1 8400200098 ALASKA AVIATION HERITAGE … ALASKA AVIA… <NA>
2 8400200117 ALASKA BOTANICAL GARDEN ALASKA BOTA… <NA>
3 8400200153 ALASKA CHALLENGER CENTER … ALASKA CHAL… <NA>
4 8400200143 ALASKA EDUCATORS HISTORIC… ALASKA EDUC… <NA>
5 8400200027 ALASKA HERITAGE MUSEUM ALASKA AVIA… <NA>
6 8400200096 ALASKA HISTORICAL MUSEUM ALASKA HIST… <NA>
# … with 21 more variables: `Museum Type` <chr>,
# `Institution Name` <chr>,
# `Street Address (Administrative Location)` <chr>,
# `City (Administrative Location)` <chr>,
# `State (Administrative Location)` <chr>,
# `Zip Code (Administrative Location)` <dbl>,
# `Street Address (Physical Location)` <chr>, …
#Determine column names
colnames(mus)
[1] "Museum ID"
[2] "Museum Name"
[3] "Legal Name"
[4] "Alternate Name"
[5] "Museum Type"
[6] "Institution Name"
[7] "Street Address (Administrative Location)"
[8] "City (Administrative Location)"
[9] "State (Administrative Location)"
[10] "Zip Code (Administrative Location)"
[11] "Street Address (Physical Location)"
[12] "City (Physical Location)"
[13] "State (Physical Location)"
[14] "Zip Code (Physical Location)"
[15] "Phone Number"
[16] "Latitude"
[17] "Longitude"
[18] "Locale Code (NCES)"
[19] "County Code (FIPS)"
[20] "State Code (FIPS)"
[21] "Region Code (AAM)"
[22] "Employer ID Number"
[23] "Tax Period"
[24] "Income"
[25] "Revenue"
The stock column names include spaces and capital letters. For simplicity, we will amend the the column names to snake case.
#Rename columns
colnames(mus) <- c("museum_id", "museum_name", "legal_name", "alt_name", "museum_type", "inst_name", "admin_address", "admin_city", "admin_state", "admin_zip_code", "physical_address", "physical_city", "physical_state", "physical_zip_code", "phone_number", "latitude", "longitude", "locale_code", "county_code", "state_code", "region_code", "employer_id", "tax_period", "income", "revenue")
The data set includes many lines that are not needed for analysis, such as phone number and county code. We can drop the unneeded columns by selecting the ones we want to keep and then assigning to the dataframe.
keep_col <- c("museum_id","legal_name","museum_type","admin_city","admin_state","income","revenue")
mus <-mus[keep_col]
colnames(mus)
[1] "museum_id" "legal_name" "museum_type" "admin_city"
[5] "admin_state" "income" "revenue"
Museum type is an import column for our analysis, but it includes some very long character strings. We can edit this by using mutate and choosing new type names.
#Simplify museum type labels
mus <- mus%>%
mutate(mus_type = case_when(
(museum_type == "ARBORETUM, BOTANICAL GARDEN, OR NATURE CENTER") ~ "Garden",
museum_type == "ART MUSEUM" ~ "Art",
museum_type == "CHILDREN'S MUSEUM" ~ "Children",
museum_type == "GENERAL MUSEUM" ~ "General",
museum_type == "HISTORIC PRESERVATION" ~ "Preservation",
museum_type == "HISTORY MUSEUM" ~ "History",
museum_type == "NATURAL HISTORY MUSEUM" ~ "Natrual History",
museum_type == "SCIENCE & TECHNOLOGY MUSEUM OR PLANETARIUM" ~ "Science",
museum_type == "ZOO, AQUARIUM, OR WILDLIFE CONSERVATION " ~ "Zoo",
))
#Check that museum types were edited
mus %>%
select(mus_type)%>%
slice(1:10)
# A tibble: 10 × 1
mus_type
<chr>
1 History
2 Garden
3 Science
4 Preservation
5 History
6 Preservation
7 General
8 Preservation
9 General
10 History
During analysis of question 3, it was discovered that there are duplicate entries in our data. Before removing duplicates, Science was the museum with the highest average revenue, but only because the observation with its highest revenue was repeated 3 times. To ensure we are only analying one row for each museum, we can use eliminate duplicates.
#Remove duplicate rows
mus <- mus %>%
distinct(legal_name, .keep_all = TRUE)
#Verify the max revenue for science now only appears 1 time rather than 4
mus %>%
filter(mus_type == "Science") %>%
filter(revenue == max(revenue, na.rm = TRUE)) %>%
select(mus_type, revenue, legal_name)
# A tibble: 1 × 3
mus_type revenue legal_name
<chr> <dbl> <chr>
1 Science 5840349457 PRESIDENT AND FELLOWS OF HARVARD COLLEGE
# A tibble: 3 × 2
`mus_type == "Science" & revenue == "5840349457"` n
<lgl> <int>
1 FALSE 29616
2 TRUE 1
3 NA 813
Now the data frame contains only one observation for each museum.
The data set has quite a few missing values that will interfere with any analysis. The reason for missing data may be linked to the process of data collection, which was conducted by a third party. It may be that some organizations were reported in a way that excluded pertinent information to this data, but fulfilled the necessary requirements for submission. To rectify the dataset, we can find and omit the rows with missing values.
[1] 20251
mus %>%
select(everything()) %>%
summarise_all(funs(sum(is.na(.))))
# A tibble: 1 × 8
museum_id legal_name museum_type admin_city admin_state income
<int> <int> <int> <int> <int> <int>
1 0 0 0 0 0 9533
# … with 2 more variables: revenue <int>, mus_type <int>
#Remove missing values
mus <- mus %>%
drop_na(income, revenue, museum_type)
#Verify missing values are omitted
sum(is.na(mus))
[1] 287
There were 20251 missing or “NA” values, which were distributed among the income and revenue columns. It could be that income and revenue for some organizations may not have been calculated yet for the year or data collection.
Multiple rows of data include income or revenue values less than 0. The negative value data may be erroneous or due to non-profit public exhibits operating at a loss for the year of 2014. Negative values do not account for a huge portion of our data, but may be erroneous and cannot be verified. Unverified data may interfere with statistical analysis.
#Count number of rows in data frame
(row <- nrow(mus))
[1] 20263
# A tibble: 2 × 2
`income < 0 | revenue < 0` n
<lgl> <int>
1 FALSE 20227
2 TRUE 36
#Investigate which rows include negative revenue
mus %>%
select(mus_type, admin_state, income, revenue) %>%
filter(income < 0 | revenue < 0)
# A tibble: 36 × 4
mus_type admin_state income revenue
<chr> <chr> <dbl> <dbl>
1 Preservation AL 64294 -4221
2 General AR 56592 -2579
3 Preservation AR 24000 -6000
4 Art AR 99811 -106157
5 Preservation CA 60174 -3455
6 General CT 3733 -25400
7 Science DC 63401 -4997
8 Science FL 194240 -598813
9 Preservation FL 39874 -8973
10 Preservation FL 165338 -14156
# … with 26 more rows
There are 36 observations where the revenue is listed as a negative number. We can remove those rows by filtering only the data we want to keep and creating a separate dataframe to work with for comparison purposes.
#Removing all rows with income and revenue less than 0
mus2 <- mus %>%
filter((income >=0) & (revenue >=0))
#Verify all rows with negative revenue are eliminated
sum(mus2$revenue <0 | mus2$income <0, na.rm=TRUE)
[1] 0
#Verify only 36 rows were removed
row2 <- nrow(mus2)
row-row2
[1] 36
We now have a data frame names “mus” that has cleaned data and another dataframe called “mus2” that has the same data, but with no observations with negative income or revenue.
#To find the most common museum type in the US we essential want the mode, but of a categorical variable
#Let's create a tibble of the museum type counts and frequencies
mus %>%
group_by(mus_type) %>%
summarise(count=n()) %>%
mutate(frequency = count/sum(count)) %>%
arrange(desc(count))
# A tibble: 9 × 3
mus_type count frequency
<chr> <int> <dbl>
1 Preservation 11316 0.558
2 General 3662 0.181
3 History 1880 0.0928
4 Art 1565 0.0772
5 Garden 686 0.0339
6 Science 352 0.0174
7 Children 310 0.0153
8 <NA> 287 0.0142
9 Natrual History 205 0.0101
As we can see, Preservation, which stands for “HISTORIC PRESERVATION” in the original data, appears to be the most frequent museum type. The relative frequency shows us that Preservations account for over 56% of all museums accounted for in our data. Let’s visualize these categories.
#Let's check the same stats on the data frame that excludes negative income/revenue
mus2 %>%
group_by(mus_type) %>%
summarise(count=n()) %>%
mutate(frequency = count/sum(count)) %>%
arrange(desc(count))
# A tibble: 9 × 3
mus_type count frequency
<chr> <int> <dbl>
1 Preservation 11305 0.559
2 General 3648 0.180
3 History 1874 0.0926
4 Art 1564 0.0773
5 Garden 685 0.0339
6 Science 349 0.0173
7 Children 310 0.0153
8 <NA> 287 0.0142
9 Natrual History 205 0.0101
The frequency of museum type is barely affected by the removal of the negative numbers. Preservation museuems remains roughly 56% of the data. Perhaps removing the negative values is unnecessary if it is a small number of observations and the values may be true data. We can explore this more further on.
#To Visualize the frequency of museum types, a single categorical variable, we can create a bar chart
mus%>%
drop_na(mus_type)%>%
ggplot(mus, mapping = aes(x=mus_type, fill=mus_type))+
geom_bar()+
labs (title="Total U.S. Museum Types", fill = "Museum Type")+
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank())
The bar graph visually verifies that (historic) preservations are the most common type of museum in the US. However, it does also reveal that Preservations are disproportionately the most common. The term Historic Preservation refers to a location or building(s) that preserves the history of the area and can often be viewed or explored by the public. That could be due to the fact that there are truly many more historic prevervations, or due to the data collection process, which may have grouped many organizations into one group if they did not fit into other categories.
To investigate our research question slightly further, let us ask a related question. What are the top 3 most common museum types in the U.S.?
#First, we can create a new column with counts of each museum type category
mus$mus_type <- as.character(mus$mus_type)
mus$count <- as.numeric(ave(mus$mus_type, mus$mus_type, FUN = length))
#Then, we can group and arrange to find the top 3
(mt <- mus %>%
select(count, mus_type)%>%
group_by(mus_type) %>%
arrange(desc(count)) %>%
slice(1)%>%
ungroup()%>%
arrange(desc(count))%>%
slice(1:3))
# A tibble: 3 × 2
count mus_type
<dbl> <chr>
1 11316 Preservation
2 3662 General
3 1880 History
mt%>%
mutate(mus_type = fct_reorder(mus_type, count)) %>%
ggplot(., mapping = aes(x=mus_type, y=count, color=mus_type))+
geom_bar(stat='identity', fill= "white")+
labs (x="Museum Type", y = "Number of Museums", title="Top 3 Most Common U.S. Museum Types")+
theme(legend.position="none")
This shows us the General and History museum categories are the second and third most common, respectively, in the U.S. after Preservations.
#Find mean revenue of each state
(state_rev <- mus %>%
group_by(admin_state) %>%
select(revenue) %>%
summarize_all(mean, na.rm = TRUE)%>%
arrange(desc(revenue)))
# A tibble: 51 × 2
admin_state revenue
<chr> <dbl>
1 DC 90133011.
2 AZ 25296830.
3 MD 18554123.
4 RI 16997959.
5 TN 16456641.
6 CT 14957386.
7 NY 14839249.
8 DE 14606788.
9 IL 14208026.
10 CA 13510308.
# … with 41 more rows
The tibble indicates DC, AZ, and MD are the locations with the highest revenue in the U.s.. To visualize this, we can create a histogram of the average revenue to see the distribution. Will the removal of negative values in “mus2” drastically affect the average revenue by states?
#Investigate mus2 with this question
mus2 %>%
group_by(admin_state) %>%
select(revenue) %>%
summarize_all(mean, na.rm = TRUE)%>%
arrange(desc(revenue))
# A tibble: 51 × 2
admin_state revenue
<chr> <dbl>
1 DC 91134544.
2 AZ 25296830.
3 MD 18604679.
4 RI 16997959.
5 TN 16456641.
6 CT 14999591.
7 NY 14882038.
8 DE 14606788.
9 IL 14225965.
10 CA 13518775.
# … with 41 more rows
As noted earlier, the absence of negative income and revenue did not affected the revenue average distributions among states in any significant way. From this point on, we will stick with our original cleaned data and use only “mus” as our data frame.
#Create a histogram
ggplot(state_rev, aes(revenue))+
geom_histogram(position="identity", color = 4, fill = "white", bins = 50)+
labs (x="Revenue", y = "Density", title="Average Museum Revenues of U.S. States")+
theme(legend.position="none")
The histogram shows us there are two major outliers. Let’s find out what locations they are and then remove them to visualize the remaining states better.
#Recall the revenues of the top 3 states
state_rev[1:3,]
# A tibble: 3 × 2
admin_state revenue
<chr> <dbl>
1 DC 90133011.
2 AZ 25296830.
3 MD 18554123.
This shows us DC has the highest average revenue for museums, at $90 milion, which means it is represented in the farthest right histogram column in our last visual. Though we should keep this observation in mind, it will be helpful to remove it for a zoomed-in look at the remaining data. We can use the revenue from our 2nd highest, AZ at 2.6e+07, to help filter our data for an improved visual.
#Filter only the states with an average revenue of under 75,000,000.
state_rev_fil <- filter(state_rev, revenue < 3e+07)
#Remake a histogram with the filtered data
ggplot(state_rev_fil, aes(revenue))+
geom_histogram(position="identity", color = 2, fill = "orange", bins = 50)+
labs (x="Revenue", y = "Density", title="Average Museum Revenues of U.S. States")+
theme(legend.position="none")
That is already much better. But since the data includes locations, it may be best visualized another way. To visualize revenue by state, which is a continuous variable by a discrete, we can create a map.
#Pull and rename state map data
usmap <- map_data("state")
#Create a column to match map data
mus$region <- mus$admin_state
#Join map data with museum data
museum_map <- full_join(mus, usmap, by = "region")
#Create new data frame
state_rev <- as.data.frame(state_rev)
#Create new column names "state"
state_rev$state <- state_rev$admin_state
#Calculate percentiles for defining legend
x <- state_rev_fil$revenue
quantile(x, probs = seq(0, 1, 0.25))
0% 25% 50% 75% 100%
110453.5 1891862.5 4063450.9 9040423.3 25296830.4
#Create a vector to define legend
rev_breaks <- c(1.8e+06, 3.9e+06, 9e+06, 2.6e+07)
#Create a map
plot_usmap(
regions = "states", data = state_rev, values = "revenue")+
theme(panel.background = element_rect(colour = "black"))+
scale_fill_continuous(low = "blue", high ="red",
name = "Mean Revenue",label = scales::comma,
limits = c(0,3e+07)) +
theme(legend.position = "right")+
labs (title="Average Museum Revenues of U.S. States")
One limitation of this visual lies in the outlier being extracted from the color gradient. DC, the location with the highest average revenue is not highlighted in the color scale because the number skews the data in a way that obscures the visualization of the revenue in all other states. This map visual does, however, allow us to quickly assess which the distribution of average museum revenue by state.
#Create a column bar graph
ggplot(state_rev, mapping = aes(x = state, y = revenue, color = state))+
geom_col(width = 0.75, position = position_dodge(0.7), fill= "white")+
labs (x="State", y = "Revenue", title = "Average Revenue by State")+
theme (legend.position = "none", axis.text.x = element_text(margin = unit(c(0, 0, 0, 0), "mm"), angle = 90))
The column bar graph can illustrate state-by-state averages more clearly. We can see that DC and has the highest average revenue by far at a quick glance, with AZ next behind it.
An attempt to create a boxplot or violin plot was made with this data grouping. However, the state_rev data group contains only one numeric value - the mean revenue of each state. While box and violin plots show a range of values, we can investigate total revenue by state rather than the average.
#Investigate total revenue by state
state_total <- as.data.frame(mus %>%
group_by (admin_state) %>%
mutate (total_rev = sum(revenue)) %>%
distinct (admin_state, total_rev) %>%
select (admin_state, total_rev) %>%
arrange(desc(total_rev)))
head(state_total)
admin_state total_rev
1 CA 21575961438
2 NY 21324000468
3 IL 11266964329
4 PA 10187718826
5 DC 8202103987
6 MD 6827917194
This shows the total revenue of CA, NY, and IL are the highest states for revenue production, respectively, from museums in 2014. Let’s visualie this.
#Create new column to match map column
state_total$state <- state_total$admin_state
#Create a map of TOTAL revenue by state
plot_usmap(
regions = "states", data = state_total, values = "total_rev")+
theme(panel.background = element_rect(colour = "black"))+
scale_fill_continuous(low = "blue", high ="red",
name = "Total Revenue",label = scales::comma,) +
theme(legend.position = "right")+
labs (title="Total Revenues of Museums by U.S. States")
This visual may be more practical in terms of learning about states across the country and their museum-to-community interaction. A higher total revenue may indicate higher turnout on big exhibit events, more interationc with the community on a daily basis, or perhaps more donations from from the public and coorporations.
#Create a boxplot of all revenue observations by state
ggplot(mus, mapping = aes(x = admin_state, y = revenue, color = admin_state))+
geom_boxplot()+
labs (x="State", y = "Revenue", title = "Revenue by State")+
theme (legend.position = "none", axis.text.x = element_text(margin = unit(c(0, 0, 0, 0), "mm"), angle = 90))
This visual shows that AZ has few points, but the highest outlier, which could explain why it is one of the highest mean revenue. It also shows that CA has several high value outliers that explain its position as the highest total revenue state. But why are the boxes flat?
# A tibble: 2 × 2
`revenue == "0"` n
<lgl> <int>
1 FALSE 10115
2 TRUE 10148
This explains why the boxes in the boxplot are flat. There are more rows with 0 revenue than with a positive number.
This may raise a new question. How is museum type distributed among the top 3 states with the highest total revenue?
#Count types of museums in the top revenue states
mus %>%
select (admin_state, mus_type) %>%
filter(admin_state == "CA" | admin_state == "NY" | admin_state == "IL") %>%
count(mus_type)
# A tibble: 9 × 2
mus_type n
<chr> <int>
1 Art 359
2 Children 72
3 Garden 145
4 General 706
5 History 339
6 Natrual History 43
7 Preservation 2047
8 Science 64
9 <NA> 52
#Assign total revenue column to data frame
mus_rev <- mus %>%
group_by (admin_state) %>%
mutate (total_rev = sum(revenue))
#Filter top 3 states, graph by revenue and museum type
mus_rev %>%
select (admin_state, revenue, total_rev, mus_type) %>%
filter(admin_state == "CA" | admin_state == "NY" | admin_state == "IL") %>%
group_by (admin_state) %>%
arrange(desc(total_rev)) %>%
ggplot( mapping = aes(x = admin_state, y = revenue, fill = mus_type))+
geom_col()+
labs (x="State", y = "Revenue", title = "Highest total Revenue States and Musteum Type")+
theme (axis.text.x = element_text(margin = unit(c(0, 0, 0, 0), "mm"), angle = 90))
This visual shows that Art museums account for the highest amount of revenue in CA, IL, and NY. Interestingly we can now see that history museums contribute a large amount of revenue for NY compared with the other two states. The number of history museums in CA, IL, and NY are 151, 66, and 122, respectively. Which means NY has fewer history museums, but generates a higher revenue from them than CA. Perhaps NY charges more for admission or is given more donations that are counted as revenue.
#Find mean revenue by museum type
top_rev <- mus %>%
group_by(mus_type) %>%
select(revenue, mus_type) %>%
summarize_all(mean, na.rm = TRUE) %>%
ungroup() %>%
arrange(desc(revenue)) %>%
slice(1:3)
top_rev
# A tibble: 3 × 2
mus_type revenue
<chr> <dbl>
1 Art 58330122.
2 Science 49744546.
3 Garden 21137443.
The museum types with the highest average revenue are Art, Science, and Garden museums, respectively.
low_rev <- mus %>%
group_by(mus_type) %>%
select(revenue, mus_type) %>%
summarize_all(mean) %>%
ungroup() %>%
arrange(revenue) %>%
slice(1:3)
low_rev
# A tibble: 3 × 2
mus_type revenue
<chr> <dbl>
1 Preservation 842426.
2 Children 1249144.
3 General 4505061.
The museum types with the lowest average revenue are Preservation, Children, and General museums, respectively.
#Create a boxplot visual
mus%>%
drop_na (mus_type) %>%
ggplot (., mapping = aes(x = revenue, y = mus_type, fill = mus_type))+
geom_boxplot(outlier.colour= NULL)+
labs (title = "Revenue by Museum type", x = "Revenue", y = "Museum Type")+
theme (legend.position="none")
#Investigate how many museums have a revenue of 0
(zero <- sum(mus$revenue == "0"))
[1] 10148
#COunt number of total observations
nrow(mus)
[1] 20263
[1] "50.08%"
Art museums have the highest average revenue, while Preservation museums have the lowest. This boxplot accurately portrays that relationship, with more high outliers in Art than any other category and fewer in Preservation. The appearance of flattened boxes is explained by the high volume of museums, 10,148 to be exact, that reported a revenue of “0”.
#Create a data frame to work with
type_income <- data.frame(mus %>%
select(mus_type, income) %>%
group_by(mus_type) %>%
mutate(total_income = sum(income))%>% #New column with total income
arrange(desc(total_income))) #Arrange by highest total income
#Confirm the new object is a data frame and not a tibble
is.data.frame(type_income)
[1] TRUE
#Create visual of total income vs museum type
type_income %>%
mutate(mus_type = fct_reorder(mus_type, total_income)) %>%
ggplot(mapping = aes(x = mus_type, y = total_income, color = mus_type))+
geom_col(width = 0.75, position = position_dodge(0.7), fill= "white")+
labs (x="State", y = "Revenue", title = "Total income by Museum type")+
theme (legend.position = "none", axis.text.x = element_text(margin = unit(c(0, 0, 0, 0), "mm"), angle = 90))
Art museums had the highest total income in 2014 in the U.S..
#Find most common museum type by state
state_type <- mus %>%
group_by(admin_state) %>%
count(mus_type) %>%
slice(which.max(n)) %>%
select(-n)
#Create data frame
state_type_df <- as.data.frame(state_type)
head(state_type_df)
admin_state mus_type
1 AK Preservation
2 AL Preservation
3 AR Preservation
4 AZ Preservation
5 CA Preservation
6 CO Preservation
#Confirm the data includes DC as a state by a count
length(state_type_df$admin_state)
[1] 51
#Count number of states in which Preservation is the most common
length(which(state_type_df$mus_type == "Preservation"))
[1] 49
As expected, given our understanding of the most common type of museum, the Preservation museum type is the most common in all but 2 states.
#Identify the states where Preservation is not the most common
state_type_df %>%
select(admin_state, mus_type) %>%
filter(mus_type != "Preservation")
admin_state mus_type
1 UT General
2 WY General
Wyoming and DC are the “states” that do not have Preservations are their most common museum. Interestingly, they share the most common type, which is General museums.
#Create a scatterplot of museum income vs. revenue.
mus %>%
ggplot(.,aes(x = income, y = revenue))+
geom_point(size = 3, shape = 23)+
geom_smooth()+
labs (x="Income", y = "Revenue", title = "U.S. Museum 2014 Income vs. Revenue")
The above scatterplot shows a strong positive, linear association between the museums’ incomes and revenues with a number of potential outliers.
#We can remove the far most outlier to zoom in a bit
#Compute interquartile range to identify outliers
Q1 <- quantile(mus$income, .25)
Q3 <- quantile(mus$income, .75)
(IQR <- IQR(mus$income))
[1] 171649
(low_lim = Q1 - 1.5 * IQR)
25%
-257473.5
(up_lim = Q3 + 1.5 * IQR)
75%
429122.5
#Create tibble to view the outliers
outliers <- mus %>%
select(mus_type, admin_state, income) %>%
filter(income > up_lim | income < low_lim) %>%
arrange(income)
outliers
# A tibble: 3,459 × 3
mus_type admin_state income
<chr> <chr> <dbl>
1 History MA 429295
2 History DC 429387
3 Preservation MI 429604
4 General OH 429694
5 Children TN 429888
6 History OK 430316
7 General MN 430400
8 General MA 430419
9 History TX 430483
10 General NJ 430567
# … with 3,449 more rows
#Calculate percent of data we would drop if we extract outliers
percent((nrow(outliers))/(nrow(mus)), accuracy = 0.01)
[1] "17.07%"
#Remove the outliers
income_fil <- as.data.frame (mus %>%
filter(income < up_lim & income > low_lim))
income_fil %>%
ggplot(.,aes(x = income, y = revenue))+
geom_point(size = 2, shape = 23, color = 6)+
geom_smooth()+
labs (x="Income", y = "Revenue", title = "Outlier Removed - U.S. Museum Income vs. Revenue")
Extracing the statistical outliers has returned a subset of the data that has a very strong linear and slightly positive correlation between income and revenue.
#Create data frame with summarized revenue
total_rev <- as.data.frame(mus %>%
select(mus_type, admin_state, revenue, admin_state) %>%
group_by(mus_type) %>%
mutate(total_revenue = sum(revenue)) %>%
arrange(desc(total_revenue))) %>%
ungroup()%>%
arrange(desc(total_revenue))
#Column chart with total revenue by museum type
total_rev %>%
mutate(mus_type = fct_reorder(mus_type, total_revenue)) %>%
ggplot(aes(mus_type, total_revenue, fill = mus_type))+
geom_col()+
theme(axis.title.x = element_blank())+
labs (x="Museum Type", y = "Total Revenue", title = "Total Revenue by of U.S. Museum Types", fill = "Museum Type")+
theme(axis.text.x=element_blank(),
axis.ticks.x=element_blank())
#Verify the top three types by revenue
total_rev %>%
group_by(mus_type) %>%
summarize(tot_rev = sum(revenue)) %>%
arrange(desc(tot_rev)) %>%
slice (1:3)
# A tibble: 3 × 2
mus_type tot_rev
<chr> <dbl>
1 Art 91286640832
2 Science 17510080041
3 General 16497532030
The museum type with the highest total revenue from 2014 is Art Museums, followed by Science Museums and General museums.
A significant amount of time was spent analyzing question 6 much deeper. Namely, the statistics which were used to identify the outliers was a learning curve. However, the end result satisfies the goal well. Additionally, a substantial amount of time was spent find the best way to adjust the range on the map to properly show the distribution of values after figuring the best way to filter out the top most outlier, which was skewing the original scale. The 2 states whose most common museum is not preservation was identified. Additional notes were made regarding missing values, including theorizing the cause and effect of whether to leave them, as well as adding computations to support the ideas.
During a deeper analysis of question 3, it was discovered that there were duplicate entries in the orinigal data. Before removing duplicates, Science was the museum with the highest average revenue, but only because the observation with its highest revenue was repeated 3 times. After removing duplicates, Art was found to be the hightest average revenu generator. This problem fix also changed answers to other questions, including question 2 which asked which states have the highest average revenue. With duplicates, MA was the second highest by a far margin outside the rest of the data. After removing duplicates, it fell to 13th. So, removing duplicates has a significant effect on mean calculations. It also meant there was only 1 outlier to adjust for when visualizing questions 2, 3, and 6. Removing the duplicates was a game-changer. It also drastically altered the final scatter plot in question 6. While the trend is similar (linear and positve), the degree to which the relationship is positive is much less steep than before the data was fully cleaned.
Going forward, checking for and removing true duplicates will be a part of every data cleaning project.
Keep in mind DC is counted as a state for purposes of this data analysis.
Most common type of museum in the U.S.: Preservations, with a total count of 11,316, accounting for 56% of all U.S. museums.
2nd and 3rd most common museum types in the U.S.: General museums are 2nd most common with 3,662 in total and 18% of all museums. History museums are in 3rd with 1,880 total museums and accounting for 9% of all U.S. museums.
States with the highest AVERAGE revenue: DC is 1st with $90,133,011, Arionza is 2nd with $25,296,830, and Maryland is 3rd with $18,554,123.
States with ther highest TOTAL revenue: CA is 1st with $21,575,961,438, NY is 2nd with $21,324,000,468, and IL is 3rd with $11,266,964,329.
Museum type with the highest average revenue: Art Museums. Type of museum with the highest total income: Art Museums. Type of museum with the highest total revenue: Art Museums.
Museum type with the lowest average revenue: Historic Preservation Museums.
Number of museums with a revenue of “0”: 10,148 out of 20,263 museums, or 50.08% of the data.
Most common type of museum in each state: Historic Preservation is the most common type in 9 out of the 51 (including DC) states. Utah and Wyoming’s most common museum was General museums.
Percent of outliers calculated by IQR: 17.07% This amount of data was removed to visualize income vs. revenue, which is a weakly positive, but strongly linear correlation.
R: R Core Team (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.
museum information: https://www.aam-us.org/programs/about-museums/museum-facts-data/
kaggle: https://www.kaggle.com/code/priankravichandar/museums-data-analysis
Course textbook: Grolemund, G., & Wickham, H. (2017). R for Data Science. O’Reilly Media.