Hello all! Greetings. My name is Gissella Nadya, and welcome to my very first Rmd! This project is my submission to the Algorit.ma Data Science School “Learning By Building” project for P4DS week. I chose Museums, Aquariums, and Zoos (MAZ) as the subject because I am very interested in finding out more about them! The data used in this project can be found on Kaggle. Now let’s get digging!

1 Background

This dataset was compiled in 2014 from Institute of Museum and Library Services (IMLS) administrative records for discretionary grant recipients, IRS records for tax-exempt organizations, and private foundation grant recipients. For this project, we are going to take the US government’s point of view. Let’s pretend that we are the United States Federal Government and that we want to get to know the country’s assets in terms of tourist places such as Museums, Aquariums, and Zoos.

Let’s get to know our dataset further!

2 Data Wrangling

2.1 Reading the File

library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ tibble  3.0.4     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.0
## ✓ purrr   0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
maz <-  read.csv("museums.csv")
maz

2.2 Looking Through the Columns

str(maz)
## 'data.frame':    33072 obs. of  25 variables:
##  $ Museum.ID                               : num  8.4e+09 8.4e+09 8.4e+09 8.4e+09 8.4e+09 ...
##  $ Museum.Name                             : chr  "ALASKA AVIATION HERITAGE MUSEUM" "ALASKA BOTANICAL GARDEN" "ALASKA CHALLENGER CENTER FOR SPACE SCIENCE TECHNOLOGY" "ALASKA EDUCATORS HISTORICAL SOCIETY" ...
##  $ Legal.Name                              : chr  "ALASKA AVIATION HERITAGE MUSEUM" "ALASKA BOTANICAL GARDEN INC" "ALASKA CHALLENGER CENTER FOR SPACE SCIENCE TECHNOLOGY INC" "ALASKA EDUCATORS HISTORICAL SOCIETY" ...
##  $ Alternate.Name                          : chr  "" "" "" "" ...
##  $ Museum.Type                             : chr  "HISTORY MUSEUM" "ARBORETUM, BOTANICAL GARDEN, OR NATURE CENTER" "SCIENCE & TECHNOLOGY MUSEUM OR PLANETARIUM" "HISTORIC PRESERVATION" ...
##  $ Institution.Name                        : chr  "" "" "" "" ...
##  $ Street.Address..Administrative.Location.: chr  "4721 AIRCRAFT DR" "4601 CAMPBELL AIRSTRIP RD" "9711 KENAI SPUR HWY" "214 BIRCH STREET" ...
##  $ City..Administrative.Location.          : chr  "ANCHORAGE" "ANCHORAGE" "KENAI" "KENAI" ...
##  $ State..Administrative.Location.         : chr  "AK" "AK" "AK" "AK" ...
##  $ Zip.Code..Administrative.Location.      : chr  "99502" "99507" "99611" "99611" ...
##  $ Street.Address..Physical.Location.      : chr  "" "" "" "" ...
##  $ City..Physical.Location.                : chr  "" "" "" "" ...
##  $ State..Physical.Location.               : chr  "" "" "" "" ...
##  $ Zip.Code..Physical.Location.            : int  NA NA NA NA NA NA 99508 NA 99519 NA ...
##  $ Phone.Number                            : chr  "9072485325" "9077703692" "9072832000" "2142472478" ...
##  $ Latitude                                : num  61.2 61.2 60.6 60.6 61.2 ...
##  $ Longitude                               : num  -150 -150 -151 -151 -150 ...
##  $ Locale.Code..NCES.                      : int  1 4 3 3 1 1 1 3 1 2 ...
##  $ County.Code..FIPS.                      : int  20 20 122 122 20 20 20 110 20 90 ...
##  $ State.Code..FIPS.                       : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ Region.Code..AAM.                       : int  6 6 6 6 6 6 6 6 6 6 ...
##  $ Employer.ID.Number                      : chr  "920071852" "920115504" "921761906" "920165178" ...
##  $ Tax.Period                              : int  201312 201312 201312 201412 201312 NA 201312 201312 201406 201412 ...
##  $ Income                                  : num  602912 1379576 740030 0 602912 ...
##  $ Revenue                                 : num  550236 1323742 729080 0 550236 ...

Awesome! So now we know that there are 33,072 observations with 25 columns to inspect. The column names are a bit messy, so let’s change them first to make them readable.

2.3 Changing Column Names

colnames(maz)
##  [1] "Museum.ID"                               
##  [2] "Museum.Name"                             
##  [3] "Legal.Name"                              
##  [4] "Alternate.Name"                          
##  [5] "Museum.Type"                             
##  [6] "Institution.Name"                        
##  [7] "Street.Address..Administrative.Location."
##  [8] "City..Administrative.Location."          
##  [9] "State..Administrative.Location."         
## [10] "Zip.Code..Administrative.Location."      
## [11] "Street.Address..Physical.Location."      
## [12] "City..Physical.Location."                
## [13] "State..Physical.Location."               
## [14] "Zip.Code..Physical.Location."            
## [15] "Phone.Number"                            
## [16] "Latitude"                                
## [17] "Longitude"                               
## [18] "Locale.Code..NCES."                      
## [19] "County.Code..FIPS."                      
## [20] "State.Code..FIPS."                       
## [21] "Region.Code..AAM."                       
## [22] "Employer.ID.Number"                      
## [23] "Tax.Period"                              
## [24] "Income"                                  
## [25] "Revenue"
maz <- setNames(maz , c( "Museum_ID", "Museum_Name", "Legal_Name", "Alternate_Name", "Museum_Type", "Institution_Name", "Administrative_Street_Address", "Administrative_City", "Administrative_State", "Administrative_Zipcode", "Physical_Street_Name", "Physical_City", "Physical_State", "Physical_Zipcode", "Phone", "Latitude", "Longitude", "Locale_Code_NCES", "County_Code_FIPS", "State_Code_FIPS", "Region-Code_AAM", "Employer_ID", "Tax_Period", "Income", "Revenue" ) )
maz

So much better! We notice that several columns carry basically the same information. What are we going to do with them? On to the next step!
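As an aside on the renaming step in 2.3: typing 25 names by hand works, but it is easy to slip (note the hyphen in "Region-Code_AAM"). Below is a minimal sketch of doing most of the cleanup with a regular expression instead. It only uses a few of the original names as sample input, and the names it produces differ from the hand-picked ones used in this report, so treat it as an alternative approach, not as what was actually run.

```r
# A few of the original column names, copied from the colnames() output
raw_names <- c("Museum.ID",
               "Street.Address..Administrative.Location.",
               "Locale.Code..NCES.",
               "Income")

# Collapse every run of dots into a single underscore ...
clean_names <- gsub("\\.+", "_", raw_names)
# ... then drop the underscore left behind by a trailing dot
clean_names <- sub("_$", "", clean_names)

print(clean_names)
```

The same two calls could be applied to `names(maz)` directly; any column that needs a shorter, friendlier name can still be renamed by hand afterwards.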

2.4 Defining Columns Needed and Checking NA (Missing Values)

Let’s make a new data frame that consists of the relevant columns.

newmaz <- maz[ c("Museum_ID", "Museum_Name", "Museum_Type", "Administrative_City", "Administrative_State", "Locale_Code_NCES", "Region-Code_AAM", "Tax_Period",  "Income", "Revenue")]
newmaz
dim(newmaz)
## [1] 33072    10
33072 * 5 / 100
## [1] 1653.6

We then check the columns for missing values and focus on the columns that have at least 95% of their data available, which means at most 1,653.6 missing values (5% of 33,072) per column. Columns with many NA values are disregarded.

Most of Institution_Name is empty, so we dropped this column. County_Code_FIPS has more than 1,653.6 missing values, so we dropped it too. State_Code_FIPS was also dropped because it is already represented by Administrative_State.
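The 5% rule described above can also be checked programmatically. The sketch below uses a small made-up data frame (`toy`, with invented values) rather than the museum data, and computes the share of NA per column before keeping only the columns under the threshold.

```r
# Toy data frame standing in for the museum data (values are made up)
toy <- data.frame(
  id     = 1:40,
  code   = c(NA, 2:40),             # 1 of 40 missing  -> 2.5%
  income = c(rep(NA, 12), 13:40)    # 12 of 40 missing -> 30%
)

na_share  <- colMeans(is.na(toy))   # fraction of NA per column
threshold <- 0.05                   # the 5% rule from the text

# Names of the columns that pass the rule
keep_cols <- names(na_share)[na_share <= threshold]

print(round(na_share, 3))
print(keep_cols)
```

Working with shares instead of raw counts avoids having to recompute the cutoff (1,653.6 here) whenever the number of rows changes.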

colSums(is.na(newmaz))
##            Museum_ID          Museum_Name          Museum_Type 
##                    0                    0                    0 
##  Administrative_City Administrative_State     Locale_Code_NCES 
##                    0                    0                   77 
##      Region-Code_AAM           Tax_Period               Income 
##                    0                 9792                10111 
##              Revenue 
##                10782

Tax_Period, Income, and Revenue have many missing values (NA). We are going to assume that this is because the data have not been entered into the system yet. Because we are going to inspect the Tax_Period, Income, and Revenue columns, we will separate out the complete rows into a clean data frame.

2.5 Changing Data Types

str(newmaz)
## 'data.frame':    33072 obs. of  10 variables:
##  $ Museum_ID           : num  8.4e+09 8.4e+09 8.4e+09 8.4e+09 8.4e+09 ...
##  $ Museum_Name         : chr  "ALASKA AVIATION HERITAGE MUSEUM" "ALASKA BOTANICAL GARDEN" "ALASKA CHALLENGER CENTER FOR SPACE SCIENCE TECHNOLOGY" "ALASKA EDUCATORS HISTORICAL SOCIETY" ...
##  $ Museum_Type         : chr  "HISTORY MUSEUM" "ARBORETUM, BOTANICAL GARDEN, OR NATURE CENTER" "SCIENCE & TECHNOLOGY MUSEUM OR PLANETARIUM" "HISTORIC PRESERVATION" ...
##  $ Administrative_City : chr  "ANCHORAGE" "ANCHORAGE" "KENAI" "KENAI" ...
##  $ Administrative_State: chr  "AK" "AK" "AK" "AK" ...
##  $ Locale_Code_NCES    : int  1 4 3 3 1 1 1 3 1 2 ...
##  $ Region-Code_AAM     : int  6 6 6 6 6 6 6 6 6 6 ...
##  $ Tax_Period          : int  201312 201312 201312 201412 201312 NA 201312 201312 201406 201412 ...
##  $ Income              : num  602912 1379576 740030 0 602912 ...
##  $ Revenue             : num  550236 1323742 729080 0 550236 ...
newmaz[ , c("Museum_Type", "Administrative_City", "Administrative_State", "Locale_Code_NCES", "Region-Code_AAM")] <-  lapply(newmaz[ , c("Museum_Type", "Administrative_City", "Administrative_State", "Locale_Code_NCES",  "Region-Code_AAM")], as.factor)
newmaz$Tax_Period <- as.numeric(newmaz$Tax_Period)  
options(scipen=999)
str(newmaz)
## 'data.frame':    33072 obs. of  10 variables:
##  $ Museum_ID           : num  8400200098 8400200117 8400200153 8400200143 8400200027 ...
##  $ Museum_Name         : chr  "ALASKA AVIATION HERITAGE MUSEUM" "ALASKA BOTANICAL GARDEN" "ALASKA CHALLENGER CENTER FOR SPACE SCIENCE TECHNOLOGY" "ALASKA EDUCATORS HISTORICAL SOCIETY" ...
##  $ Museum_Type         : Factor w/ 9 levels "ARBORETUM, BOTANICAL GARDEN, OR NATURE CENTER",..: 6 1 8 5 6 5 4 5 4 6 ...
##  $ Administrative_City : Factor w/ 8621 levels "400 NEVIN AVE",..: 154 154 3810 3810 154 154 154 1983 154 2425 ...
##  $ Administrative_State: Factor w/ 51 levels "AK","AL","AR",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Locale_Code_NCES    : Factor w/ 4 levels "1","2","3","4": 1 4 3 3 1 1 1 3 1 2 ...
##  $ Region-Code_AAM     : Factor w/ 6 levels "1","2","3","4",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ Tax_Period          : num  201312 201312 201312 201412 201312 ...
##  $ Income              : num  602912 1379576 740030 0 602912 ...
##  $ Revenue             : num  550236 1323742 729080 0 550236 ...
colSums(is.na(newmaz))
##            Museum_ID          Museum_Name          Museum_Type 
##                    0                    0                    0 
##  Administrative_City Administrative_State     Locale_Code_NCES 
##                    0                    0                   77 
##      Region-Code_AAM           Tax_Period               Income 
##                    0                 9792                10111 
##              Revenue 
##                10782

Okay, so Locale_Code_NCES still has missing values. We are going to categorize them as ‘0’ instead of dropping the rows.

newmaz$Locale_Code_NCES <- as.character(newmaz$Locale_Code_NCES)
newmaz$Locale_Code_NCES <-  replace(newmaz$Locale_Code_NCES, is.na(newmaz$Locale_Code_NCES), 0)
newmaz$Locale_Code_NCES <- as.factor(newmaz$Locale_Code_NCES)
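The detour through `as.character()` above matters because a factor only accepts values that are already among its levels. A small sketch on a made-up vector (`locale` is illustrative, not the real column) shows the difference:

```r
locale <- factor(c("1", "4", NA, "3"))

# Assigning a value that is not an existing level does not work:
direct <- locale
suppressWarnings(direct[is.na(direct)] <- "0")  # R warns about an invalid factor level
print(anyNA(direct))                            # the NA is still there

# Going through character first lets "0" become a proper level
as_chr <- as.character(locale)
as_chr[is.na(as_chr)] <- "0"
filled <- factor(as_chr)

print(levels(filled))
print(anyNA(filled))
```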
#clean data frame with drop_na
clean_newmaz <- newmaz %>% 
  drop_na(Tax_Period, Income, Revenue)
colSums(is.na(clean_newmaz))
##            Museum_ID          Museum_Name          Museum_Type 
##                    0                    0                    0 
##  Administrative_City Administrative_State     Locale_Code_NCES 
##                    0                    0                    0 
##      Region-Code_AAM           Tax_Period               Income 
##                    0                    0                    0 
##              Revenue 
##                    0
str(clean_newmaz)
## 'data.frame':    22290 obs. of  10 variables:
##  $ Museum_ID           : num  8400200098 8400200117 8400200153 8400200143 8400200027 ...
##  $ Museum_Name         : chr  "ALASKA AVIATION HERITAGE MUSEUM" "ALASKA BOTANICAL GARDEN" "ALASKA CHALLENGER CENTER FOR SPACE SCIENCE TECHNOLOGY" "ALASKA EDUCATORS HISTORICAL SOCIETY" ...
##  $ Museum_Type         : Factor w/ 9 levels "ARBORETUM, BOTANICAL GARDEN, OR NATURE CENTER",..: 6 1 8 5 6 4 5 4 6 4 ...
##  $ Administrative_City : Factor w/ 8621 levels "400 NEVIN AVE",..: 154 154 3810 3810 154 154 1983 154 2425 154 ...
##  $ Administrative_State: Factor w/ 51 levels "AK","AL","AR",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Locale_Code_NCES    : Factor w/ 5 levels "0","1","2","3",..: 2 5 4 4 2 2 4 2 3 2 ...
##  $ Region-Code_AAM     : Factor w/ 6 levels "1","2","3","4",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ Tax_Period          : num  201312 201312 201312 201412 201312 ...
##  $ Income              : num  602912 1379576 740030 0 602912 ...
##  $ Revenue             : num  550236 1323742 729080 0 550236 ...

Now that the data is all clean, we can determine our Business Problems.

3 Business Problems

  1. How many Museum types are there in the US?
  2. How are Museums, Aquariums, and Zoos distributed across the US?
  3. Where are these Museums located based on the Locale Code?
  4. Which Museums, Aquariums, and Zoos are the most profitable of all?
  5. Which Museums, Aquariums, and Zoos are the fans’ favorites according to their revenue?
  6. Which Museums, Aquariums, and Zoos report the most recent and the oldest tax periods?
  7. Which states received the highest revenues?
  8. Which cities received the highest income through Museums, Aquariums, and Zoos?

4 Exploratory Data Analysis

4.1 Brief Summary

dim(clean_newmaz)
## [1] 22290    10
summary(clean_newmaz)
##    Museum_ID          Museum_Name       
##  Min.   :8400100002   Length:22290      
##  1st Qu.:8401800036   Class :character  
##  Median :8403300242   Mode  :character  
##  Mean   :8403290558                     
##  3rd Qu.:8404201578                     
##  Max.   :8409504380                     
##                                         
##                                         Museum_Type      Administrative_City
##  HISTORIC PRESERVATION                        :12382   NEW YORK    :  182   
##  GENERAL MUSEUM                               : 4029   CHICAGO     :  141   
##  HISTORY MUSEUM                               : 2029   WASHINGTON  :  137   
##  ART MUSEUM                                   : 1844   PHILADELPHIA:  124   
##  ARBORETUM, BOTANICAL GARDEN, OR NATURE CENTER:  742   PORTLAND    :   99   
##  SCIENCE & TECHNOLOGY MUSEUM OR PLANETARIUM   :  412   HOUSTON     :   98   
##  (Other)                                      :  852   (Other)     :21509   
##  Administrative_State Locale_Code_NCES Region-Code_AAM   Tax_Period    
##  CA     : 1713        0:  33           1:2148          Min.   :200212  
##  NY     : 1599        1:6179           2:3958          1st Qu.:201312  
##  TX     : 1216        2:5271           3:4258          Median :201312  
##  PA     : 1214        3:3770           4:5249          Mean   :201350  
##  OH     :  997        4:7037           5:3269          3rd Qu.:201409  
##  IL     :  873                         6:3408          Max.   :201504  
##  (Other):14678                                                         
##      Income               Revenue          
##  Min.   :          0   Min.   :  -2127393  
##  1st Qu.:          0   1st Qu.:         0  
##  Median :       5272   Median :      3307  
##  Mean   :  109962253   Mean   :  20976047  
##  3rd Qu.:     203810   3rd Qu.:    167696  
##  Max.   :83181439574   Max.   :5840349457  
## 

Through the brief Summary, we are able to know that:

  1. For this project, there are 22,290 observations with 10 columns to look at.
  2. By Administrative City, New York holds more Museums, Aquariums, and Zoos than any other city.
  3. State-wise, California has the most Museums, Aquariums, and Zoos.
  4. There are 5 categories of Locale Code. Previously we determined that 0 means unknown. According to the NCES locale codes used here, 1 means Large City, 2 means Midsize City, 3 means Urban Fringe of a Large City, and 4 means Urban Fringe of a Midsize City. Under that mapping, most of the Museums are located in the Urban Fringe of a Midsize City.
  5. The American Alliance of Museums (AAM) divides the US into 6 regions: Association of Midwest Museums, Mid-Atlantic Association of Museums, Mountain Plains Museums Association, New England Museum Association, Southeastern Museums Conference, and Western Museums Association. However, the meaning of each number is not provided in the dataset. If we assume the numbers follow the order in which the regions appear on the AAM website, Region 4, the New England Museum Association, has the largest number of museums.
  6. Tax_Period is a filing period written as YYYYMM (for example, 201312 is December 2013), not an amount of money. Its quartiles show that the middle half of the records falls between December 2013 and September 2014, with the oldest record from December 2002 and the newest from April 2015.
  7. Interestingly, Income and Revenue both have a Q1 of 0, and the minimum value of Revenue is even negative: -2,127,393!

Now that we have a bigger picture of our dataset, let’s answer our Business Problems!

5 Data Transformation

5.1 Museum Types in the US?

types <- as.data.frame(table(clean_newmaz$Museum_Type))
types <- setNames(types, c("Museum_Types", "Freq"))
types <- types[order(x = types$Freq, decreasing = T), ]
types

There are 9 types of Museums. The most common is “HISTORIC PRESERVATION” and the least common is “NATURAL HISTORY MUSEUM”.

barplot(types$Freq ~ types$Museum_Types,
        col= rainbow(45),
        ylim=c(0,14000),
        names.arg=c("ABG", "AM", "CM", "GM", "HP", "HM", "NHM", "STM", "ZA" ))

What about the percentages?

perc <- as.data.frame(prop.table(xtabs(formula = Freq ~ Museum_Types, data = types))*100)
perc <- perc[order(x = perc$Freq, decreasing = T), ]
perc$Freq <- round(x = perc$Freq, digits = 2)
perc
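As a worked check of the percentage step, the sketch below redoes the arithmetic on the counts printed by `summary(clean_newmaz)` in section 4.1. Only six types are listed individually there, so the remaining three are kept together as "(Other)"; the vector `counts` is typed in by hand for illustration.

```r
# Counts copied from the summary() output in 4.1
counts <- c("HISTORIC PRESERVATION"                         = 12382,
            "GENERAL MUSEUM"                                = 4029,
            "HISTORY MUSEUM"                                = 2029,
            "ART MUSEUM"                                    = 1844,
            "ARBORETUM, BOTANICAL GARDEN, OR NATURE CENTER" = 742,
            "SCIENCE & TECHNOLOGY MUSEUM OR PLANETARIUM"    = 412,
            "(Other)"                                       = 852)

# The counts should add up to the 22,290 rows of clean_newmaz
stopifnot(sum(counts) == 22290)

# prop.table() divides each count by the total; scale to percent and round
pct <- round(prop.table(counts) * 100, 2)
print(pct)
```

The first entry comes out as 55.55, which is the share of HISTORIC PRESERVATION quoted in the conclusions.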

5.2 Are There More Museums than Aquariums and Zoos?

But what if we categorize them into Museums versus Aquariums and Zoos? We propose that the museum types holding living, breathing things (animals, plants) be included in the Aquariums and Zoos category.

types$sub_types[types$Museum_Types=="HISTORIC PRESERVATION"]<-"Museums"
types$sub_types[types$Museum_Types=="GENERAL MUSEUM"] <- "Museums"
types$sub_types[types$Museum_Types=="HISTORY MUSEUM"] <- "Museums"
types$sub_types[types$Museum_Types=="ART MUSEUM"] <- "Museums"
types$sub_types[types$Museum_Types=="ARBORETUM, BOTANICAL GARDEN, OR NATURE CENTER"]<-"Aquariums, and Zoos"
types$sub_types[types$Museum_Types=="SCIENCE & TECHNOLOGY MUSEUM OR PLANETARIUM"] <- "Museums"
types$sub_types[types$Museum_Types=="CHILDREN'S MUSEUM"] <- "Museums"
types$sub_types[types$Museum_Types=="ZOO, AQUARIUM, OR WILDLIFE CONSERVATION"] <- "Aquariums, and Zoos"
types$sub_types[types$Museum_Types=="NATURAL HISTORY MUSEUM"] <- "Museums"

types$sub_types <-  as.factor(types$sub_types)
bar_1 <- barplot(xtabs(Freq ~ sub_types, types), 
                 col = c("skyblue3", "slateblue4"),
                 horiz = F,
                 las = 1, 
                 main = "Aquariums, and Zoos vs Museums in The US",
                 sub = "dataset: IMLS 2014",
                 ylim=c(0,25000),
                 #xlab = "Museums, Aquariums, and Zoo in the US",
                 ylab = "Quantities")

text(bar_1, xtabs(Freq ~ sub_types, types) + 0.4, 
     paste("n: ", xtabs(Freq ~ sub_types, types), sep = ""), 
     cex = 2, 
     col = "Black")

As we can see, Museums still dominate the country with 21,256 records, while 1,034 are considered Aquariums and Zoos.
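The nine assignments used for the grouping can be written more compactly with a membership test. This is a sketch on a short made-up vector of types (`museum_types`), using the same rule as above: types with living things go to "Aquariums, and Zoos", everything else to "Museums".

```r
# A few type labels as they appear in the data
museum_types <- c("ART MUSEUM",
                  "ZOO, AQUARIUM, OR WILDLIFE CONSERVATION",
                  "HISTORY MUSEUM",
                  "ARBORETUM, BOTANICAL GARDEN, OR NATURE CENTER")

# Types that hold living things (animals, plants)
living <- c("ARBORETUM, BOTANICAL GARDEN, OR NATURE CENTER",
            "ZOO, AQUARIUM, OR WILDLIFE CONSERVATION")

# One vectorized rule instead of one assignment per type
sub_types <- ifelse(museum_types %in% living, "Aquariums, and Zoos", "Museums")

print(data.frame(museum_types, sub_types))
print(table(sub_types))
```

Keeping the "living" types in one vector also makes the proposed rule easy to change later without touching the rest of the code.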

Which State Has the Most Museums?

sm_count <- as_tibble(table(clean_newmaz[ c("Museum_Type", "Administrative_State")]))
sm_count <- sm_count[order(x = sm_count$n, decreasing = T), ]
head(sm_count, n= 5)

We can see that the largest single count belongs to New York with 898 museums, and that is just for the Historic Preservation type.
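One thing to keep in mind with this table is that it ranks state-and-type combinations, not whole states, and the two rankings can disagree (the summary in 4.1 shows California with the most records overall). The sketch below illustrates the difference on a tiny made-up data frame (`toy`); the states and types in it are invented.

```r
# Made-up records: one row per museum
toy <- data.frame(
  state = c("NY", "NY", "NY", "CA", "CA", "CA", "CA"),
  type  = c("HP", "HP", "HP", "HP", "ART", "GM", "GM")
)

# Ranking of state-and-type combinations (what the table above does)
cells <- as.data.frame(table(toy), stringsAsFactors = FALSE)
cells <- cells[order(cells$Freq, decreasing = TRUE), ]
top_cell <- cells[1, ]

# Ranking of states over all types
totals <- sort(table(toy$state), decreasing = TRUE)

print(top_cell)
print(totals)
```

In this toy example the largest single cell is NY with type HP, while CA has more museums in total.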

5.3 The Type of Location Where These Museums Are Located

locode <- as.data.frame(table(clean_newmaz$Locale_Code_NCES))
locode <- set_names(locode, c("Locale_Code", "Freq"))
locode <- locode[order(x = locode$Freq, decreasing = T), ]
locode

As we can see here, 33 Museums have a missing Locale Code value, which we labeled 0. According to the NCES locale codes, the digits mean the following:

  • 1 means Large City,

  • 2 means Midsize City,

  • 3 means Urban Fringe of a Large City, and

  • 4 means Urban Fringe of a Midsize City

We can then conclude that most of these Museums are located in places considered the Urban Fringe of a Midsize City.

5.4 The Average, Highest, and Lowest Income They Received

summary(clean_newmaz$Income)
##        Min.     1st Qu.      Median        Mean     3rd Qu.        Max. 
##           0           0        5272   109962253      203810 83181439574

The average income received by the MAZ in the United States is around $109,962,253. But the median is drastically different at only $5,272. This is likely because the distribution is extremely skewed: the highest income is $83,181,439,574 while the lowest is $0, so a handful of very large values pull the mean far above the median.
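To see how a single extreme value can separate the mean from the median, here is a small sketch on invented numbers (the vector `income` is not taken from the dataset):

```r
# Five ordinary values and one extreme one (all made up)
income <- c(0, 0, 5000, 10000, 200000, 80000000000)

print(mean(income))     # dominated by the extreme value
print(median(income))   # barely affected by it

# Mean of the same data without the largest value
trimmed <- mean(income[income < max(income)])
print(trimmed)
```

This is why the median is usually the better summary of a "typical" museum here, and why the boxplot of Income is squashed by its outliers.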

boxplot(clean_newmaz$Income)

By Museum Type

mtype_income <- as.data.frame(xtabs(clean_newmaz$Revenue ~ clean_newmaz$Museum_Type))
mtype_income <- mtype_income[order(x = mtype_income$Freq, decreasing = T), ]
mtype_income

Because the cross-tabulation above sums Revenue, what it shows is that Art Museum is the type of museum with the highest total revenue in this data.

Museums with the Highest Income

top_income <-  clean_newmaz[clean_newmaz$Income == max(clean_newmaz$Income),]
count(top_income)

There are 20 museums that share the highest Income value. Which states do they belong to? Is it going to be New York…?

States with the highest income

inct20 <-  as.data.frame(xtabs(top_income$Income ~ top_income$Administrative_State))
inct20[inct20$Freq >0, ]

Apparently, the museums with the highest income come from only 2 states: AZ and MA.

These Museums received 0 income…

low_income <-  clean_newmaz[clean_newmaz$Income == min(clean_newmaz$Income),]
count(low_income)

There are 10,733 Museums that report an income of 0. This could mean that there are 10,733 non-profit museums across the United States.

5.5 Fan’s Favorite Museums and its Types (Based on Revenue)

summary(clean_newmaz$Revenue)
##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
##   -2127393          0       3307   20976047     167696 5840349457
minimum_rev <- clean_newmaz[clean_newmaz$Revenue == min(clean_newmaz$Revenue),]
minimum_rev$Museum_Name
## [1] "SHERIDAN HERTIAGE CENTER"

Wow, some museums have a NEGATIVE revenue. Apparently SHERIDAN HERTIAGE CENTER has the lowest revenue of all the Museums.

Let’s find the museums with the highest revenue.

revt20 <- clean_newmaz[clean_newmaz$Revenue == max(clean_newmaz$Revenue),]
revt20

There are also 20 museums that share the highest Revenue. Let’s check whether they are the same museums as the 20 with the highest Income.

unique(top_income$Museum_Name == revt20$Museum_Name)
## [1] TRUE

And the answer is yes, they are the same. Here is the list of the Museum names:

unique(revt20$Museum_Name)
##  [1] "FRED LAWRENCE WHIPPLE OBSERVATORY"                             
##  [2] "ARNOLD ARBORETUM OF HARVARD UNIVERSITY JAMAICA PLAIN"          
##  [3] "AUTHUR M. SACKLER MUSEUM"                                      
##  [4] "BUSCH-REISINGER MUSEUM"                                        
##  [5] "CENTER FOR CONSERVATION AND TECHNICAL STUDIES"                 
##  [6] "COLLECTION OF SCIENTIFIC INSTRUMENTS"                          
##  [7] "FISHER MUSEUM"                                                 
##  [8] "FOGG ART MUSEUM"                                               
##  [9] "GENERAL ARTEMAS WARD HOUSE"                                    
## [10] "HARVARD FOREST"                                                
## [11] "HARVARD UNIVERSITY ART MUSEUMS"                                
## [12] "HARVARD UNIVERSITY BOTANICAL MUSEUM"                           
## [13] "HARVARD UNIVERSITY HERBARIA"                                   
## [14] "HARVARD UNIVERSITY MINERALOGICAL AND GEOLOGICAL MUSEUM"        
## [15] "HARVARD UNIVERSITY MUSEUM OF COMPARATIVE ZOOLOGY"              
## [16] "HARVARD UNIVERSITY MUSEUMS OF NATURAL HISTORY"                 
## [17] "HARVARD UNIVERSITY PEABODY MUSEUM OF ARCHAEOLOGY AND ETHNOLOGY"
## [18] "HARVARD-SMITHSONIAN CENTER FOR ASTROPHYSICS"                   
## [19] "SEMITIC MUSEUM"                                                
## [20] "WARREN ANATOMICAL MUSEUM"

5.6 Now let’s talk about Tax Periods…

summary(clean_newmaz$Tax_Period)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  200212  201312  201312  201350  201409  201504
summary(newmaz$Tax_Period)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  199906  201312  201312  201347  201408  201504    9792

Here we compare the data that still has NA values with the clean version. The five-number summaries look almost alike, with slight differences in the minimum, mean, and Q3 values. We can also see that 9,792 museums have no tax period recorded yet.

hightaxes <- clean_newmaz[clean_newmaz$Tax_Period == 201504 , ]
hightaxes
lowtaxes <- clean_newmaz[clean_newmaz$Tax_Period == 200212 , ]
lowtaxes

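It turns out that BUFFALO SOLDIERS OF THE ARIZONA TERRITORY, located in AZ, reports the most recent tax period (201504, i.e. April 2015), while NEW BEDFORD MUSEUM OF GLASS, located in MA, reports the oldest one (200212, i.e. December 2002). Note that Tax_Period is a date code in YYYYMM form, so it tells us which period the figures belong to, not how much tax was paid.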
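Since the Tax_Period values look like YYYYMM codes, they can be split into a year and a month with integer arithmetic. The sketch below uses three values that appear in the output above; the variable names are only illustrative.

```r
# Three Tax_Period values seen in the data
tax_period <- c(201312L, 201504L, 200212L)

tax_year  <- tax_period %/% 100L   # integer division keeps the first four digits
tax_month <- tax_period %% 100L    # the remainder is the last two digits

# Build a proper Date, using the first day of the month as a placeholder day
tax_date <- as.Date(paste0(tax_period, "01"), format = "%Y%m%d")

print(data.frame(tax_period, tax_year, tax_month, tax_date))
```

With a real Date column, the records can be sorted or filtered by filing period without comparing raw six-digit numbers.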

5.7 Revenue Wise These States Received …

# formula passed positionally so the call is portable across R versions
state_revenue <- aggregate(Revenue ~ Administrative_State, data = clean_newmaz, FUN = sum)
state_revenue <- state_revenue[order(state_revenue$Revenue, decreasing = T), ]
head(state_revenue, n = 5)

Out of all the states in the US, Massachusetts received the largest revenue from its Museums with $122,543,085,157, followed by California, New York, Connecticut, and Illinois.

tail(state_revenue, n = 5)

On the other hand, North Dakota received the smallest revenue with a total of $11,360,290, followed by Wyoming, Nevada, Alaska, and New Mexico.
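The sum-then-sort pattern used for the state totals can be tried out on a small made-up data frame. In this sketch `toy`, its states, and its revenue figures are all invented for illustration.

```r
# Made-up revenue records
toy <- data.frame(
  state   = c("MA", "MA", "CA", "ND", "CA"),
  revenue = c(100, 50, 120, 5, 10)
)

# Sum revenue within each state, then sort from largest to smallest
by_state <- aggregate(revenue ~ state, data = toy, FUN = sum)
by_state <- by_state[order(by_state$revenue, decreasing = TRUE), ]

print(head(by_state, n = 1))   # state with the largest total
print(tail(by_state, n = 1))   # state with the smallest total
```

Because the table is sorted in decreasing order, `tail()` lists the smallest totals with the very smallest on the last row.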

5.8 Cities with the highest and lowest Income through Museums, Aquariums, and Zoos

# formula passed positionally so the call is portable across R versions
city_income <- aggregate(Income ~ Administrative_City, data = clean_newmaz, FUN = sum)
city_income <- city_income[order(city_income$Income, decreasing = T),]
head(city_income, n = 5)
length(unique(clean_newmaz$Administrative_City))
## [1] 7381
count(city_income[city_income$Income == 0, ])

There are 3,336 cities that receive no income from Museums, Aquariums, and Zoos. As previously mentioned, this could mean that, out of 7,381 cities, these 3,336 host only non-profit Museums.

We’ve talked about Income, Revenue, and Tax Period. But how are these three correlated?

cor(clean_newmaz$Income, clean_newmaz$Revenue)
## [1] 0.7999154
ggcorr(clean_newmaz, label = T,)
## Warning in ggcorr(clean_newmaz, label = T, ): data in column(s)
## 'Museum_Name', 'Museum_Type', 'Administrative_City', 'Administrative_State',
## 'Locale_Code_NCES', 'Region-Code_AAM' are not numeric and were ignored

First of all, we can disregard Museum_ID because an identifier carries no meaning here. The correlation between Revenue and Income is strongly positive at 0.8, which means that higher income tends to go with higher revenue. There is no visible correlation between Tax_Period and either Income or Revenue, which is expected since Tax_Period is a date code rather than an amount.
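A caveat worth keeping in mind: with data as skewed as Income and Revenue, the default (Pearson) correlation can be driven largely by a few extreme records. The sketch below shows the effect on invented numbers and compares it with the rank-based Spearman correlation; it is an illustration of the general behaviour, not a re-analysis of the museum data.

```r
# Five ordinary points with a negative relationship, plus one extreme point
x <- c(1, 2, 3, 4, 5, 1000)
y <- c(5, 3, 4, 1, 2, 1000)

pearson_all   <- cor(x, y)                       # dominated by the extreme point
pearson_no_ex <- cor(x[-6], y[-6])               # same data without it
spearman_all  <- cor(x, y, method = "spearman")  # uses ranks, so less sensitive

print(c(pearson_all = pearson_all,
        pearson_no_ex = pearson_no_ex,
        spearman_all = spearman_all))
```

Running `cor(clean_newmaz$Income, clean_newmaz$Revenue, method = "spearman")` next to the Pearson value would be one way to check how much the 0.8 depends on the largest institutions.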

6 Conclusions

As we can see, out of 33,072 observations, 10,782 rows have NA values in Tax_Period, Income, or Revenue, which led us to disregard them. If the data provided were more complete, it would open up more possibilities to explore.

We know that 55.55% of Museums in the United States are categorized as HISTORIC PRESERVATION, and that Museums far outnumber Aquariums and Zoos. We have explored many other things, including the income and revenue that these Museums, Aquariums, and Zoos received and the tax periods they reported. We also now know that some MAZ not only receive less revenue than the others but are actually losing money!

Thank you for taking the time to read my very first RPubs report. I hope you like it, and I am open to any feedback! -gn
