Hello all, and greetings! My name is Gissella Nadya, and welcome to my very first Rmd. This project is my submission to the Algorit.ma Data Science School “Learning By Building” project for P4DS week. I chose Museums, Aquariums and Zoos (MAZ) as the subject because I am very interested in finding out more about them. The data used in this project can be found on Kaggle. Now let’s get digging!
The dataset was compiled in 2014 from Institute of Museum and Library Services (IMLS) administrative records for discretionary grant recipients, IRS records for tax-exempt organizations, and private foundation grant recipients. For this project, we take the point of view of the US government: let’s pretend that we are the United States Federal Government and want to get to know the country’s assets in terms of tourist places such as museums, aquariums, and zoos.
Let’s get to know our dataset further!
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ tibble 3.0.4 ✓ dplyr 1.0.2
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.0
## ✓ purrr 0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
maz <- read.csv("museums.csv")
maz
str(maz)
## 'data.frame': 33072 obs. of 25 variables:
## $ Museum.ID : num 8.4e+09 8.4e+09 8.4e+09 8.4e+09 8.4e+09 ...
## $ Museum.Name : chr "ALASKA AVIATION HERITAGE MUSEUM" "ALASKA BOTANICAL GARDEN" "ALASKA CHALLENGER CENTER FOR SPACE SCIENCE TECHNOLOGY" "ALASKA EDUCATORS HISTORICAL SOCIETY" ...
## $ Legal.Name : chr "ALASKA AVIATION HERITAGE MUSEUM" "ALASKA BOTANICAL GARDEN INC" "ALASKA CHALLENGER CENTER FOR SPACE SCIENCE TECHNOLOGY INC" "ALASKA EDUCATORS HISTORICAL SOCIETY" ...
## $ Alternate.Name : chr "" "" "" "" ...
## $ Museum.Type : chr "HISTORY MUSEUM" "ARBORETUM, BOTANICAL GARDEN, OR NATURE CENTER" "SCIENCE & TECHNOLOGY MUSEUM OR PLANETARIUM" "HISTORIC PRESERVATION" ...
## $ Institution.Name : chr "" "" "" "" ...
## $ Street.Address..Administrative.Location.: chr "4721 AIRCRAFT DR" "4601 CAMPBELL AIRSTRIP RD" "9711 KENAI SPUR HWY" "214 BIRCH STREET" ...
## $ City..Administrative.Location. : chr "ANCHORAGE" "ANCHORAGE" "KENAI" "KENAI" ...
## $ State..Administrative.Location. : chr "AK" "AK" "AK" "AK" ...
## $ Zip.Code..Administrative.Location. : chr "99502" "99507" "99611" "99611" ...
## $ Street.Address..Physical.Location. : chr "" "" "" "" ...
## $ City..Physical.Location. : chr "" "" "" "" ...
## $ State..Physical.Location. : chr "" "" "" "" ...
## $ Zip.Code..Physical.Location. : int NA NA NA NA NA NA 99508 NA 99519 NA ...
## $ Phone.Number : chr "9072485325" "9077703692" "9072832000" "2142472478" ...
## $ Latitude : num 61.2 61.2 60.6 60.6 61.2 ...
## $ Longitude : num -150 -150 -151 -151 -150 ...
## $ Locale.Code..NCES. : int 1 4 3 3 1 1 1 3 1 2 ...
## $ County.Code..FIPS. : int 20 20 122 122 20 20 20 110 20 90 ...
## $ State.Code..FIPS. : int 2 2 2 2 2 2 2 2 2 2 ...
## $ Region.Code..AAM. : int 6 6 6 6 6 6 6 6 6 6 ...
## $ Employer.ID.Number : chr "920071852" "920115504" "921761906" "920165178" ...
## $ Tax.Period : int 201312 201312 201312 201412 201312 NA 201312 201312 201406 201412 ...
## $ Income : num 602912 1379576 740030 0 602912 ...
## $ Revenue : num 550236 1323742 729080 0 550236 ...
Awesome! So now we know that there are 33,072 observations and 25 columns to inspect. The column names are a bit messy, so let’s change them first to make them readable.
colnames(maz)
## [1] "Museum.ID"
## [2] "Museum.Name"
## [3] "Legal.Name"
## [4] "Alternate.Name"
## [5] "Museum.Type"
## [6] "Institution.Name"
## [7] "Street.Address..Administrative.Location."
## [8] "City..Administrative.Location."
## [9] "State..Administrative.Location."
## [10] "Zip.Code..Administrative.Location."
## [11] "Street.Address..Physical.Location."
## [12] "City..Physical.Location."
## [13] "State..Physical.Location."
## [14] "Zip.Code..Physical.Location."
## [15] "Phone.Number"
## [16] "Latitude"
## [17] "Longitude"
## [18] "Locale.Code..NCES."
## [19] "County.Code..FIPS."
## [20] "State.Code..FIPS."
## [21] "Region.Code..AAM."
## [22] "Employer.ID.Number"
## [23] "Tax.Period"
## [24] "Income"
## [25] "Revenue"
maz <- setNames(maz , c( "Museum_ID", "Museum_Name", "Legal_Name", "Alternate_Name", "Museum_Type", "Institution_Name", "Administrative_Street_Address", "Administrative_City", "Administrative_State", "Administrative_Zipcode", "Physical_Street_Name", "Physical_City", "Physical_State", "Physical_Zipcode", "Phone", "Latitude", "Longitude", "Locale_Code_NCES", "County_Code_FIPS", "State_Code_FIPS", "Region-Code_AAM", "Employer_ID", "Tax_Period", "Income", "Revenue" ) )
maz
So much better! We notice that several columns carry basically the same information. What are we going to do with them? On to the next steps!
Let’s make a new data frame that consists of only the relevant columns.
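The renaming above was typed out by hand. As a rough sketch, the same kind of clean-up can be done with pattern replacement. The helper name `clean_names` is hypothetical, and the names it produces are not identical to the hand-written ones used in the rest of this report (for example it gives `City_Administrative_Location` instead of `Administrative_City`).

```r
# Hypothetical helper: turn names such as "City..Administrative.Location."
# into "City_Administrative_Location".
clean_names <- function(x) {
  x <- gsub("\\.+", "_", x)   # collapse each run of dots into one underscore
  x <- gsub("^_|_$", "", x)   # drop a leading or trailing underscore
  x
}

raw_names <- c("Museum.ID", "City..Administrative.Location.", "Region.Code..AAM.")
new_names <- clean_names(raw_names)
print(new_names)
```

On the real data this could be applied as `colnames(maz) <- clean_names(colnames(maz))`, but the later code in this report assumes the hand-written names.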
newmaz <- maz[ c("Museum_ID", "Museum_Name", "Museum_Type", "Administrative_City", "Administrative_State", "Locale_Code_NCES", "Region-Code_AAM", "Tax_Period", "Income", "Revenue")]
newmaz
dim(newmaz)
## [1] 33072 10
33072 * 5 / 100
## [1] 1653.6
We then inspect whether there are missing values within the columns and focus on the columns that have at least 95% of their data available. With 33,072 rows, a column may have at most 1,653.6 missing values (5% of the rows), so we disregard the columns that have many NA values.
We notice that most Institution_Name values are empty, so we drop this column. County_Code_FIPS has more than 1,653.6 missing values, so we drop it too. State_Code_FIPS is also dropped because it is already represented by Administrative_State.
colSums(is.na(newmaz))
## Museum_ID Museum_Name Museum_Type
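The 95% rule described above can also be checked in code rather than by eye. Here is a minimal sketch on a made-up data frame (`toy` and its values are invented for illustration and are not part of the MAZ data): it computes the share of NA per column and keeps the columns with at most 5% missing.

```r
# Toy data frame standing in for maz; the values are made up for illustration.
toy <- data.frame(
  id     = 1:40,
  income = c(rep(NA, 6), 7:40),  # 6 of 40 missing (15%)
  locale = c(NA, 2:40)           # 1 of 40 missing (2.5%)
)

na_share <- colMeans(is.na(toy))              # fraction of NA per column
keep     <- names(na_share)[na_share <= 0.05] # columns passing the 95% rule
print(round(na_share, 3))
print(keep)
```

Note that an empty string `""` is not NA in R, so a mostly empty text column such as Institution_Name would need a separate check (for example `mean(x == "")`).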
## 0 0 0
## Administrative_City Administrative_State Locale_Code_NCES
## 0 0 77
## Region-Code_AAM Tax_Period Income
## 0 9792 10111
## Revenue
## 10782
Tax_Period, Income, and Revenue have many missing values (NA). We are going to assume that this is because the data have not been entered into the system yet. Because we are going to inspect the Tax_Period, Income, and Revenue columns, we will separate out the complete rows into a clean data frame.
str(newmaz)
## 'data.frame': 33072 obs. of 10 variables:
## $ Museum_ID : num 8.4e+09 8.4e+09 8.4e+09 8.4e+09 8.4e+09 ...
## $ Museum_Name : chr "ALASKA AVIATION HERITAGE MUSEUM" "ALASKA BOTANICAL GARDEN" "ALASKA CHALLENGER CENTER FOR SPACE SCIENCE TECHNOLOGY" "ALASKA EDUCATORS HISTORICAL SOCIETY" ...
## $ Museum_Type : chr "HISTORY MUSEUM" "ARBORETUM, BOTANICAL GARDEN, OR NATURE CENTER" "SCIENCE & TECHNOLOGY MUSEUM OR PLANETARIUM" "HISTORIC PRESERVATION" ...
## $ Administrative_City : chr "ANCHORAGE" "ANCHORAGE" "KENAI" "KENAI" ...
## $ Administrative_State: chr "AK" "AK" "AK" "AK" ...
## $ Locale_Code_NCES : int 1 4 3 3 1 1 1 3 1 2 ...
## $ Region-Code_AAM : int 6 6 6 6 6 6 6 6 6 6 ...
## $ Tax_Period : int 201312 201312 201312 201412 201312 NA 201312 201312 201406 201412 ...
## $ Income : num 602912 1379576 740030 0 602912 ...
## $ Revenue : num 550236 1323742 729080 0 550236 ...
# columns to treat as categorical
factor_cols <- c("Museum_Type", "Administrative_City", "Administrative_State", "Locale_Code_NCES", "Region-Code_AAM")
newmaz[ , factor_cols] <- lapply(newmaz[ , factor_cols], as.factor)
newmaz$Tax_Period <- as.numeric(newmaz$Tax_Period)
# print large numbers in full instead of scientific notation
options(scipen=999)
str(newmaz)
## 'data.frame': 33072 obs. of 10 variables:
## $ Museum_ID : num 8400200098 8400200117 8400200153 8400200143 8400200027 ...
## $ Museum_Name : chr "ALASKA AVIATION HERITAGE MUSEUM" "ALASKA BOTANICAL GARDEN" "ALASKA CHALLENGER CENTER FOR SPACE SCIENCE TECHNOLOGY" "ALASKA EDUCATORS HISTORICAL SOCIETY" ...
## $ Museum_Type : Factor w/ 9 levels "ARBORETUM, BOTANICAL GARDEN, OR NATURE CENTER",..: 6 1 8 5 6 5 4 5 4 6 ...
## $ Administrative_City : Factor w/ 8621 levels "400 NEVIN AVE",..: 154 154 3810 3810 154 154 154 1983 154 2425 ...
## $ Administrative_State: Factor w/ 51 levels "AK","AL","AR",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Locale_Code_NCES : Factor w/ 4 levels "1","2","3","4": 1 4 3 3 1 1 1 3 1 2 ...
## $ Region-Code_AAM : Factor w/ 6 levels "1","2","3","4",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ Tax_Period : num 201312 201312 201312 201412 201312 ...
## $ Income : num 602912 1379576 740030 0 602912 ...
## $ Revenue : num 550236 1323742 729080 0 550236 ...
colSums(is.na(newmaz))
## Museum_ID Museum_Name Museum_Type
## 0 0 0
## Administrative_City Administrative_State Locale_Code_NCES
## 0 0 77
## Region-Code_AAM Tax_Period Income
## 0 9792 10111
## Revenue
## 10782
Okay, so we know there are still missing values in Locale_Code_NCES. We are going to categorize these as ‘0’ instead of dropping them.
newmaz$Locale_Code_NCES <- as.character(newmaz$Locale_Code_NCES)
newmaz$Locale_Code_NCES <- replace(newmaz$Locale_Code_NCES, is.na(newmaz$Locale_Code_NCES), 0)
newmaz$Locale_Code_NCES <- as.factor(newmaz$Locale_Code_NCES)
# clean data frame with drop_na()
clean_newmaz <- newmaz %>%
drop_na(Tax_Period, Income, Revenue)
colSums(is.na(clean_newmaz))
## Museum_ID Museum_Name Museum_Type
## 0 0 0
## Administrative_City Administrative_State Locale_Code_NCES
## 0 0 0
## Region-Code_AAM Tax_Period Income
## 0 0 0
## Revenue
## 0
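The recoding of missing Locale_Code_NCES values to '0' went through a character conversion and back. As an alternative sketch (not the approach used in this report), a factor can be given the extra level directly. The vector `locale` below is made up for illustration.

```r
# Made-up locale codes with two missing entries.
locale <- factor(c("1", "4", NA, "3", NA))

# Add "0" as a legal level first, then assign it to the missing entries.
levels(locale) <- c(levels(locale), "0")
locale[is.na(locale)] <- "0"

print(table(locale))
```

One difference to keep in mind: here the new level "0" is placed last, while converting through character and back sorts it first.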
str(clean_newmaz)
## 'data.frame': 22290 obs. of 10 variables:
## $ Museum_ID : num 8400200098 8400200117 8400200153 8400200143 8400200027 ...
## $ Museum_Name : chr "ALASKA AVIATION HERITAGE MUSEUM" "ALASKA BOTANICAL GARDEN" "ALASKA CHALLENGER CENTER FOR SPACE SCIENCE TECHNOLOGY" "ALASKA EDUCATORS HISTORICAL SOCIETY" ...
## $ Museum_Type : Factor w/ 9 levels "ARBORETUM, BOTANICAL GARDEN, OR NATURE CENTER",..: 6 1 8 5 6 4 5 4 6 4 ...
## $ Administrative_City : Factor w/ 8621 levels "400 NEVIN AVE",..: 154 154 3810 3810 154 154 1983 154 2425 154 ...
## $ Administrative_State: Factor w/ 51 levels "AK","AL","AR",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Locale_Code_NCES : Factor w/ 5 levels "0","1","2","3",..: 2 5 4 4 2 2 4 2 3 2 ...
## $ Region-Code_AAM : Factor w/ 6 levels "1","2","3","4",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ Tax_Period : num 201312 201312 201312 201412 201312 ...
## $ Income : num 602912 1379576 740030 0 602912 ...
## $ Revenue : num 550236 1323742 729080 0 550236 ...
Now that the data is clean, we can determine our business problems.
dim(clean_newmaz)
## [1] 22290 10
summary(clean_newmaz)
## Museum_ID Museum_Name
## Min. :8400100002 Length:22290
## 1st Qu.:8401800036 Class :character
## Median :8403300242 Mode :character
## Mean :8403290558
## 3rd Qu.:8404201578
## Max. :8409504380
##
## Museum_Type Administrative_City
## HISTORIC PRESERVATION :12382 NEW YORK : 182
## GENERAL MUSEUM : 4029 CHICAGO : 141
## HISTORY MUSEUM : 2029 WASHINGTON : 137
## ART MUSEUM : 1844 PHILADELPHIA: 124
## ARBORETUM, BOTANICAL GARDEN, OR NATURE CENTER: 742 PORTLAND : 99
## SCIENCE & TECHNOLOGY MUSEUM OR PLANETARIUM : 412 HOUSTON : 98
## (Other) : 852 (Other) :21509
## Administrative_State Locale_Code_NCES Region-Code_AAM Tax_Period
## CA : 1713 0: 33 1:2148 Min. :200212
## NY : 1599 1:6179 2:3958 1st Qu.:201312
## TX : 1216 2:5271 3:4258 Median :201312
## PA : 1214 3:3770 4:5249 Mean :201350
## OH : 997 4:7037 5:3269 3rd Qu.:201409
## IL : 873 6:3408 Max. :201504
## (Other):14678
## Income Revenue
## Min. : 0 Min. : -2127393
## 1st Qu.: 0 1st Qu.: 0
## Median : 5272 Median : 3307
## Mean : 109962253 Mean : 20976047
## 3rd Qu.: 203810 3rd Qu.: 167696
## Max. :83181439574 Max. :5840349457
##
Through the brief Summary, we are able to know that:
Now that we have a bigger picture of our dataset, let’s answer our business problems!
types <- as.data.frame(table(clean_newmaz$Museum_Type))
types <- setNames(types, c("Museum_Types", "Freq"))
types <- types[order(x = types$Freq, decreasing = T), ]
types
There are 9 types of museums. “HISTORIC PRESERVATION” has the highest count and “NATURAL HISTORY MUSEUM” has the lowest.
barplot(types$Freq ~ types$Museum_Types,
col= rainbow(45),
ylim=c(0,14000),
names.arg=c("ABG", "AM", "CM", "GM", "HP", "HM", "NHM", "STM", "ZA" ))
What are the percentages?
perc <- as.data.frame(prop.table(xtabs(formula = Freq ~ Museum_Types, data = types))*100)
perc <- perc[order(x = perc$Freq, decreasing = T), ]
perc$Freq <- round(x = perc$Freq, digits = 2)
perc
But what if we categorize them into Museums, Aquariums, and Zoos? We propose that the types with living things (animals, plants) be included in the Aquariums and Zoos category.
types$sub_types[types$Museum_Types=="HISTORIC PRESERVATION"]<-"Museums"
types$sub_types[types$Museum_Types=="GENERAL MUSEUM"] <- "Museums"
types$sub_types[types$Museum_Types=="HISTORY MUSEUM"] <- "Museums"
types$sub_types[types$Museum_Types=="ART MUSEUM"] <- "Museums"
types$sub_types[types$Museum_Types=="ARBORETUM, BOTANICAL GARDEN, OR NATURE CENTER"]<-"Aquariums, and Zoos"
types$sub_types[types$Museum_Types=="SCIENCE & TECHNOLOGY MUSEUM OR PLANETARIUM"] <- "Museums"
types$sub_types[types$Museum_Types=="CHILDREN'S MUSEUM"] <- "Museums"
types$sub_types[types$Museum_Types=="ZOO, AQUARIUM, OR WILDLIFE CONSERVATION"] <- "Aquariums, and Zoos"
types$sub_types[types$Museum_Types=="NATURAL HISTORY MUSEUM"] <- "Museums"
types$sub_types <- as.factor(types$sub_types)
bar_1 <- barplot(xtabs(Freq ~ sub_types, types),
col = c("skyblue3", "slateblue4"),
horiz = F,
las = 1,
main = "Aquariums, and Zoos vs Museums in The US",
sub = "dataset: IMLS 2014",
ylim=c(0,25000),
#xlab = "Museums, Aquariums, and Zoo in the US",
ylab = "Quantities")
text(bar_1, xtabs(Freq ~ sub_types, types) + 0.4,
paste("n: ", xtabs(Freq ~ sub_types, types), sep = ""),
cex = 2,
col = "Black")
As we can see, museums still dominate in the US with 21,256 institutions, while 1,034 are considered aquariums and zoos.
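The grouping into two categories was written as one assignment per museum type. A shorter sketch of the same idea uses a vector of the "living" types and `ifelse()`. The vector names `museum_types` and `living` are made up for this example, and only four of the nine types are shown.

```r
museum_types <- c("HISTORIC PRESERVATION", "ART MUSEUM",
                  "ZOO, AQUARIUM, OR WILDLIFE CONSERVATION",
                  "ARBORETUM, BOTANICAL GARDEN, OR NATURE CENTER")

# Types that hold living collections; everything else counts as "Museums".
living <- c("ZOO, AQUARIUM, OR WILDLIFE CONSERVATION",
            "ARBORETUM, BOTANICAL GARDEN, OR NATURE CENTER")

sub_types <- ifelse(museum_types %in% living, "Aquariums, and Zoos", "Museums")
print(data.frame(museum_types, sub_types))
```

With this shape, moving a type from one category to the other only means editing the `living` vector.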
Which state has the most museums?
# as_tibble() replaces the deprecated as_data_frame(); the count column is named n
sm_count <- as_tibble(table(clean_newmaz[ c("Museum_Type", "Administrative_State")]))
sm_count <- sm_count[order(x = sm_count$n, decreasing = T), ]
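head(sm_count, n= 5)
Because this table counts each combination of museum type and state, the top row is the largest single group: New York’s Historic Preservation organizations, with a total of 898. Counting all types together, the summary above shows California in front with 1,713 museums, followed by New York with 1,599.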
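Counting per type-and-state pair and counting per state answer two different questions. A small sketch on made-up rows (the data frame `toy` and its abbreviations are invented for illustration) shows how the two can disagree:

```r
toy <- data.frame(
  state = c("NY", "NY", "NY", "CA", "CA", "CA", "CA"),
  type  = c("HP", "HP", "HP", "HP", "AM", "GM", "ZA")
)

by_pair  <- as.data.frame(table(toy$state, toy$type))  # one row per state/type pair
by_state <- sort(table(toy$state), decreasing = TRUE)  # one count per state

top_pair  <- by_pair[which.max(by_pair$Freq), ]  # largest single pair
top_state <- names(by_state)[1]                  # state with most rows overall
print(top_pair)
print(top_state)
```

In this toy example the largest pair belongs to NY, while CA has more rows overall.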
locode <- as.data.frame(table(clean_newmaz$Locale_Code_NCES))
locode <- set_names(locode, c("Locale_Code", "Freq"))
locode <- locode[order(x = locode$Freq, decreasing = T), ]
locode
As we can see here, there are 33 museums with a missing Locale Code value, which we labelled 0. According to the US NCES locale codes, the digits mean the following:
1 means Large City,
2 means Midsize City,
3 means Urban Fringe of a Large City, and
4 means Urban Fringe of a Midsize City.
We can then conclude that most of these museums (7,037) are located in a place that is considered the Urban Fringe of a Midsize City.
summary(clean_newmaz$Income)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 0 5272 109962253 203810 83181439574
The average income received by MAZ in the United States is around $109,962,253. But if we take a look at the median, the difference is drastic: it is only $5,272. This is likely because the distribution is heavily skewed: the highest income is $83,181,439,574, while the lowest is $0!
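To see how a single extreme value can pull the mean far away from the median, here is a tiny sketch with made-up numbers (they are not taken from the dataset):

```r
income <- c(0, 0, 5000, 20000, 200000)
with_outlier <- c(income, 8e10)  # one made-up institution reporting 80 billion

mean_before   <- mean(income)
median_before <- median(income)
mean_after    <- mean(with_outlier)
median_after  <- median(with_outlier)

print(c(mean_before = mean_before, median_before = median_before))
print(c(mean_after = mean_after, median_after = median_after))
```

The median moves only from 5,000 to 12,500, while the mean jumps into the billions, which is the same pattern we see in the Income summary.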
boxplot(clean_newmaz$Income)
According to the Museum Types
mtype_income <- as.data.frame(xtabs(clean_newmaz$Revenue ~ clean_newmaz$Museum_Type))
mtype_income <- mtype_income[order(x = mtype_income$Freq, decreasing = T), ]
mtype_income
Based on the cross-tabulation above, which sums the Revenue column by museum type, we can see that Art Museum is the type of museum that received the highest total.
Museums with the highest income
top_income <- clean_newmaz[clean_newmaz$Income == max(clean_newmaz$Income),]
count(top_income)
There are 20 museums that share the highest income value of $83,181,439,574. Which states do these museums belong to? Is it going to be New York…?
States with the highest income
inct20 <- as.data.frame(xtabs(top_income$Income ~ top_income$Administrative_State))
inct20[inct20$Freq >0, ]
Apparently, the museums with the highest income are located in only 2 states: AZ and MA.
These museums received 0 income…
low_income <- clean_newmaz[clean_newmaz$Income == min(clean_newmaz$Income),]
count(low_income)
There are 10,733 museums that reported 0 income for their tax period. This could mean that these 10,733 museums across the United States operate as non-profits.
summary(clean_newmaz$Revenue)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2127393 0 3307 20976047 167696 5840349457
minimum_rev <- clean_newmaz[clean_newmaz$Revenue == min(clean_newmaz$Revenue),]
minimum_rev$Museum_Name
## [1] "SHERIDAN HERTIAGE CENTER"
Wow, there are some museums that reported negative revenue. SHERIDAN HERTIAGE CENTER reported the lowest revenue of all the museums, at -$2,127,393.
Let’s find the museums with the highest revenue.
revt20 <- clean_newmaz[clean_newmaz$Revenue == max(clean_newmaz$Revenue),]
revt20
There are also 20 museums with the highest revenue. Let’s check whether they are the same museums as the top 20 by income.
unique(top_income$Museum_Name == revt20$Museum_Name)
## [1] TRUE
And the answer is yes, they are the same museums. Here is the list of their names:
unique(revt20$Museum_Name)
## [1] "FRED LAWRENCE WHIPPLE OBSERVATORY"
## [2] "ARNOLD ARBORETUM OF HARVARD UNIVERSITY JAMAICA PLAIN"
## [3] "AUTHUR M. SACKLER MUSEUM"
## [4] "BUSCH-REISINGER MUSEUM"
## [5] "CENTER FOR CONSERVATION AND TECHNICAL STUDIES"
## [6] "COLLECTION OF SCIENTIFIC INSTRUMENTS"
## [7] "FISHER MUSEUM"
## [8] "FOGG ART MUSEUM"
## [9] "GENERAL ARTEMAS WARD HOUSE"
## [10] "HARVARD FOREST"
## [11] "HARVARD UNIVERSITY ART MUSEUMS"
## [12] "HARVARD UNIVERSITY BOTANICAL MUSEUM"
## [13] "HARVARD UNIVERSITY HERBARIA"
## [14] "HARVARD UNIVERSITY MINERALOGICAL AND GEOLOGICAL MUSEUM"
## [15] "HARVARD UNIVERSITY MUSEUM OF COMPARATIVE ZOOLOGY"
## [16] "HARVARD UNIVERSITY MUSEUMS OF NATURAL HISTORY"
## [17] "HARVARD UNIVERSITY PEABODY MUSEUM OF ARCHAEOLOGY AND ETHNOLOGY"
## [18] "HARVARD-SMITHSONIAN CENTER FOR ASTROPHYSICS"
## [19] "SEMITIC MUSEUM"
## [20] "WARREN ANATOMICAL MUSEUM"
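The `==` check above compares the two name vectors position by position, so it only works when both data frames list the museums in the same order and have the same length. A sketch of an order-independent check with `setequal()`, using three of the names as sample input:

```r
a <- c("FOGG ART MUSEUM", "HARVARD FOREST", "SEMITIC MUSEUM")
b <- c("SEMITIC MUSEUM", "FOGG ART MUSEUM", "HARVARD FOREST")  # same names, other order

same_position <- all(a == b)     # FALSE here: positions differ
same_members  <- setequal(a, b)  # TRUE: order is ignored
print(c(same_position = same_position, same_members = same_members))
```

On the real data this would be `setequal(top_income$Museum_Name, revt20$Museum_Name)`.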
summary(clean_newmaz$Tax_Period)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 200212 201312 201312 201350 201409 201504
summary(newmaz$Tax_Period)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 199906 201312 201312 201347 201408 201504 9792
Here we compare the data that still has NA values with the clean version. The summaries look almost alike, with slight differences in the minimum, mean, and Q3 values. We can also see that 9,792 museums have no tax period recorded yet.
hightaxes <- clean_newmaz[clean_newmaz$Tax_Period == 201504 , ]
hightaxes
lowtaxes <- clean_newmaz[clean_newmaz$Tax_Period == 200212 , ]
lowtaxes
Tax_Period is a year-and-month code (YYYYMM), so these two data frames hold the latest and the earliest tax periods rather than tax amounts. It turns out that BUFFALO SOLDIERS OF THE ARIZONA TERRITORY, located in AZ, has the most recent tax period (April 2015), while NEW BEDFORD MUSEUM OF GLASS, located in MA, has the earliest one in the clean data (December 2002).
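Values such as 201312 and 201504 look like year-and-month codes, so, assuming the YYYYMM layout, the year and month can be pulled apart with integer division and remainder. The vector below reuses a few values seen in the output above plus one NA:

```r
tax_period <- c(201312, 201406, 200212, 201504, NA)

tax_year  <- tax_period %/% 100  # 201312 -> 2013
tax_month <- tax_period %% 100   # 201312 -> 12

print(data.frame(tax_period, tax_year, tax_month))
print(range(tax_period, na.rm = TRUE))  # earliest and latest period
```

Splitting the column this way would also make it possible to count museums per tax year with `table(tax_year)`.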
# the formula is passed by position because the name of the first argument
# of aggregate()'s formula method differs between R versions
state_revenue <- aggregate(Revenue ~ Administrative_State, data = clean_newmaz, FUN = sum)
state_revenue <- state_revenue[order(state_revenue$Revenue, decreasing = T), ]
head(state_revenue, n = 5)
Out of all the states in the US, Massachusetts received the largest total revenue from its museums with $122,543,085,157, followed by California, New York, Connecticut, and Illinois.
tail(state_revenue, n = 5)
On the other hand, North Dakota received the least revenue, with a total of $11,360,290, followed by Wyoming, Nevada, Alaska, and New Mexico.
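For readers new to `aggregate()`, here is a small sketch of the same per-group sum on made-up numbers (the data frame `toy` is invented for illustration). The formula is passed by position, which works regardless of how the first argument is named in a given R version.

```r
toy <- data.frame(
  state   = c("MA", "MA", "ND", "CA"),
  revenue = c(100, 50, 5, 70)
)

# Sum revenue within each state, then sort from largest to smallest.
totals <- aggregate(revenue ~ state, data = toy, FUN = sum)
totals <- totals[order(totals$revenue, decreasing = TRUE), ]
print(totals)
```

`head()` on the sorted result then gives the largest totals and `tail()` the smallest, as in the state ranking above.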
# the formula is passed by position because the name of the first argument
# of aggregate()'s formula method differs between R versions
city_income <- aggregate(Income ~ Administrative_City, data = clean_newmaz, FUN = sum)
city_income <- city_income[order(city_income$Income, decreasing = T),]
head(city_income, n = 5)
length(unique(clean_newmaz$Administrative_City))
## [1] 7381
count(city_income[city_income$Income == 0, ])
There are 3,336 cities that receive no income from museums, aquariums, and zoos. As previously mentioned, this could mean that, out of 7,381 cities, these 3,336 host only non-profit museums.
We’ve talked about Income, Revenue and Tax Period. But what is the correlation among these three?
cor(clean_newmaz$Income, clean_newmaz$Revenue)
## [1] 0.7999154
ggcorr(clean_newmaz, label = T,)
## Warning in ggcorr(clean_newmaz, label = T, ): data in column(s)
## 'Museum_Name', 'Museum_Type', 'Administrative_City', 'Administrative_State',
## 'Locale_Code_NCES', 'Region-Code_AAM' are not numeric and were ignored
First of all, we can disregard Museum_ID because it is only an identifier. The correlation between Revenue and Income is strongly positive, with a value of 0.8. This means that museums with higher income tend to have higher revenue as well. We can also conclude that there is no correlation between the tax period and either income or revenue.
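Since Income and Revenue are both heavily skewed, the Pearson correlation reported by `cor()` can be driven by a few very large values. As an extra check that is not part of the analysis above, a rank-based Spearman correlation can be compared with it. The numbers below are made up to exaggerate the effect:

```r
# Five small institutions where revenue falls as income rises,
# plus one made-up giant that is large on both measures.
income  <- c(1, 2, 3, 4, 5, 1000)
revenue <- c(5, 4, 3, 2, 1, 1000)

pearson  <- cor(income, revenue)                       # dominated by the giant
spearman <- cor(income, revenue, method = "spearman")  # based on ranks only
print(c(pearson = pearson, spearman = spearman))
```

Here Pearson is close to 1 while Spearman is slightly negative. On the real data the call would be `cor(clean_newmaz$Income, clean_newmaz$Revenue, method = "spearman")`; what it returns is not something this report has computed.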
As we can see, out of 33,072 observations, 10,782 rows have NA values in Tax_Period, Income, or Revenue, which led us to set them aside. If the data provided were more complete, it would open up more possibilities for exploration.
We know that 55.55% of museums in the United States are categorized as HISTORIC PRESERVATION, and that museums still far outnumber aquariums and zoos. We have explored many other things, including the income and revenue that these museums, aquariums and zoos received and the tax periods they reported. We also now know that some MAZ not only receive less revenue than the others but are actually losing money!
Thank you for taking the time to read my very first RPubs report. I hope you like it, and I am open to feedback! -gn