This project serves as the first phase in a broader study of 501c3 (tax exempt) not-for-profit organizations operating in the state of Vermont. It represents a proof-of-concept for collecting, organizing, and evaluating financial data and related metrics for these organizations. And an initial attempt to identify emergent features in their operations and leadership that, when studied in aggregate, can provide information to understand their status, distribution, resources, and collaborative potential at a regional scale.
The information that informs this research is drawn from annual financial disclosures to the IRS by the 501c3 organizations. These disclosures include FORM 990, FORM 990EZ, FORM 990N, and Form 990PF along with their respective schedules. While these are public documents, options for collating them in a systematic manner (i.e., across multiple organizations by sector, geography, range-of-time, etc.) are limited - absent a fee-for-service platform.
This report is divided into two .rmd files (a and b) and outlines steps for accessing and decomposing 501c3 disclosures via. web-based search queries and scraping techniques - in batch. It also provides initial results of exploratory data analysis using Form 990 information, focusing on cross-sector comparisons, as well as description of challenges inherent to this approach.
The bulk of data collected in this manner were identified using Propublica’s Nonprofit Explorer (https://projects.propublica.org/nonprofits/) and sequentially extracted as unstructured XML pages served by Amazon Web Server. From search to analysis, the following steps are included in the report:
Part a – see project file: Data607_Proj4a_Connin
Part b
R packages/libraries.
Establish file paths for text analysis
#Create collector dataframe for string data extracted from raw 990 text documents
org_990 <- data.frame(matrix(ncol=25))
colnames(org_990)<-c("xml_num","organization", "tax_year", "form", "address", "city", "website", "ein", "start_yr", "mission", "principal_officer", "employee_count", "volunteer_count", "PY_total_revenue", "PY_total_expenses", "PY_net", "CY_total_revenue", "CY_total_expenses", "CY_net", "CY_total_EOY_assets", "CY_total_EOY_liabilities", "CY_fund_balance", "CY_total_salaries", "CY_grants_contributions", "CY_program_service_rev")
for(file in arts_files){
string<-readLines(file, skipNul = TRUE) #we only the lines that have text. Blank lines still counted.
if(length(string)<3) next #skip files that are NA
string <- read_lines(string,skip=16,n_max = 97) #skip lines to target select information
string<-glue_collapse(string, sep="\n")
xml_num<-file%>%str_extract("\\d+")
organization<-string%>%str_extract("(?<=<BusinessNameLine1Txt>).+[:alnum:](?=<)")
tax_year<-string%>%str_extract("(?<=<TaxYr>).+[:alnum:](?=<)")
form<-string%>%str_extract("(?<=<ReturnTypeCd>).+[:alnum:](?=<)")
address<-string%>%str_extract("(?<=<AddressLine1Txt>).+[:alnum:](?=<)")
city<-string%>%str_extract("(?<=<CityNm>).+[:alnum:](?=<)")
website<-string%>%str_extract("(?<=<WebsiteAddressTxt>).+[:alnum:](?=<)")
ein<-string%>%str_extract("(?<=<EIN>).+[:alnum:](?=<)")
start_yr<-string%>%str_extract("(?<=<FormationYr>).+[:alnum:](?=<)")
mission<-string%>%str_extract("(?<=<MissionDesc>).+[:alnum:].(?=<)")
principal_officer<-string%>%str_extract("(?<=<PrincipalOfficerNm>).+[:alnum:](?=<)")
employee_count<-string%>%str_extract("(?<=<TotalEmployeeCnt>).+[:alnum:](?=<)")
volunteer_count<-string%>%str_extract("(?<=<TotalVolunteersCnt>).+[:alnum:](?=<)")
PY_total_revenue<-string%>%str_extract("(?<=<PYTotalRevenueAmt>).+[:alnum:](?=<)")
PY_total_expenses<-string%>%str_extract("(?<=<PYTotalExpensesAmt>).+[:alnum:](?=<)")
PY_net<-string%>%str_extract("(?<=<PYRevenuesLessExpensesAmt>).+[:alnum:](?=<)")
CY_total_revenue<-string%>%str_extract("(?<=<CYTotalRevenueAmt>).+[:alnum:](?=<)")
CY_total_expenses<-string%>%str_extract("(?<=<PYTotalExpensesAmt>).+[:alnum:](?=<)")
CY_net<-string%>%str_extract("(?<=<CYRevenuesLessExpensesAmt>).+[:alnum:](?=<)")
CY_total_EOY_assets<-string%>%str_extract("(?<=<TotalAssetsEOYAmt>).+[:alnum:](?=<)")
CY_total_EOY_liabilities<-string%>%str_extract("(?<=<TotalLiabilitiesEOYAmt>).+[:alnum:](?=<)")
CY_fund_balance<-string%>%str_extract("(?<=<NetAssetsOrFundBalancesEOYAmt>).+[:alnum:](?=<)")
CY_total_salaries<-string%>%str_extract("(?<=<CYSalariesCompEmpBnftPaidAmt>).+[:alnum:](?=<)")
CY_grants_contributions<-string%>%str_extract("(?<=<CYContributionsGrantsAmt>).+[:alnum:](?=<)")
CY_program_service_rev<-string%>%str_extract("(?<=<CYProgramServiceRevenueAmt>).+[:alnum:](?=<)")
df <-data.frame(xml_num, organization, tax_year, form, address, city, website, ein, start_yr, mission, principal_officer, employee_count, volunteer_count, PY_total_revenue, PY_total_expenses, PY_net, CY_total_revenue, CY_total_expenses, CY_net, CY_total_EOY_assets, CY_total_EOY_liabilities, CY_fund_balance, CY_total_salaries, CY_grants_contributions, CY_program_service_rev)%>%trimws()
org_990<-rbind(org_990, df)
}
org_990$xml_num<-as.numeric(org_990$xml_num) # convert from char to dbl for sorting
org_990%<>%arrange(org_990, (xml_num))
View(org_990)
#write_csv(org_990,"arts990.csv")
Create an aggregated dataset of 990 information.
Note: for demonstrative purposes this report focuses on the 990 Form and does not include data from forms: 990EZ, 990PF, 990N. Additional scripts will be required to differentiate and parse these forms. In addition, this report only includes the Section 1 summary data from the 990’s and is limited to the most recent filing year listed for each organization. Future work will focus on extracting information from other 990 sections, schedules, and reporting periods.
The following code chunk highlights organization counts and start-dates by category from the mid-19th century to present.
#create data subset for analysis and limit forms to 2019 or 2018 filing
vtnps<- data.frame(rbind(env, educ, inter, relig, heal, arts, publ))%>%
relocate(category, .after=organization)%>%
filter(form %in% "990")%>%
filter(tax_year==2019 | tax_year==2018)
# show basic stats and df composition
skim(vtnps)
Name | vtnps |
Number of rows | 597 |
Number of columns | 26 |
_______________________ | |
Column type frequency: | |
character | 9 |
numeric | 17 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
organization | 0 | 1.00 | 8 | 70 | 0 | 522 | 0 |
category | 0 | 1.00 | 6 | 23 | 0 | 7 | 0 |
form | 0 | 1.00 | 3 | 3 | 0 | 1 | 0 |
address | 0 | 1.00 | 6 | 35 | 0 | 499 | 0 |
city | 0 | 1.00 | 5 | 20 | 0 | 170 | 0 |
website | 39 | 0.93 | 3 | 44 | 0 | 414 | 0 |
ein | 0 | 1.00 | 9 | 9 | 0 | 524 | 0 |
mission | 195 | 0.67 | 9 | 1000 | 0 | 352 | 0 |
principal_officer | 54 | 0.91 | 6 | 26 | 0 | 460 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
xml_num | 0 | 1.00 | 270.35 | 168.92 | 3 | 132.00 | 250.0 | 390.0 | 679 | <U+2587><U+2587><U+2586><U+2585><U+2582> |
tax_year | 0 | 1.00 | 2018.45 | 0.50 | 2018 | 2018.00 | 2018.0 | 2019.0 | 2019 | <U+2587><U+2581><U+2581><U+2581><U+2586> |
start_yr | 24 | 0.96 | 1985.44 | 32.12 | 1819 | 1975.00 | 1995.0 | 2005.0 | 2019 | <U+2581><U+2581><U+2581><U+2582><U+2587> |
employee_count | 312 | 0.48 | 164.26 | 587.27 | 10 | 18.00 | 30.0 | 92.0 | 8686 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
volunteer_count | 284 | 0.52 | 302.11 | 1303.67 | 10 | 25.00 | 50.0 | 117.0 | 14000 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
PY_total_revenue | 9 | 0.98 | 8927343.18 | 63662126.87 | -16281 | 277462.25 | 592029.5 | 1717746.2 | 1433600411 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
PY_total_expenses | 9 | 0.98 | 8278029.14 | 60609912.23 | -1005823 | 238535.75 | 550441.0 | 1421340.8 | 1360613119 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
PY_net | 13 | 0.98 | 653761.39 | 4686280.87 | -7299122 | -19803.75 | 22864.5 | 137345.2 | 72987292 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
CY_total_revenue | 0 | 1.00 | 8980391.21 | 67253585.89 | -33419 | 272757.00 | 603793.0 | 1825858.0 | 1529529583 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
CY_total_expenses | 9 | 0.98 | 8278029.14 | 60609912.23 | -1005823 | 238535.75 | 550441.0 | 1421340.8 | 1360613119 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
CY_net | 3 | 0.99 | 279235.25 | 3158780.06 | -25315580 | -35786.25 | 12086.0 | 111772.0 | 62328659 | <U+2581><U+2587><U+2581><U+2581><U+2581> |
CY_total_EOY_assets | 8 | 0.99 | 13331113.04 | 83460700.94 | -36495 | 299968.00 | 1025034.0 | 4225068.0 | 1695000215 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
CY_total_EOY_liabilities | 94 | 0.84 | 5643250.12 | 44897442.59 | 39 | 20406.00 | 120996.0 | 690764.0 | 723839777 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
CY_fund_balance | 134 | 0.78 | 5966779.33 | 49711368.60 | -3444185 | 176380.00 | 688895.0 | 2253933.5 | 1033709335 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
CY_total_salaries | 80 | 0.87 | 5650006.93 | 40412075.85 | 760 | 143793.00 | 324377.0 | 973829.0 | 832147624 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
CY_grants_contributions | 39 | 0.93 | 1036136.54 | 2952325.40 | 17 | 94915.75 | 272284.5 | 746145.0 | 40996621 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
CY_program_service_rev | 130 | 0.78 | 9805485.16 | 73336731.06 | 25 | 64017.00 | 262558.0 | 1024773.0 | 1477791395 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
# glimpse dataset
glimpse(vtnps)
## Rows: 597
## Columns: 26
## $ xml_num <dbl> 9, 25, 37, 38, 49, 57, 60, 62, 63, 72, 76, 90~
## $ organization <chr> "Clean Energy States Alliance Inc", "GUANACAS~
## $ category <chr> "environment_animals", "environment_animals",~
## $ tax_year <dbl> 2019, 2018, 2019, 2018, 2019, 2018, 2018, 201~
## $ form <chr> "990", "990", "990", "990", "990", "990", "99~
## $ address <chr> "50 State Street", "4780 MAIN ROAD", "PO Box ~
## $ city <chr> "Montpelier", "HUNTINGTON", "Charlotte", "Bro~
## $ website <chr> "www.cesa.org", "www.gdfcf.org", "www.lewiscr~
## $ ein <chr> "270029803", "943280315", "030349148", "03600~
## $ start_yr <dbl> 2002, 1997, 1995, 1915, 1992, 2000, 2016, 197~
## $ mission <chr> "Technical support to States' clean energy ag~
## $ principal_officer <chr> "Anthony Vargo", NA, "Martha Illick", "SUSAN ~
## $ employee_count <dbl> NA, NA, NA, 18, 31, NA, NA, 47, 19, 21, 16, N~
## $ volunteer_count <dbl> 14, 15, 74, NA, NA, NA, NA, 180, 16, 83, NA, ~
## $ PY_total_revenue <dbl> 1565197, 6007561, 156205, 458419, 7980411, 11~
## $ PY_total_expenses <dbl> 1395391, 2166001, 128464, 488587, 7726366, 71~
## $ PY_net <dbl> 169806, 3841560, 27741, -30168, 254045, 38863~
## $ CY_total_revenue <dbl> 1541453, 4011292, 226889, 393201, 7861211, 18~
## $ CY_total_expenses <dbl> 1395391, 2166001, 128464, 488587, 7726366, 71~
## $ CY_net <dbl> -17351, 1794998, 33768, -144257, -748228, 136~
## $ CY_total_EOY_assets <dbl> 1094126, 17111063, 95950, 3401790, 6736910, 2~
## $ CY_total_EOY_liabilities <dbl> 269797, 181243, NA, 7082, 1725796, NA, 22502,~
## $ CY_fund_balance <dbl> 824329, 16929820, NA, 3394708, 5011114, 22175~
## $ CY_total_salaries <dbl> 838077, 760634, 44213, 238038, 4465965, NA, 1~
## $ CY_grants_contributions <dbl> 1251720, 3534249, 224829, 224514, 6888145, 18~
## $ CY_program_service_rev <dbl> 284379, 271040, 1556, 69841, 871478, NA, 4881~
# plot history of start year with annual count -- w/o category
vtnps %>%
ggplot(aes(x=start_yr)) +
geom_histogram(alpha=0.6, binwidth = 1, fill="dark blue") +
labs( x='\nStart Year', y = 'Total Count', title = "Development of 501c3 Non-Profits in Vermont", subtitle="Mid-19th Century to Present")+
theme_minimal()+
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())+
theme(axis.text.x = element_text(angle = 45, hjust = 0, vjust = 1, size=10))
#profile organization start-yr and number by category
vtnps%>%
filter(start_yr>1850)%>%
count(start_yr,category) %>%
ggplot()+geom_col(aes(x=start_yr,y=n,fill=category))+
labs( x='\nStart Year', y = 'Total Count', title = "Development of 501c3 Non-Profits in Vermont", subtitle="Mid-19th Century to Present")
#density of start over time
ggplot(data=vtnps, aes(x=start_yr, group=category, fill=category)) +
geom_density(adjust=1.5, alpha=.3) +
labs( x='\nStart Year', y = 'Density', title = "Development of 501c3 Non-Profits in Vermont", subtitle="Mid-19th Century to Present")+
theme_ipsum()
# annual addition of new nonprofits since 1950
vtnps %>%
filter(start_yr>1950)%>%
ggplot(aes(x=start_yr, color=category, fill=category, alpha=.4)) +
geom_histogram(alpha=0.6, binwidth = 1) +
scale_fill_viridis(discrete=TRUE, option="D") +
scale_color_viridis(discrete=TRUE, option="D") +
labs( x='\nStart Year', y = 'Total Addition\n', title = "Annual Addition of 501c3 Non-Profits in Vermont", subtitle="1950 to Present")+
theme_minimal()+
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())+
facet_wrap(.~category, ncol=2)
#annual addition of new nonprofits since 2000
vtnps %>%
filter(start_yr>2000)%>%
ggplot(aes(x=start_yr, color=category, fill=category, alpha=.4)) +
geom_histogram(alpha=0.6, binwidth = 1) +
scale_fill_viridis(discrete=TRUE, option="D") +
scale_color_viridis(discrete=TRUE, option="D") +
labs( x='\nStart Year', y = 'Total Addition\n', title = "Annual Addition of 501c3 Non-Profits in Vermont", subtitle="2000 to Present")+
theme_minimal()+
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())+
theme(axis.text.x = element_text(angle = 45, hjust = 0, vjust = 1, size=10))
facet_wrap(~category, ncol=2)
## <ggproto object: Class FacetWrap, Facet, gg>
## compute_layout: function
## draw_back: function
## draw_front: function
## draw_labels: function
## draw_panels: function
## finish_data: function
## init_scales: function
## map_data: function
## params: list
## setup_data: function
## setup_params: function
## shrink: TRUE
## train_scales: function
## vars: function
## super: <ggproto object: Class FacetWrap, Facet, gg>
The following code chunk focuses on employee count data by category
#summary stats for employee count
vtnps$employee_count%>%summary()
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 10.0 18.0 30.0 164.3 92.0 8686.0 312
# Frequency of employee counts in organization
vtnps%>%
ggplot(aes(x=employee_count))+
geom_histogram(color="dark blue", fill="dark blue", alpha=.7)+
labs( x='\nTotal Employees in Organization', y = 'Frequency\n', title = "Employee Counts for 501c3 Non-Profits in Vermont")+
theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# quantiles for total employee counts
vtnps%>%group_by(category)%>%
group_by(category)%>%
mutate(category=fct_reorder(category, employee_count, .fun='median'))%>%
ggplot( aes(x=reorder(category, employee_count), y=employee_count)) +
geom_boxplot(fill="dark blue", alpha=.3, outlier.colour = "red")+
coord_flip()+
theme_minimal()+
theme(axis.text.x = element_text(angle = 45, hjust= 1, size=10))
# Top 20 employee counts by organization
vtnps%>%
top_n(20, employee_count) %>%
ggplot(aes(x=employee_count, y=reorder(organization, employee_count), fill=category))+
geom_col()+
scale_fill_manual(values = c("#C3D7A4","#4E84C4"))+
labs( x='\nTotal Employees in Organization', y = 'Organization\n', title = "Top 20 Employee Counts", subtitle= "501c3 Non-Profits in Vermont")+
theme_minimal()
# Limit analysis to organizations below 100 employees
vtnps%>%
filter(employee_count<100)%>%
ggplot(aes(x=employee_count)) +
geom_boxplot()+
theme_minimal()
#Distribution of employee counts <1000
vtnps%>%
filter(employee_count<1000)%>%
ggplot(aes(x=employee_count)) +
geom_bar(colour="blue", fill="dark blue", alpha=.3)+
labs( x="Number of Employees", y = "Number of Organizations\n", title = "Distribution of Employee Counts", subtitle="Organizations with 1000 or Fewer Employees")+
theme_minimal()+
theme(panel.grid = element_blank())
# quantiles for organizations with fewer than 1000 employees
vtnps%>%
group_by(category)%>%
filter(employee_count<1000)%>%
ggplot(aes(x=reorder(category, employee_count), y=employee_count))+
geom_boxplot(fill="dark blue", alpha=.3, outlier.colour = "red")+
labs( x="", y = "\n\nNumber of Employees", title = "Total Employees by 501c3 Category", subtitle="Organizations with 1000 or Fewer Employees")+
coord_flip()+
theme_minimal()+
theme(axis.text.x = element_text(angle = 45, hjust= 1, size=10))
The following code chunk focuses on salary data by category
# average salary by category --> excludes salaries over 200,000
vtnps%>%
mutate(avg_salary = CY_total_salaries/employee_count)%>%
filter(avg_salary<200000)%>%
group_by(category)%>%
mutate(category=fct_reorder(category, avg_salary, .fun='median'))%>%
ggplot( aes(x=reorder(category, avg_salary), y=avg_salary)) +
geom_boxplot(fill="dark green", alpha=.3, outlier.colour = "blue")+
labs( x="", y = "\n\nAverage Salary", title = "Average Salaries by 501c3 Category", subtitle="Salaries less than $200,000\n\n")+
coord_flip()+
theme_minimal()+
theme(axis.title.y = element_blank())+
theme(axis.text.x = element_text(angle = 45, hjust= 1, size=10))
The following code chunk highights selected financial data by category
#Ratio of liabilities assets >1 indicates an excess of debt -- the proportion of a company’s assets that are financed by debt. A ratio greater than 1 shows that a considerable portion of debt is funded by assets. In other words, the company has more liabilities than assets. A high ratio also indicates that a company may be putting itself at risk of default on its loans if interest rates were to rise suddenly.
vtnps%>%
mutate(liability_ratio = CY_total_EOY_liabilities/CY_total_EOY_assets)%>%
filter(liability_ratio<50)%>% #note, one outlier at 400+
ggplot(aes(x=category, y=liability_ratio))+
geom_point(color='darkblue')+
coord_flip()+
geom_hline(yintercept = 1, color="red")+
labs( x="501c3 Category", y = "\n\nLiability Ratio", title = "Liability Ratios by 501c3 Category\n", subtitle="A ratio >1 indicates an excess of debt over assets\n\n")+
theme_minimal()+
theme(panel.grid = element_blank())+
theme(axis.title.y = element_blank())
lratio<-vtnps%>%
mutate(liability_ratio = CY_total_EOY_liabilities/CY_total_EOY_assets)
labove<-length(which(lratio$liability_ratio > 1))
lbelow<-length(which(lratio$liability_ratio <= 1))
(lprop<-round((labove/lbelow)*100, 0))
## [1] 5
glue("The number of organizations with a liability >1 and <=1 is {labove} and {lbelow}, respectively. The latter represents {lprop}% of the total count.")
## The number of organizations with a liability >1 and <=1 is 22 and 480, respectively. The latter represents 5% of the total count.
#Ratio of revenue to expenses between yrs. net operating ratio of income to expenses, which tells you how efficiently the organization is using its money to fund operations
vtnps%>%
mutate(current_net=CY_total_revenue/CY_total_expenses)%>%
mutate(prior_net=PY_total_revenue/PY_total_expenses)%>%
mutate(net_annual_diff=current_net-prior_net)%>%
group_by(category)%>%
filter(current_net<100)%>%
ggplot(aes(x=category, net_annual_diff))+
geom_point()+
geom_hline(yintercept = 0, color="red")+
coord_flip()+
labs(x="", y= "Net Difference in Revenue:Expenses", title=("\n\nDifference in end-of-year revenue:expense"), subtitle=('Last two reporting years\n\n'))+
theme_minimal()+
theme(panel.grid = element_blank())+
theme(axis.title.y = element_blank())
#further comparison of revenue to expense
adiff<-vtnps%>%
mutate(current_net=CY_total_revenue/CY_total_expenses)%>%
mutate(prior_net=PY_total_revenue/PY_total_expenses)%>%
mutate(net_annual_diff=current_net-prior_net)
(adiff_a<-length(which(adiff$net_annual_diff > 0)))
## [1] 339
(adiff_b<-length(which(adiff$net_annual_diff <= 0)))
## [1] 249
(adiff_r<-round((adiff_b/adiff_a)*100, 0))
## [1] 73
a_diff<-vtnps%>%
mutate(current_net=CY_total_revenue/CY_total_expenses)%>%
mutate(prior_net=PY_total_revenue/PY_total_expenses)%>%
mutate(net_annual_diff=current_net-prior_net)
labove<-length(which(lratio$liability_ratio > 1))
lbelow<-length(which(lratio$liability_ratio <= 1))
(lprop<-round((labove/lbelow)*100, 0))
## [1] 5
#per capita net revenue by organization category
vtnps%>%
mutate(net=CY_total_revenue-CY_total_expenses)%>%
mutate(rev_person=net/employee_count)%>%
group_by(category)%>%
filter(net<500000)%>%
ggplot(aes(x=reorder(category, rev_person), y=rev_person))+
geom_boxplot()+
geom_hline(yintercept = 0, color="red")+
coord_flip()+
labs(x="Per Capita Net", y= "", title=("Current Year Per Capita Net Revenue\n"), subtitle=('Organizations with Net Revenue < $500,000\n\n'))+
theme_minimal()+
theme(panel.grid = element_blank())+
theme(axis.title.y = element_blank())
## Warning: Removed 282 rows containing non-finite values (stat_boxplot).
# preliminary comparison of NY revenue and employee metrics
vtnps%>%
filter(CY_total_salaries<200000000)%>%
ggplot(aes(x=employee_count, y=CY_total_revenue))+
geom_point()+
theme_minimal()
## Warning: Removed 232 rows containing missing values (geom_point).
vtnps%>%
filter(CY_total_salaries<200000000)%>%
ggplot(aes(x=CY_total_salaries, y=CY_total_revenue))+
geom_point()+
theme_minimal()
##Conclusions and Recommendations
This project will continue with addition work focused on filling in missing information as well as constructing a data-table that holds information from multiple 990 forms and multiple years (2013 to present). The latter steps will require further scripting and automation to streamline data processing steps. It may also be advantageous to construct a relational database to facilitate data exploration and analysis.
In addition, additional scripts will be required to effectively extract key financial information from 990 forms that were not included in this initial phase. These features should include the following for 501c3 metrics:
Other metrics include revenue reliability (across years) and revenue surplus.
The following chunk provides code for accessing 990 related information using Propublica’s API. In this context, select data is available via. query in JSON format.
res<-httr::GET(url)
restxt<-httr::content(res, as = ‘text’)
j <- fromJSON(restxt)
glimpse(j)
length(str(j$organizations[[1]]))
df_json2 <- fromJSON(json2)
str(j, nest.lev)
ny <- c() for (i in length(j)) { for (x in i) { print(x) } }